Foundations 5 min

Tokens, models, and what it actually costs

How AI is priced, where your money goes, and how to cap your spend before it surprises you.

AI assistants are not free. Chat apps come with monthly subscriptions; APIs and agents charge per use. Three concepts let you predict and control your bill.

Tokens — what you're actually paying for

Models don't think in words; they think in tokens. A token is roughly a word or a chunk of one. "Hello" is typically one token. "Onomatopoeia" is around four. Exact counts depend on the model's tokenizer. English text averages ~0.75 words per token (about 4 tokens for every 3 words). Code averages ~3 characters per token.
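The rules of thumb above can be turned into a rough estimator. This is a minimal sketch that assumes the ratios quoted in this section; `estimate_tokens` is a hypothetical helper, not a real tokenizer, and actual counts will differ from model to model.

```python
import math


def estimate_tokens(text: str, is_code: bool = False) -> int:
    """Rough token estimate using this section's rules of thumb.

    Assumptions (heuristics only, not a real tokenizer):
      - English prose: ~0.75 words per token
      - Code: ~3 characters per token
    """
    if not text:
        return 0
    if is_code:
        return math.ceil(len(text) / 3)
    words = len(text.split())
    return math.ceil(words / 0.75)


prose = "The quick brown fox jumps over the lazy dog"
snippet = "for i in range(10): print(i)"

print(estimate_tokens(prose))                  # 9 words / 0.75 = 12
print(estimate_tokens(snippet, is_code=True))  # 28 characters / 3, rounded up = 10
```

For billing-grade numbers, use the token counts your provider reports for each request rather than an estimate like this.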

Every request charges you for two things:

Input tokens. Your prompt + any history + system prompt + files.
Output tokens. The reply the model writes back.

Output is usually 4–5× more expensive than input per token.
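The arithmetic behind a single request is simple enough to sketch. The function name and the example prices below are illustrative assumptions (prices are quoted per 1M tokens, as providers usually do); substitute the current numbers from your provider's pricing page.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Dollar cost of one request. Prices are USD per 1M tokens."""
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000


# Illustrative prices only: $3 per 1M input tokens, $15 per 1M output tokens.
cost = request_cost(input_tokens=2_000, output_tokens=500,
                    input_price_per_m=3.0, output_price_per_m=15.0)
print(f"${cost:.4f}")  # (2000 * 3 + 500 * 15) / 1,000,000 = $0.0135
```

In this example the reply is a quarter the length of the prompt, yet it accounts for more than half of the cost.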

The three pricing tiers

Tier (example models)	Use for	Rough cost per 1M tokens, USD (input / output)
Cheap & fast (Haiku, GPT-mini, Gemini Flash)	Routine work, classification, simple lookups, summarisation	$0.25–1 / $1–4
Balanced (Sonnet, GPT, Gemini Pro)	Default for coding CLIs, agents, writing	$3–5 / $15–20
Heavy reasoning (Opus, GPT-large, Gemini Ultra)	Architecture, research, gnarly debugging	$15 / $75

Pricing moves fast — these are 2026 ballparks. Check each provider's pricing page for current numbers.
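To see how much the tier choice matters, the sketch below prices the same request on each tier. The `TIERS` dictionary simply copies the ballpark ranges from the table above; the tier keys and the `cost_range` helper are hypothetical names chosen for illustration.

```python
# Ballpark USD prices per 1M tokens, copied from the tier table above.
# Each entry is ((input_low, input_high), (output_low, output_high)).
TIERS = {
    "cheap": ((0.25, 1.0), (1.0, 4.0)),
    "balanced": ((3.0, 5.0), (15.0, 20.0)),
    "heavy": ((15.0, 15.0), (75.0, 75.0)),
}


def cost_range(tier: str, input_tokens: int, output_tokens: int) -> tuple[float, float]:
    """Return the (low, high) dollar cost of one request on the given tier."""
    (in_low, in_high), (out_low, out_high) = TIERS[tier]
    low = (input_tokens * in_low + output_tokens * out_low) / 1_000_000
    high = (input_tokens * in_high + output_tokens * out_high) / 1_000_000
    return low, high


# Same request (500 tokens in, 500 tokens out) priced on every tier.
for tier in TIERS:
    low, high = cost_range(tier, input_tokens=500, output_tokens=500)
    print(f"{tier:>8}: ${low:.4f} to ${high:.4f}")
```

With these ballparks, a 500-in / 500-out request comes to well under a cent on the cheap tier, about a cent on the balanced tier, and about 4.5 cents on the heavy tier.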

What workflows actually cost

Workflow	What it involves	Rough cost
One chat reply	~500 tokens in, ~500 out	≈ $0.001–0.01
A coding-CLI session	30 min, lots of file reads	≈ $0.50–3
Cloud agent (one PR)	Reads repo, edits files, runs tests	≈ $1–10
Background daemon (1 day)	Scheduled tasks every hour	≈ $5–50
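Recurring workloads are where small per-run costs add up. The sketch below projects a daily bill for an hourly background task. The hourly schedule comes from the daemon estimate above; the token counts per run and the prices are made-up inputs for illustration, and `daily_cost` is a hypothetical helper.

```python
def daily_cost(runs_per_day: int, input_tokens_per_run: int,
               output_tokens_per_run: int,
               input_price_per_m: float, output_price_per_m: float) -> float:
    """Projected USD cost per day for a task that runs on a schedule."""
    per_run = (input_tokens_per_run * input_price_per_m
               + output_tokens_per_run * output_price_per_m) / 1_000_000
    return runs_per_day * per_run


# Assumed workload: one run per hour, 50k tokens read and 5k written per run,
# at illustrative prices of $3 / $15 per 1M tokens.
per_day = daily_cost(runs_per_day=24,
                     input_tokens_per_run=50_000,
                     output_tokens_per_run=5_000,
                     input_price_per_m=3.0,
                     output_price_per_m=15.0)
print(f"per day:   ${per_day:.2f}")       # 24 * $0.225 = $5.40
print(f"per month: ${per_day * 30:.2f}")  # about $162 over 30 days
```

Under these assumptions the daemon sits at the low end of the daily range above, but still adds up to well over $100 a month.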

Three ways to spend less

Prompt caching. If you send the same context (system prompt, project files) repeatedly, modern APIs can cache it and bill the repeated input at a steep discount, sometimes 80% or more. Most coding CLIs do this automatically; verify by checking your usage dashboard.
Model routing. Use the cheap tier for routine work, escalate to balanced for hard problems, reserve heavy reasoning for the rare gnarly thing. A good tool routes for you; otherwise pick deliberately.
Hard spend caps. Most providers' dashboards let you set a monthly limit. Set one before you start. Treat it as a smoke alarm.
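The first two ideas can be sketched in a few lines. Everything here is an assumption for illustration: the 80% discount mirrors the figure quoted above (real discounts vary by provider, and some charge extra to write to the cache, which this sketch ignores), and `pick_tier` is a hypothetical router with hard-coded difficulty labels.

```python
def input_cost_with_cache(cached_tokens: int, fresh_tokens: int,
                          input_price_per_m: float,
                          cache_discount: float = 0.8) -> float:
    """USD input cost when part of the prompt is served from cache.

    cache_discount is the fraction taken off the normal input price for
    cached tokens (0.8 means cached tokens cost 20% of the usual price).
    """
    cached_price = input_price_per_m * (1 - cache_discount)
    return (cached_tokens * cached_price
            + fresh_tokens * input_price_per_m) / 1_000_000


def pick_tier(difficulty: str) -> str:
    """Map a task's difficulty label to a pricing tier.

    Unknown labels fall back to the balanced tier.
    """
    routes = {"routine": "cheap", "hard": "balanced", "gnarly": "heavy"}
    return routes.get(difficulty, "balanced")


# 40k tokens of repeated project context plus 1k tokens of new prompt,
# at an illustrative $3 per 1M input tokens.
without_cache = input_cost_with_cache(40_000, 1_000, 3.0, cache_discount=0.0)
with_cache = input_cost_with_cache(40_000, 1_000, 3.0)
print(f"no cache: ${without_cache:.3f}, cached: ${with_cache:.3f}")
print(pick_tier("routine"), pick_tier("gnarly"))
```

In this example caching cuts the input cost from about 12 cents to about 3 cents per request, because almost all of the prompt is repeated context.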

Where to check your spend

Provider	Where
Anthropic	console.anthropic.com → Usage
OpenAI	platform.openai.com → Usage
Google	aistudio.google.com → Billing
OpenRouter	openrouter.ai/activity — per model, per request

The math you actually need. Chat: pennies per message. Coding CLI: dollars per session. Always-on agent: tens of dollars per day. Set a monthly cap that's uncomfortable if hit — so an accidental runaway never costs you more than that.
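If you run your own scripts or agents, a client-side guard can back up the dashboard cap. The sketch below is a hypothetical in-memory tracker, not any provider's API: it refuses a charge that would push spend past the cap, and it does not persist state or reset at the start of a month.

```python
class SpendCapExceeded(RuntimeError):
    """Raised when a charge would push spend past the monthly cap."""


class SpendGuard:
    """Tracks estimated spend in memory and enforces a hard cap."""

    def __init__(self, monthly_cap_usd: float) -> None:
        self.monthly_cap_usd = monthly_cap_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record a cost, or raise without recording if it would exceed the cap."""
        if self.spent_usd + cost_usd > self.monthly_cap_usd:
            raise SpendCapExceeded(
                f"cap ${self.monthly_cap_usd:.2f} would be exceeded "
                f"(spent ${self.spent_usd:.2f}, next charge ${cost_usd:.2f})"
            )
        self.spent_usd += cost_usd


guard = SpendGuard(monthly_cap_usd=1.00)
guard.charge(0.60)       # accepted
try:
    guard.charge(0.60)   # would bring the total to $1.20, so it is refused
except SpendCapExceeded as err:
    print(f"stopped: {err}")
```

Call `charge` with the estimated cost before sending each request, so a runaway loop stops at the cap instead of at the end of the billing period.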

Why is output so much more expensive than input?

Input tokens can be batched and processed in parallel in a single pass; output tokens are generated one at a time, and each new token requires another pass through the model. The economics of GPU compute reflect that asymmetry.

What about flat-rate subscriptions (Claude.ai Pro, ChatGPT Plus)?

Different pricing model. You pay a fixed monthly fee for human-scale chat usage and never see token counts. A subscription is cheaper if you're a heavy individual user; the API is cheaper for programmatic or agent use.
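A quick break-even calculation shows where "heavy individual user" begins. Both inputs below are assumptions: the $20 monthly fee is a placeholder for whatever your subscription costs, and the $0.01 per message is the upper end of the chat-reply estimate earlier in this section. `breakeven_messages` is a hypothetical helper.

```python
def breakeven_messages(subscription_usd_per_month: float,
                       api_cost_per_message_usd: float) -> float:
    """Messages per month at which API spend equals the flat subscription fee."""
    if api_cost_per_message_usd <= 0:
        raise ValueError("api_cost_per_message_usd must be positive")
    return subscription_usd_per_month / api_cost_per_message_usd


# Placeholder inputs: a $20/month plan versus roughly $0.01 per API message.
messages = breakeven_messages(20.0, 0.01)
print(f"break-even at about {messages:.0f} messages per month")
```

Under these assumptions the subscription pays off at around 2,000 messages a month. Long conversations reach that point sooner, because the history is resent as input tokens and each message costs more than the first.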

What about local / open models?

Free per token once you've paid for the hardware (and the electricity to run it). Llama, Mistral, Qwen, and DeepSeek all have downloadable weights. Quality is catching up but still lags the frontier closed models on hard tasks. Privacy and near-zero marginal cost are the wins.