Token Economics: How to Cut LLM Cost Without Making Your Product Worse

Words you need

  • Token – A small piece of text the AI reads or writes (roughly a word or part of a word). You pay per token. Long answers and long context cost more.
  • LLM – Large Language Model. The AI that generates text. When we say "the model", we mean this.
  • RAG – Retrieval-Augmented Generation. The system fetches the right pieces of content first, then the AI answers using those pieces. More pieces in the prompt means more cost.
  • Chunk – A piece of a document you feed into the AI as context. "Retrieve 12 chunks" means you're putting 12 pieces in the prompt; each one costs tokens.
  • Embedding – A list of numbers that represents the meaning of a piece of text. Used to search for similar content. Computing and storing embeddings costs money; caching them saves.
  • Cache – A stored result you reuse instead of computing again. Cache key = the exact conditions (e.g. which prompt version, which model) that must match for the stored result to be valid.
  • Model routing – Sending an easy request to a smaller, cheaper model and only using the big, expensive model when needed.
  • Streaming – Sending the answer to the user piece by piece as it's generated, instead of waiting for the full answer. Improves perceived speed; can reduce retries and thus cost.

Most teams don't know the unit cost of their own product.

They ship a "helpful assistant", it becomes popular, and then finance shows up asking "what is this GPU bill?".

I've been there. Token economics is not a finance topic. It's an engineering topic. If you can't control token usage, you can't scale.

What you'll learn

  • Why the first cost lever is almost always capping output and tightening retrieval (not buying more GPUs).
  • What to cache (embeddings, retrieval results, answers) and why cache keys must include prompt, model, and retrieval version.
  • Model routing: when to use a smaller model first and escalate only when needed.
  • A simple "token budget" per mode (max prompt/output tokens, max chunks) so your system can enforce guardrails.

For the full map from concept to production, see the LLM Handbook series map. For how retrieval and chunk count affect cost, chunking and RAG at inference are the right levers.

The fastest cost reduction: cap output

Long answers are expensive.

Often unnecessary.

So start with:

  • default short answers
  • "expand" button for longer answers
  • strict max output tokens per mode

You'll be shocked by how often users prefer shorter, clearer responses.
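
A minimal sketch of that pattern, assuming you call your provider through a generic `generate` function (the names here are illustrative, not a specific SDK):

Ts
// Per-mode output caps; "expand" simply re-runs with the larger cap.
const OUTPUT_CAPS = {
  default: 300,   // short answer by default
  expanded: 1200, // only when the user explicitly asks for more
} as const;

async function answer(
  question: string,
  expand: boolean,
  generate: (prompt: string, maxOutputTokens: number) => Promise<string>,
): Promise<string> {
  const cap = expand ? OUTPUT_CAPS.expanded : OUTPUT_CAPS.default;
  return generate(question, cap); // the provider call enforces the output cap
}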

Retrieval is a cost lever too

RAG increases prompt length.

If you retrieve 12 chunks every time, you're paying for 12 chunks every time.

So make retrieval smarter:

  • filter by metadata before vector search
  • retrieve fewer candidates
  • rerank and keep only the best
  • remove duplicate or near-identical chunks so you don't pay for the same content twice

RAG quality and cost are tied. Sloppy retrieval is expensive and wrong.
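
As a rough sketch, that retrieval pipeline could look like this; `vectorSearch`, `rerank`, the filter, and the near-duplicate check are placeholders for whatever your stack provides:

Ts
interface Chunk { id: string; text: string; score: number; docType: string; }

async function retrieveForPrompt(
  query: string,
  vectorSearch: (q: string, filter: { docType: string }, k: number) => Promise<Chunk[]>,
  rerank: (q: string, chunks: Chunk[]) => Promise<Chunk[]>,
): Promise<Chunk[]> {
  // 1. Filter by metadata before vector search (smaller, cheaper candidate set).
  const candidates = await vectorSearch(query, { docType: "faq" }, 20);

  // 2. Rerank and keep only the best.
  const ranked = await rerank(query, candidates);

  // 3. Drop near-identical chunks so the same content isn't paid for twice.
  const seen = new Set<string>();
  const deduped = ranked.filter((c) => {
    const key = c.text.slice(0, 80).toLowerCase(); // crude near-duplicate key
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });

  return deduped.slice(0, 4); // fewer, better chunks in the prompt
}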

Key takeaway: Fewer, better chunks in the prompt cut cost and often improve answers. Filter and rerank before you add more GPUs.

Cache the right things

Caching isn't only "response caching".

You can cache:

  • embeddings for repeated docs
  • retrieval results for repeated queries
  • final answers for repeated questions

But be careful: caching without versioning is how you serve old, wrong answers confidently. If you change the prompt or the retrieval rules and don't change the cache key, users get the previous answer.

Cache keys should include:

  • prompt version
  • retrieval policy version
  • model ID

If you can't do that, don't cache. You'll create "ghost bugs": bugs that only happen when the cache returns an answer that no longer matches the current system.
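
A minimal sketch of a versioned cache key; the version strings are placeholders for your own config:

Ts
import { createHash } from "node:crypto";

// Anything that could change the answer belongs in the key.
interface CacheKeyParts {
  promptVersion: string;    // e.g. "prompt-v7"
  retrievalVersion: string; // e.g. "retrieval-policy-3"
  modelId: string;
  normalizedQuery: string;
}

function cacheKey(parts: CacheKeyParts): string {
  const raw = [
    parts.promptVersion,
    parts.retrievalVersion,
    parts.modelId,
    parts.normalizedQuery.trim().toLowerCase(),
  ].join("|");
  return createHash("sha256").update(raw).digest("hex");
}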

Key takeaway: Cache keys must include everything that could change the answer (prompt version, model, retrieval version). Otherwise you're saving money at the cost of wrong answers.

Use model routing (cheap first, expensive only when needed)

This is a production pattern that saves money fast:

  • try a smaller/cheaper model for easy prompts
  • escalate to the bigger model when confidence is low

Confidence can be estimated by:

  • retrieval score
  • output validation failures
  • self-check prompts (careful, adds tokens)

This is not "AI magic". It's just tiered services: cheap tier first, expensive tier only when needed.
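
In code, the tiered pattern could look like the sketch below; the confidence threshold and the model functions are stand-ins for your own:

Ts
interface Draft { text: string; retrievalScore: number; passesValidation: boolean; }

async function routedAnswer(
  query: string,
  cheapModel: (q: string) => Promise<Draft>,
  expensiveModel: (q: string) => Promise<string>,
): Promise<string> {
  const draft = await cheapModel(query);

  // Cheap confidence signals: retrieval score and output validation.
  const confident = draft.retrievalScore >= 0.75 && draft.passesValidation;
  if (confident) return draft.text;

  // Escalate only when the cheap tier looks shaky.
  return expensiveModel(query);
}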

Streaming is UX and cost control

Streaming doesn't always reduce cost, but it reduces how long users feel they're waiting (perceived latency).

Perceived latency affects:

  • how long users wait
  • how many retries they do
  • how much they spam the button

And retries are a cost multiplier.

So a better UX can literally reduce spend.

Log token usage like a first-class metric

Every response should log: prompt tokens, output tokens, retrieval chunk count, latency (how long the request took), model ID, and feature or mode.
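
A sketch of that per-response record, with field names that are illustrative rather than a standard:

Ts
interface TokenUsageRecord {
  requestId: string;
  feature: string;        // which feature or mode produced this response
  modelId: string;
  promptTokens: number;
  outputTokens: number;
  retrievedChunks: number;
  latencyMs: number;      // how long the request took
  timestamp: string;      // ISO 8601
}

function logTokenUsage(record: TokenUsageRecord): void {
  // Emit structured JSON so usage can be aggregated per feature, cohort, or prompt version.
  console.log(JSON.stringify({ type: "token_usage", ...record }));
}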

Then you can ask:

  • which feature is burning cost?
  • which user cohort is expensive?
  • which prompt version increased tokens?

This is how you go from "we feel it's expensive" to "this endpoint is 62% of spend".

A simple "token budget" per mode

I like budgets that are explicit. You don't have to write this yourself yet; it's what the idea looks like in code:

Ts
export interface ModeBudget {
  mode: "chat" | "grounded_answer" | "rewrite"; // which product mode this budget governs
  maxPromptTokens: number; // cap on everything you send: system prompt, context, user input
  maxOutputTokens: number; // cap on what the model is allowed to generate
  maxChunks: number;       // cap on retrieved chunks allowed into the prompt
}

Each mode (e.g. chat vs grounded answer) gets a max for prompt tokens, output tokens, and chunks. Now your system can enforce it. Budgets are not limitations. They're guardrails.
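
Enforcing it at request time could look roughly like this; `enforceBudget` is a hypothetical helper, and `countTokens` stands in for whatever tokenizer you already use:

Ts
function enforceBudget(
  budget: ModeBudget,
  chunks: string[],
  countTokens: (text: string) => number,
): { chunks: string[]; maxOutputTokens: number } {
  // Never put more chunks in the prompt than the budget allows.
  let kept = chunks.slice(0, budget.maxChunks);

  // Drop trailing chunks until the prompt fits under the token cap.
  while (kept.length > 0 && countTokens(kept.join("\n")) > budget.maxPromptTokens) {
    kept = kept.slice(0, -1);
  }

  return { chunks: kept, maxOutputTokens: budget.maxOutputTokens };
}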

Try this: Measure your current token usage distribution. Then make one change: reduce retrieved chunks, cap output tokens, or add caching with versioned keys. Pick one, ship it, measure again.

If you had to cut your LLM bill by 40% this month, what would you change first: retrieval size, output length, or model routing?
