The A.I. Beat

Dispatches from the frontier of machine intelligence
MECHANICS · April 17, 2026 · 7 min read

Tokens, Context Windows, and Temperature — The Mechanics Behind Every LLM Call

A technical but accessible guide to the core parameters that govern LLM behavior: tokenization, context limits, temperature, top-p, and pricing. With real numbers and visual explanations.
“Every API call is a transaction measured in tokens. Understanding the unit of currency is not optional: it is the difference between a $50 monthly bill and a $5,000 one.”

Every interaction with a large language model — whether you are chatting with Claude, calling the OpenAI API, or running Llama locally — is governed by a small set of parameters that most users never examine. Tokens determine what you pay. Context windows determine what the model can see. Temperature determines how it chooses its words.

These are not abstract concepts. They are the engineering constraints that shape every AI product you use, and they explain most of the “weird behavior” people encounter. Here is how they actually work.

Tokens: The Atomic Unit of LLMs

Language models do not process text character by character, and they do not process it word by word. They use tokens — subword units generated by a tokenization algorithm, almost always a variant of Byte Pair Encoding (BPE).

BPE works by starting with individual characters (or bytes), then iteratively merging the most frequent adjacent pairs into new tokens. After tens of thousands of merges, you end up with a vocabulary that typically ranges from about 32,000 to 200,000 token types and efficiently encodes common patterns.
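The merge loop described above can be sketched in a few lines. This is a toy, character-level version trained on a six-word list, not a production tokenizer; the function name `learn_bpe_merges` and the sample words are illustrative assumptions.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy, character-level)."""
    # Each word becomes a tuple of symbols, weighted by how often it occurs.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with that pair fused into a single symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus

sample_words = ["low", "low", "lower", "lowest", "newest", "newest"]
merges, corpus = learn_bpe_merges(sample_words, 3)
print(merges)
```

After two merges the frequent word "low" has already collapsed into a single symbol, which is the same effect that makes common words cost one token in a real vocabulary.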

The tokenization is not intuitive. “Hello” is one token. “Unbelievable” is three (“Un”, “believ”, “able”). A space before a word is often part of that word’s token. Numbers are especially tricky — “123456” might be tokenized as [“123”, “456”] or [“12”, “345”, “6”] depending on the tokenizer, which partly explains why LLMs struggle with arithmetic.

The rule of thumb: 1 token is roughly 3/4 of an English word, or about 4 characters. A 1,000-word article is approximately 1,300 tokens. A full novel (80,000 words) is about 107,000 tokens.
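The rule of thumb translates directly into a back-of-the-envelope estimator. This sketch assumes the 3/4-word-per-token ratio quoted above; real counts depend on the tokenizer and the language, so treat the result as a rough budget rather than a bill.

```python
WORDS_PER_TOKEN = 0.75  # rule-of-thumb ratio for English; an assumption, not a measurement

def estimate_tokens_from_words(word_count):
    """Estimate the token count for English text from its word count."""
    return round(word_count / WORDS_PER_TOKEN)

article_tokens = estimate_tokens_from_words(1_000)
novel_tokens = estimate_tokens_from_words(80_000)
print(article_tokens, novel_tokens)
```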

Why Tokenization Matters for Your Wallet

Every commercial LLM API charges per token, and the economics are not symmetric — generating tokens (output) costs more than reading them (input), because generation requires running the full model forward pass for each token sequentially, while input tokens can be processed in parallel.

Here is what the major providers charge as of early 2026:

Model | Input (per 1M tokens) | Output (per 1M tokens)
GPT-4o | $2.50 | $10.00
GPT-4o mini | $0.15 | $0.60
Claude 3.5 Sonnet | $3.00 | $15.00
Claude 3.5 Haiku | $0.80 | $4.00
Gemini 1.5 Pro | $1.25 | $5.00
Llama 3.1 405B (via Together) | $3.50 | $3.50
DeepSeek V3 (API) | $0.27 | $1.10

These differences compound fast. A customer support application processing 10 million tokens per day — not unusual for a mid-size company — would pay $25/day with GPT-4o input, but $2.70/day with DeepSeek. Over a year, that is the difference between $9,125 and $986. Choosing your model and optimizing your prompts is not an engineering nicety; it is a business decision.
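The comparison above is simple arithmetic, and it is worth scripting so you can plug in your own volumes. The sketch below copies three rows from the price table; the dictionary name and the `daily_cost` helper are illustrative, and the prices should be rechecked against each provider before you rely on them.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES_PER_MILLION = {
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "DeepSeek V3": (0.27, 1.10),
}

def daily_cost(model, input_tokens, output_tokens=0):
    """Cost in USD for one day's traffic on the given model."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

gpt4o_daily = daily_cost("GPT-4o", 10_000_000)
deepseek_daily = daily_cost("DeepSeek V3", 10_000_000)
print(f"GPT-4o:      ${gpt4o_daily:.2f}/day, ${gpt4o_daily * 365:,.0f}/year")
print(f"DeepSeek V3: ${deepseek_daily:.2f}/day, ${deepseek_daily * 365:,.0f}/year")
```

Passing a nonzero `output_tokens` shows how quickly the asymmetric output price comes to dominate for generation-heavy workloads.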

Context Windows: The Model’s Working Memory

The context window is the total number of tokens the model can process in a single call — your input (system prompt + conversation history + current message) plus its generated output, combined. It is a hard ceiling, not a suggestion.
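Because input and output share one budget, applications usually reserve room for the reply and then trim the conversation history to fit. The following is a minimal sketch of that idea, assuming a crude 4-characters-per-token estimate in place of a real tokenizer; `trim_history` and its parameters are hypothetical names, not any provider's API.

```python
def estimate_tokens(text):
    # Crude assumption: about 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)

def trim_history(system_prompt, messages, context_window, max_output_tokens):
    """Keep the newest messages that fit after reserving the system prompt and the reply."""
    budget = context_window - max_output_tokens - estimate_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if cost > budget:
            break  # this message and everything older is dropped
        kept.append(message)
        budget -= cost
    return list(reversed(kept))  # restore chronological order

system_prompt = "x" * 40                      # about 10 tokens
history = ["a" * 400, "b" * 400, "c" * 400]   # about 100 tokens each
fitted = trim_history(system_prompt, history, context_window=300, max_output_tokens=80)
print(len(fitted), "of", len(history), "messages kept")
```

Dropping the oldest messages first is only one policy; summarizing older turns is a common alternative.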

The progression has been dramatic. GPT-3.5’s 4,096-token window could hold roughly 3,000 words, or a few pages. Claude 3’s 200K window holds about 150,000 words, an entire book. Google’s Gemini 1.5 Pro accepts 1 million tokens, enough for approximately 750,000 words, or roughly nine novels of 80,000 words each.

Why Context Windows Have Limits

The limit is not arbitrary — it stems from the self-attention mechanism at the heart of the transformer architecture. Standard self-attention has O(n^2) computational complexity, where n is the sequence length. Doubling the context window quadruples the computation in the attention layers.

A model processing 4,096 tokens computes roughly 16.8 million attention interactions per layer. At 128K tokens, that number balloons to 16.4 billion per layer. At 1M tokens, it is 1 trillion.
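These figures are just the square of the sequence length, which a few lines make easy to check. The sketch counts query-key pairs for full (non-sparse) attention in a single layer and ignores heads and causal masking, which change the constant but not the quadratic growth.

```python
def attention_interactions(sequence_length):
    """Query-key pairs scored per layer by full self-attention: n * n."""
    return sequence_length * sequence_length

for n in (4_096, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_interactions(n):,} interactions")

# Doubling the sequence length quadruples the work.
doubling_ratio = attention_interactions(8_192) / attention_interactions(4_096)
```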

This is why researchers have developed techniques like:

  • Sliding window attention (used in Mistral) — each token only attends to a fixed window of nearby tokens, reducing complexity to O(n).
  • Grouped-query attention (GQA) — shares key/value heads across multiple query heads, reducing the KV cache memory by 4-8x. Used in Llama 3 and most modern models.
  • Ring attention — distributes the context across multiple GPUs, with each GPU handling a segment and passing information in a ring topology.
  • KV cache quantization — compresses the stored key/value pairs from 16-bit to 8-bit or 4-bit, halving or quartering memory requirements.
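To see why grouped-query attention and KV cache quantization matter, it helps to compute the cache size directly. The sketch below uses a hypothetical model shape (32 layers, 32 query heads, head dimension 128, 8,192-token sequence); those numbers are assumptions chosen for illustration, not the specification of any particular model.

```python
def kv_cache_bytes(seq_len, layers, kv_heads, head_dim, bits_per_value):
    """Memory for cached keys and values for one sequence (the 2 covers K and V)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bits_per_value // 8

GIB = 1024 ** 3
SEQ_LEN, LAYERS, HEAD_DIM = 8_192, 32, 128  # hypothetical model shape

full_mha = kv_cache_bytes(SEQ_LEN, LAYERS, kv_heads=32, head_dim=HEAD_DIM, bits_per_value=16)
with_gqa = kv_cache_bytes(SEQ_LEN, LAYERS, kv_heads=8, head_dim=HEAD_DIM, bits_per_value=16)
gqa_int8 = kv_cache_bytes(SEQ_LEN, LAYERS, kv_heads=8, head_dim=HEAD_DIM, bits_per_value=8)

print(f"One KV head per query head, 16-bit: {full_mha / GIB:.2f} GiB")
print(f"GQA with 8 KV heads, 16-bit:        {with_gqa / GIB:.2f} GiB")
print(f"GQA with 8 KV heads, 8-bit:         {gqa_int8 / GIB:.2f} GiB")
```

Sharing 8 key/value heads across 32 query heads cuts the cache by 4x, and dropping from 16-bit to 8-bit halves it again, matching the ratios in the list above.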

The “Lost in the Middle” Problem

Having a large context window does not mean the model uses all of it equally well. A 2023 Stanford/Berkeley paper (“Lost in the Middle”) showed that models perform best when relevant information is at the beginning or end of the context, and worst when it is buried in the middle. This has improved with each model generation, but it is not fully solved — if you are building RAG applications, put the most relevant retrieved documents first.
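In a RAG pipeline, acting on this advice can be as simple as sorting retrieved documents by relevance score before assembling the prompt. The sketch assumes your retriever returns (text, score) pairs with higher scores meaning more relevant; the function name and sample documents are made up for illustration.

```python
def order_for_prompt(scored_documents):
    """Return document texts sorted so the highest relevance score comes first."""
    ranked = sorted(scored_documents, key=lambda pair: pair[1], reverse=True)
    return [text for text, _score in ranked]

retrieved = [
    ("Pipe thread sizes", 0.41),
    ("Fixing a burst pipe", 0.93),
    ("Choosing a plumber", 0.67),
]
ordered = order_for_prompt(retrieved)
context_block = "\n\n".join(ordered)  # most relevant text sits at the start of the context
```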

Temperature: Controlling Randomness

When a model generates text, it does not pick one word and move on. At each step, it computes a probability distribution over its entire vocabulary — every possible next token gets a probability. Temperature controls the shape of that distribution before sampling.

The math is simple. Before sampling, the model’s raw output scores (logits) are divided by the temperature value, then passed through softmax. Temperature = 1.0 leaves the distribution unchanged. Lower values sharpen it (making the top choice more dominant). Higher values flatten it (giving unlikely tokens a better chance).
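Here is that calculation as a small standard-library sketch. The three logits are made-up values for illustration; a real model produces one logit per vocabulary entry.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; 0 is handled as plain argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # illustrative scores for three candidate tokens
sharp = softmax_with_temperature(logits, 0.5)
neutral = softmax_with_temperature(logits, 1.0)
flat = softmax_with_temperature(logits, 2.0)

for name, probs in (("T=0.5", sharp), ("T=1.0", neutral), ("T=2.0", flat)):
    print(name, [round(p, 3) for p in probs])
```

The ranking of tokens never changes with temperature; only the gap between the top choice and the rest does.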

Temperature 0 (or near-zero): The model always picks the highest-probability token (greedy decoding). Outputs are close to deterministic and tend toward repetition, though hosted APIs can still show small run-to-run variation. Use this for factual questions, data extraction, and classification: anywhere creativity is a liability.

Temperature 0.5-0.7: The sweet spot for most applications. Enough randomness to avoid robotic repetition, enough constraint to stay coherent. Most chat products default to this range.

Temperature 1.0+: The model takes risks. Good for brainstorming, creative writing, generating diverse options. Above 1.5, outputs often become incoherent.

Top-p (Nucleus Sampling)

Top-p provides a different knob for controlling randomness. Instead of rescaling the entire distribution, it truncates it. With top-p = 0.9, the model sorts tokens by probability, takes the smallest set whose cumulative probability reaches 90%, and samples only from those tokens.

The effect: top-p prevents the model from ever choosing extremely improbable tokens (the long tail), while still allowing variety among the plausible options. Temperature adjusts the shape; top-p adjusts the cutoff.
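The truncation step can be sketched as follows. The function names and the four-token distribution are illustrative assumptions; the logic is the sort, accumulate, cut, and renormalize procedure described above.

```python
import random

def top_p_filter(probs, top_p):
    """Return {token_index: renormalized probability} for the nucleus."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    for index, p in ranked:
        nucleus.append((index, p))
        cumulative += p
        if cumulative >= top_p:  # smallest set whose mass reaches top_p
            break
    return {index: p / cumulative for index, p in nucleus}

def sample_top_p(probs, top_p, rng=random):
    """Sample one token index from the truncated, renormalized distribution."""
    nucleus = top_p_filter(probs, top_p)
    return rng.choices(list(nucleus), weights=list(nucleus.values()), k=1)[0]

probs = [0.5, 0.3, 0.15, 0.05]  # illustrative next-token probabilities
nucleus = top_p_filter(probs, 0.9)
print(nucleus)
print(sample_top_p(probs, 0.9))
```

With top-p = 0.9 the 5% token is cut entirely, and the remaining three are rescaled so their probabilities again sum to 1.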

Most APIs let you set both, but in practice you should tune one and leave the other at its default. Setting both simultaneously makes the interaction between them hard to predict.

System Prompts: The Hidden Instructions

The system prompt is a message injected at the beginning of the context, before the user’s input, that sets the model’s behavior. It is part of the reason ChatGPT introduces itself as “ChatGPT” and refuses certain requests, and why Claude says “I’d be happy to help,” although much of that behavior is also instilled during training.

System prompts consume tokens from your context window. Anthropic’s default system prompt for Claude is approximately 1,200 tokens. If you are building an application with a detailed system prompt (persona, rules, examples, few-shot demonstrations), this can easily reach 2,000-4,000 tokens — a meaningful chunk of a smaller context window.

A useful optimization: move static context (documentation, rules) into the system prompt rather than repeating it in every user message. Many providers cache a repeated prompt prefix across calls, which reduces latency and cost.

Hallucinations: The Confidence Problem

Hallucination is not a bug in the traditional sense — it is a direct consequence of how these models work. The model generates the highest-probability continuation of the text. If the training data contained no information about a topic, the model does not say “I don’t know” by default (that behavior must be trained in via RLHF). Instead, it generates the most plausible-sounding text, which may be factually wrong.

Common hallucination patterns:

  • Citation fabrication — generating realistic-looking but nonexistent paper titles, authors, and DOIs
  • Confident numerical errors — stating a specific (wrong) number with no hedging
  • Plausible API methods — inventing function signatures that look right but do not exist
  • Temporal confusion — mixing up dates, attributing events to the wrong year

The rate has decreased substantially with each model generation: GPT-4 hallucinated significantly less than GPT-3.5, and Claude 3.5 Sonnet less than Claude 3 Opus. But the rate is not zero, and for applications where factual accuracy is critical (medical, legal, financial), independent verification remains essential.

Embeddings: Meaning as Geometry

Embeddings convert text into dense numerical vectors, typically a few hundred to a few thousand dimensions (1,536 and 3,072 for OpenAI’s current embedding models), where geometric proximity corresponds to semantic similarity. “King” and “queen” are close together. “King” and “refrigerator” are far apart.

This enables semantic search: instead of matching keywords, you match meaning. A search for “how to fix a broken pipe” will match documents about plumbing repairs even if they never use the words “fix” or “broken.”
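Proximity between embeddings is usually measured with cosine similarity, and semantic search is a ranking by that score. The sketch below uses hand-made 3-dimensional vectors purely for illustration; real embeddings come from an embedding model and have far more dimensions, and the `search` helper is a hypothetical name.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vector, documents):
    """Rank (title, vector) pairs by similarity to the query, best match first."""
    return sorted(documents, key=lambda doc: cosine_similarity(query_vector, doc[1]), reverse=True)

# Toy vectors invented for this example, not output from any real model.
vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "refrigerator": [0.1, 0.05, 0.95],
}
royal = cosine_similarity(vectors["king"], vectors["queen"])
appliance = cosine_similarity(vectors["king"], vectors["refrigerator"])
ranked = search(vectors["king"], [("refrigerator", vectors["refrigerator"]), ("queen", vectors["queen"])])
print(round(royal, 3), round(appliance, 3), [title for title, _ in ranked])
```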

Embedding models are separate from generation models and much cheaper to run. OpenAI’s text-embedding-3-small costs $0.02 per million tokens — 125x cheaper than GPT-4o input pricing. They are the backbone of retrieval-augmented generation (RAG) systems, recommendation engines, and clustering applications.

The Practical Upshot

These parameters are not academic trivia. They are the control surface of a technology that is rapidly becoming infrastructure.

If you are building on LLM APIs: optimize your token usage (shorter prompts, smaller models for simple tasks, prompt caching), size your context usage to actual needs (a call that fills 128K tokens costs far more than one that uses 4K), and set temperature deliberately based on your use case.

If you are evaluating AI products: ask what model they use, what context window they provision, and how they handle conversations that exceed it. The answers will tell you more about the product’s quality than any marketing page.

If you are just using chat interfaces: know that the “forgetfulness” you experience in long conversations is not a flaw in the AI — it is the context window filling up and old messages being dropped. Start a new conversation for new topics. Put important context at the beginning of your message, not buried in the middle.

The models will keep getting better. The parameters will remain the same. Understanding them is a durable investment.
