Every interaction with a large language model — whether you are chatting with Claude, calling the OpenAI API, or running Llama locally — is governed by a small set of parameters that most users never examine. Tokens determine what you pay. Context windows determine what the model can see. Temperature determines how it chooses its words.
These are not abstract concepts. They are the engineering constraints that shape every AI product you use, and they explain most of the “weird behavior” people encounter. Here is how they actually work.
Language models do not process text character by character, and they do not process it word by word. They use tokens — subword units generated by a tokenization algorithm, almost always a variant of Byte Pair Encoding (BPE).
BPE starts with individual characters (or bytes), then iteratively merges the most frequent adjacent pair into a new token. After tens of thousands of merges, you end up with a vocabulary that efficiently encodes common patterns, typically somewhere between 32,000 token types (Llama 2) and about 200,000 (GPT-4o).
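The merge loop can be sketched in a few lines. This is a toy illustration, not any production tokenizer: it works on whole characters rather than bytes, ignores spaces and pre-tokenization, and the names `train_bpe`, `words`, and `num_merges` are made up for the example.

```python
from collections import Counter

def train_bpe(words, num_merges):
    """Learn BPE merges from a list of words (toy version, character-level)."""
    # Each word starts as a tuple of single-character symbols.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Ties go to the pair seen first (dict insertion order).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus
```

Calling `train_bpe(["low", "low", "lower", "lowest"], 2)` first fuses `l` and `o`, then `lo` and `w`, so the frequent word "low" ends up as a single symbol while the rarer suffixes stay split.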
The tokenization is not intuitive. “Hello” is typically one token. “Unbelievable” may split into three (“Un”, “believ”, “able”), though the exact pieces depend on the tokenizer. A space before a word is often part of that word’s token. Numbers are especially tricky: “123456” might be tokenized as [“123”, “456”] or [“12”, “345”, “6”] depending on the tokenizer, which partly explains why LLMs struggle with arithmetic.
The rule of thumb: 1 token is roughly 3/4 of an English word, or about 4 characters. A 1,000-word article is approximately 1,300 tokens. A full novel (80,000 words) is about 106,000 tokens.
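As a quick sanity check, the rule of thumb can be turned into a one-line estimator. The function name `estimate_tokens` is hypothetical, and the result is only a rough English-text estimate; a real tokenizer will give different counts.

```python
def estimate_tokens(word_count, words_per_token=0.75):
    """Rough token estimate for English text from a word count."""
    return round(word_count / words_per_token)
```

With the default ratio, `estimate_tokens(1000)` gives 1333 and `estimate_tokens(80_000)` gives 106667, in line with the figures above.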
Every commercial LLM API charges per token, and the economics are not symmetric — generating tokens (output) costs more than reading them (input), because generation requires running the full model forward pass for each token sequentially, while input tokens can be processed in parallel.
Here is what the major providers charge as of early 2026:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| GPT-4o | $2.50 | $10.00 |
| GPT-4o mini | $0.15 | $0.60 |
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3.5 Haiku | $0.80 | $4.00 |
| Gemini 1.5 Pro | $1.25 | $5.00 |
| Llama 3.1 405B (via Together) | $3.50 | $3.50 |
| DeepSeek V3 (API) | $0.27 | $1.10 |
These differences compound fast. A customer support application processing 10 million tokens per day — not unusual for a mid-size company — would pay $25/day with GPT-4o input, but $2.70/day with DeepSeek. Over a year, that is the difference between $9,125 and $986. Choosing your model and optimizing your prompts is not an engineering nicety; it is a business decision.
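The arithmetic behind that comparison is easy to reproduce. The sketch below hard-codes two input prices from the table; the dictionary and function names are illustrative, and output-token costs are deliberately left out to match the example.

```python
# Input prices in USD per 1M tokens, copied from the table above.
PRICE_PER_MILLION_INPUT = {"GPT-4o": 2.50, "DeepSeek V3": 0.27}

def daily_input_cost(model, tokens_per_day):
    """Daily input cost in USD for a given token volume."""
    return PRICE_PER_MILLION_INPUT[model] * tokens_per_day / 1_000_000

def annual_input_cost(model, tokens_per_day, days=365):
    """Daily cost scaled to a year of constant usage."""
    return daily_input_cost(model, tokens_per_day) * days
```

At 10 million tokens per day, this gives $25.00 per day and $9,125 per year for GPT-4o, against $2.70 per day and $985.50 per year for DeepSeek V3.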
The context window is the total number of tokens the model can process in a single call — your input (system prompt + conversation history + current message) plus its generated output, combined. It is a hard ceiling, not a suggestion.
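Because input and output share one budget, applications have to decide what to leave out when a conversation grows. One common approach, assumed here for illustration rather than taken from any specific product, is to reserve room for the reply and drop the oldest messages first. The function name and the `(text, token_count)` message format are hypothetical.

```python
def trim_history(history, system_tokens, reserved_output_tokens, context_window):
    """Keep the newest messages that fit in the window.

    history: list of (text, token_count) tuples, oldest first.
    Returns the kept messages, still oldest first.
    """
    budget = context_window - system_tokens - reserved_output_tokens
    if budget < 0:
        raise ValueError("system prompt and reserved output exceed the context window")
    kept, used = [], 0
    # Walk from newest to oldest, stopping at the first message that does not fit.
    for text, tokens in reversed(history):
        if used + tokens > budget:
            break
        kept.append((text, tokens))
        used += tokens
    kept.reverse()
    return kept
```

With a 100-token window, a 20-token system prompt, and 10 tokens reserved for output, a history of 50, 30, and 40 tokens keeps only the last two messages.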
The progression has been dramatic. GPT-3 launched with a 2,048-token window, and GPT-3.5’s 4,096-token window could hold roughly 3,000 words, a few pages. Claude 3’s 200K window holds about 150,000 words, an entire book. Google’s Gemini 1.5 Pro accepts 1 million tokens, enough for approximately 750,000 words, which is most of the complete works of Shakespeare (roughly 885,000 words).
The limit is not arbitrary — it stems from the self-attention mechanism at the heart of the transformer architecture. Standard self-attention has O(n^2) computational complexity, where n is the sequence length. Doubling the context window quadruples the computation in the attention layers.
A model processing 4,096 tokens computes roughly 16.8 million attention interactions per layer. At 128K tokens, that number balloons to 16.4 billion per layer. At 1M tokens, it is 1 trillion.
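These figures follow directly from squaring the sequence length. A minimal check, treating 128K as 128,000 tokens to match the number quoted above (the function name is illustrative):

```python
def attention_interactions(n_tokens):
    """Pairwise token interactions per layer under standard self-attention."""
    return n_tokens ** 2

for n in (4_096, 128_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_interactions(n):,} interactions")
```

The counts come out to 16,777,216, then 16,384,000,000, then 1,000,000,000,000, and doubling `n_tokens` always multiplies the result by four.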
This is why researchers have developed techniques like:
Having a large context window does not mean the model uses all of it equally well. A 2023 Stanford/Berkeley paper (“Lost in the Middle”) showed that models perform best when relevant information is at the beginning or end of the context, and worst when it is buried in the middle. This has improved with each model generation, but it is not fully solved — if you are building RAG applications, put the most relevant retrieved documents first.
When a model generates text, it does not pick one word and move on. At each step, it computes a probability distribution over its entire vocabulary — every possible next token gets a probability. Temperature controls the shape of that distribution before sampling.
The math is simple. Before sampling, the model’s raw output scores (logits) are divided by the temperature value, then passed through softmax. Temperature = 1.0 leaves the distribution unchanged. Lower values sharpen it (making the top choice more dominant). Higher values flatten it (giving unlikely tokens a better chance).
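The divide-then-softmax step can be written out directly. This is a plain-Python sketch with made-up logits, not the code any particular inference library uses; `softmax_with_temperature` is an illustrative name.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; treat 0 as greedy argmax instead")
    scaled = [x / temperature for x in logits]
    # Subtract the max before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running it shows the top token's probability growing as the temperature falls below 1.0 and shrinking as it rises above 1.0.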
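Temperature 0 (or near-zero): The model always picks the highest-probability token. Since dividing by zero is undefined, implementations treat 0 as a special case and skip sampling entirely. Outputs are repetitive and close to deterministic, though hosted APIs can still show small run-to-run differences. Use this for factual questions, data extraction, and classification: anywhere creativity is a liability.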
Temperature 0.5-0.7: The sweet spot for most applications. Enough randomness to avoid robotic repetition, enough constraint to stay coherent. Most chat products default to this range.
Temperature 1.0+: The model takes risks. Good for brainstorming, creative writing, generating diverse options. Above 1.5, outputs often become incoherent.
Top-p provides a different knob for controlling randomness. Instead of rescaling the entire distribution, it truncates it. With top-p = 0.9, the model sorts tokens by probability, takes the smallest set whose cumulative probability reaches 90%, and samples only from those tokens.
The effect: top-p prevents the model from ever choosing extremely improbable tokens (the long tail), while still allowing variety among the plausible options. Temperature adjusts the shape; top-p adjusts the cutoff.
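A small sketch makes the truncation concrete. The probabilities are invented, and the function names `top_p_filter` and `sample_top_p` are illustrative rather than any library's API.

```python
import random

def top_p_filter(probs, top_p=0.9):
    """Return {token_index: renormalized probability} for the nucleus."""
    ranked = sorted(enumerate(probs), key=lambda item: item[1], reverse=True)
    nucleus, cumulative = [], 0.0
    # Take tokens in descending order until the cumulative mass reaches top_p.
    for index, p in ranked:
        nucleus.append((index, p))
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(p for _, p in nucleus)
    return {index: p / total for index, p in nucleus}

def sample_top_p(probs, top_p=0.9, rng=random):
    """Sample one token index from the truncated, renormalized distribution."""
    nucleus = top_p_filter(probs, top_p)
    indices = list(nucleus)
    weights = [nucleus[i] for i in indices]
    return rng.choices(indices, weights=weights, k=1)[0]
```

For `probs = [0.5, 0.3, 0.15, 0.05]` and `top_p = 0.9`, the first three tokens form the nucleus and the 5% tail token can never be sampled.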
Most APIs let you set both, but in practice you should tune one and leave the other at its default. Setting both simultaneously makes the interaction between them hard to predict.
The system prompt is a message injected at the beginning of the context, before the user’s input, that sets the model’s behavior. Together with training, it is part of the reason ChatGPT introduces itself as “ChatGPT” and refuses certain requests, and why Claude says “I’d be happy to help.”
System prompts consume tokens from your context window. Anthropic’s default system prompt for Claude is approximately 1,200 tokens. If you are building an application with a detailed system prompt (persona, rules, examples, few-shot demonstrations), this can easily reach 2,000-4,000 tokens — a meaningful chunk of a smaller context window.
A useful optimization: move static context (documentation, rules) into the system prompt rather than repeating it in every user message, since many providers cache system prompts across calls, reducing latency and cost.
Hallucination is not a bug in the traditional sense; it is a direct consequence of how these models work. The model generates a high-probability continuation of the text. If the training data contained no information about a topic, the model does not say “I don’t know” by default (that behavior must be trained in via RLHF). Instead, it generates the most plausible-sounding text, which may be factually wrong.
Common hallucination patterns:
The rate has decreased substantially with each model generation. GPT-4 hallucinated significantly less than GPT-3.5, and Claude 3.5 Sonnet less than Claude 3 Opus. But the rate is not zero, and for applications where factual accuracy is critical (medical, legal, financial), independent verification remains essential.
Embeddings convert text into dense numerical vectors, typically a few hundred to a few thousand dimensions (OpenAI’s text-embedding-3 models use 1,536 and 3,072), where geometric proximity corresponds to semantic similarity. “King” and “queen” are close together. “King” and “refrigerator” are far apart.
This enables semantic search: instead of matching keywords, you match meaning. A search for “how to fix a broken pipe” will match documents about plumbing repairs even if they never use the words “fix” or “broken.”
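Proximity is usually measured with cosine similarity, and semantic search then amounts to ranking documents by that score. The sketch below uses invented 3-dimensional vectors in place of real embeddings, which would come from an embedding model; all names and numbers are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vector, documents):
    """Return document names sorted from most to least similar to the query."""
    return sorted(
        documents,
        key=lambda name: cosine_similarity(query_vector, documents[name]),
        reverse=True,
    )

# Invented toy vectors; real embeddings have hundreds or thousands of dimensions.
TOY_VECTORS = {
    "plumbing repair guide": [0.90, 0.80, 0.10],
    "pipe leak troubleshooting": [0.85, 0.90, 0.15],
    "refrigerator manual": [0.10, 0.05, 0.95],
}
QUERY = [0.88, 0.82, 0.12]  # stands in for "how to fix a broken pipe"
```

Here `rank_by_similarity(QUERY, TOY_VECTORS)` places both plumbing documents ahead of the refrigerator manual, even though no keywords are compared.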
Embedding models are separate from generation models and much cheaper to run. OpenAI’s text-embedding-3-small costs $0.02 per million tokens — 125x cheaper than GPT-4o input pricing. They are the backbone of retrieval-augmented generation (RAG) systems, recommendation engines, and clustering applications.
These parameters are not academic trivia. They are the control surface of a technology that is rapidly becoming infrastructure.
If you are building on LLM APIs: optimize your token usage (shorter prompts, smaller models for simple tasks, prompt caching), size your context to actual needs (a call that fills 128K tokens costs 32 times as much in input fees as one that fills 4K), and set temperature deliberately based on your use case.
If you are evaluating AI products: ask what model they use, what context window they provision, and how they handle conversations that exceed it. The answers will tell you more about the product’s quality than any marketing page.
If you are just using chat interfaces: know that the “forgetfulness” you experience in long conversations is not a flaw in the AI — it is the context window filling up and old messages being dropped. Start a new conversation for new topics. Put important context at the beginning of your message, not buried in the middle.
The models will keep getting better. The parameters will remain the same. Understanding them is a durable investment.