Sampling strategies: temperature, top-k, top-p, beam search

Once the model has computed a probability distribution over the vocabulary, something must pick the next token. That choice determines whether the output is creative or conservative, diverse or repetitive. This lesson covers the five most common strategies — and when to use each one.

Strategy 1 — Greedy decoding

ℹ️Greedy decoding always picks the token with the highest probability: `next = argmax P(t | context)`. It is deterministic — the same prompt always produces the same output. The problem: greedy is prone to repetitive loops and "safe" but boring completions. By picking the locally best token at each step, it can miss globally better sequences.
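As a minimal sketch of the argmax rule, assuming the model's output is already available as a plain list of probabilities (the vocabulary and numbers here are made up for illustration):

```python
def greedy_pick(probs):
    """Return the index of the highest-probability token (first one on ties)."""
    return max(range(len(probs)), key=lambda i: probs[i])

# Hypothetical 4-token vocabulary and next-token distribution.
vocab = ["the", "cat", "sat", "mat"]
probs = [0.10, 0.55, 0.30, 0.05]

next_token = vocab[greedy_pick(probs)]
print(next_token)  # cat
```

Because nothing here is random, calling it again with the same distribution always returns the same token.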

Strategy 2 — Temperature

ℹ️Temperature T divides all logits before softmax: `probs = softmax(logits / T)`. At T < 1 the distribution sharpens — the model becomes more confident in the top tokens. At T > 1 it flattens — lower-probability tokens get a bigger share. As T → 0, behavior approaches greedy. Typical values: 0.6–0.8 for precise tasks (code, facts), 1.0–1.2 for creative writing.
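The formula `probs = softmax(logits / T)` can be sketched in plain Python as follows. The function name and the three example logits are assumptions for illustration, not part of any particular library:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; temperature must be > 0."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use argmax for the T -> 0 limit")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
for t in (0.5, 1.0, 2.0):
    # The top token's share shrinks as the temperature rises.
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Running the loop should show the first token's probability falling as T goes from 0.5 to 2.0, which is the sharpening and flattening described above.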

When the three filters are combined, as in this lesson's interactive demo, they are applied in order: temperature first, then top-k, then top-p. Tokens excluded by a filter cannot be sampled.

Strategy 3 — Top-k sampling

ℹ️Top-k sampling keeps only the k tokens with the highest probability, zeros out the rest, and renormalizes. This cuts the tail of the distribution, preventing the model from ever choosing a very improbable (and usually incoherent) token. Typical values: k = 20–100. Limitation: k is fixed and does not adapt to how confident the model is at each position.
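A small sketch of the keep, zero, and renormalize steps, assuming the distribution is a plain list of probabilities (the helper name and the numbers are illustrative):

```python
def top_k_filter(probs, k):
    """Keep the k most probable tokens, zero the rest, renormalize."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices sorted by probability, highest first (ties keep index order).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = order[:k]
    kept_mass = sum(probs[i] for i in kept)
    keep = set(kept)
    return [probs[i] / kept_mass if i in keep else 0.0 for i in range(len(probs))]

probs = [0.5, 0.3, 0.1, 0.06, 0.04]  # hypothetical distribution
filtered = top_k_filter(probs, k=2)
print([round(p, 3) for p in filtered])  # [0.625, 0.375, 0.0, 0.0, 0.0]
```

Note that k = 2 keeps exactly two tokens no matter how the probability mass is spread, which is the non-adaptive behavior mentioned above.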

Strategy 4 — Top-p (nucleus) sampling

ℹ️Top-p (nucleus) sampling selects the smallest set of tokens whose cumulative probability is at least p (typically 0.9–0.95), zeros out the rest, and renormalizes. Unlike top-k, the nucleus size is adaptive: when the model is confident (for example, a single token already holds 0.95 of the probability mass), the nucleus contains just that 1 token; when the model is uncertain, it may contain 50 or more. This makes top-p robust across different confidence levels.
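The adaptive nucleus can be sketched like this, again assuming a plain list of probabilities. The two example distributions are invented to contrast a confident and an uncertain model:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of top tokens with cumulative probability >= p."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    keep = set(nucleus)
    # Dividing by the kept mass renormalizes the survivors to sum to 1.
    return [probs[i] / cumulative if i in keep else 0.0 for i in range(len(probs))]

confident = [0.96, 0.02, 0.01, 0.01]      # one dominant token
uncertain = [0.30, 0.25, 0.20, 0.15, 0.10]  # mass spread out
print(sum(x > 0 for x in top_p_filter(confident, 0.95)))  # 1
print(sum(x > 0 for x in top_p_filter(uncertain, 0.95)))  # 5
```

The same p = 0.95 yields a one-token nucleus in the first case and a five-token nucleus in the second.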

Strategy 5 — Beam search

ℹ️Beam search maintains the k best partial sequences at each step, where k is the beam width (a different k from the one in top-k). Each sequence is expanded with every possible next token, and the k highest-scoring candidates survive. The winner is the sequence with the highest total log-probability. Beam search is deterministic and popular in machine translation (width 4–8). It tends to produce poor results for open-ended generation: the output is flat, repetitive, and lacks diversity.
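The expand-and-prune loop can be sketched on a toy model. Everything below is an assumption made for illustration: `TABLE` is a made-up lookup in which the next-token distribution depends only on the last token, and the sketch runs for a fixed number of steps with no end-of-sequence handling or length normalization.

```python
import math

def beam_search(next_token_probs, start, width, steps):
    """Return the `width` best (sequence, log-prob) pairs after `steps` expansions."""
    beams = [(list(start), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for token, prob in next_token_probs(seq).items():
                if prob > 0.0:
                    candidates.append((seq + [token], score + math.log(prob)))
        # Keep only the highest-scoring sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:width]
    return beams

# Hypothetical toy model: the distribution depends only on the last token.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"a": 0.4, "b": 0.3, "c": 0.3},
    "b": {"c": 0.9, "a": 0.1},
    "c": {"c": 1.0},
}

def toy_model(seq):
    return TABLE[seq[-1]]

best_seq, best_score = beam_search(toy_model, ["<s>"], width=2, steps=2)[0]
greedy_seq, greedy_score = beam_search(toy_model, ["<s>"], width=1, steps=2)[0]
print(best_seq, greedy_seq)  # ['<s>', 'b', 'c'] ['<s>', 'a', 'a']
```

With width 1 the search reduces to greedy decoding and commits to "a" (0.6), ending with a sequence probability of 0.24. With width 2 it keeps "b" alive and finds the sequence "b", "c" with probability 0.36, a globally better result that the locally best first choice would have missed.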

Side-by-side comparison

Five strategies — choose by task type

| Strategy | Deterministic | Parameter | Best for | Weak point |
| --- | --- | --- | --- | --- |
| Greedy | Yes | none | Fast prototyping, exact tasks | Repetitive, local optimum |
| Temperature | No | T (0.1–2.0) | Creative writing, diversity | High T → incoherence |
| Top-k | No | k (20–100) | Controlled tail cutoff | k does not adapt to confidence |
| Top-p | No | p (0.9–0.95) | Adaptive nucleus, chat models | Interacts poorly with high T |
| Beam search | Yes | width (4–8) | Translation, structured output | Boring open-ended text |

💡Practical recipe for chat models: temperature = 0.7–0.8 combined with top-p = 0.9–0.95. OpenAI's API documentation generally recommends altering temperature or top-p but not both (change one, leave the other at its default), yet a moderate combination of the two is a common choice that often works well in practice. For code generation: temperature = 0.2 with top-p = 0.95, or plain greedy decoding.
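One way the recipe's settings could fit together in a single sampling step is sketched below, applying temperature first, then top-k, then top-p. The function `sample_next`, its defaults, and the example logits are assumptions for illustration; real inference libraries expose these knobs through their own APIs and may differ in details such as whether top-p is measured before or after top-k renormalization.

```python
import math
import random

def sample_next(logits, temperature=0.7, top_k=50, top_p=0.95, rng=random):
    """Sample a token index: temperature, then top-k, then top-p."""
    # 1. Temperature-scaled softmax (max subtracted for stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # 2. Top-k: keep the k most probable token indices.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    order = order[:top_k]

    # 3. Top-p: shortest prefix whose renormalized mass reaches top_p.
    kept_mass = sum(probs[i] for i in order)
    nucleus, cumulative = [], 0.0
    for i in order:
        nucleus.append(i)
        cumulative += probs[i] / kept_mass
        if cumulative >= top_p:
            break

    # 4. Sample from the survivors; random.choices normalizes the weights.
    weights = [probs[i] for i in nucleus]
    return rng.choices(nucleus, weights=weights, k=1)[0]

rng = random.Random(0)  # fixed seed so the run is repeatable
logits = [3.0, 2.5, 1.0, -1.0, -4.0]  # hypothetical logits
chat_samples = [sample_next(logits, 0.7, 50, 0.95, rng) for _ in range(10)]
code_sample = sample_next(logits, temperature=0.2, top_k=50, top_p=0.95, rng=rng)
print(chat_samples, code_sample)
```

For these logits, both settings leave only the two most likely tokens in the nucleus, so every sample is index 0 or 1. Setting `top_k=1` turns the function into greedy decoding.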

Key takeaways

Sampling strategy is a dial between determinism and diversity. No single strategy is best for all tasks — the right choice depends on whether you need exact, creative, or structured output.

Greedy = argmax at each step. Deterministic, fast, repetitive

Temperature T divides logits before softmax: T < 1 sharpens, T > 1 flattens

Top-k cuts the tail to the k most probable tokens — good but not adaptive

Top-p (nucleus) keeps the smallest set summing to p — adaptive to model confidence

Beam search = k best sequences in parallel — excellent for translation, poor for chat

Practical default: temperature 0.7–0.8 + top-p 0.9–0.95