Sampling strategies: temperature, top-k, top-p and beam search
Once the model has computed a probability distribution over the vocabulary, something must pick the next token. That choice determines whether the output is creative or conservative, diverse or repetitive. This lesson covers the five most common strategies — and when to use each one.
Strategy 1 — Greedy decoding
ℹ️Greedy decoding always picks the token with the highest probability: `next = argmax P(t | context)`. It is deterministic — the same prompt always produces the same output. The problem: greedy is prone to repetitive loops and "safe" but boring completions. By picking the locally best token at each step, it can miss globally better sequences.
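As a minimal sketch of the argmax rule above, assuming the distribution is available as a plain dictionary (the helper name `greedy_pick` and the probabilities are illustrative, not taken from any library):

```python
def greedy_pick(probs):
    """Return the token with the highest probability (first one wins on ties)."""
    return max(probs, key=probs.get)

# Hypothetical next-token distribution, for illustration only.
probs = {"roof": 0.47, "sofa": 0.26, "floor": 0.12, "window": 0.06,
         "table": 0.04, "bed": 0.03, "water": 0.01, "yard": 0.01}

print(greedy_pick(probs))  # roof
```

Because there is no randomness, calling this repeatedly on the same distribution always returns the same token.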
Strategy 2 — Temperature
ℹ️Temperature T divides all logits before softmax: `probs = softmax(logits / T)`. At T < 1 the distribution sharpens — the model becomes more confident in the top tokens. At T > 1 it flattens — lower-probability tokens get a bigger share. As T → 0, behavior approaches greedy. Typical values: 0.6–0.8 for precise tasks (code, facts), 1.0–1.2 for creative writing.
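The formula `probs = softmax(logits / T)` can be sketched in a few lines of standard-library Python. The function name and the logit values below are assumptions made for illustration:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide logits by T, then apply a numerically stable softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; use argmax for the T -> 0 limit")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical logits
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Running it should show the top token's share growing as T drops below 1 and shrinking as T rises above 1.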
Try it: move the sliders and watch the distribution change
The interactive demo shows token probabilities as bars after applying all three filters in order: temperature first, then top-k, then top-p. Grayed-out tokens are excluded from sampling.
[Interactive demo: sliders for temperature, top-k (0 = off), and top-p (1.0 = off)]
Default state: T = 1.0, top-k off, top-p off; 8 active tokens; entropy 2.09 bits; top token "roof" (47%).
| Token | Probability |
|---|---|
| roof | 47% |
| sofa | 26% |
| floor | 12% |
| window | 6% |
| table | 4% |
| bed | 3% |
| water | 1% |
| yard | 1% |
Strategy 3 — Top-k sampling
ℹ️Top-k sampling keeps only the k tokens with the highest probability, zeros out the rest, and renormalizes. This cuts the tail of the distribution, preventing the model from ever choosing a very improbable (and usually incoherent) token. Typical values: k = 20–100. Limitation: k is fixed and does not adapt to how confident the model is at each position.
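The keep, zero out, renormalize steps can be sketched as follows. This is an illustrative helper, not a library function; treating `k <= 0` as "filter off" is a convention assumed here:

```python
def top_k_filter(probs, k):
    """Keep the k most probable entries, zero the rest, renormalize."""
    if k <= 0 or k >= len(probs):
        return list(probs)  # assumed convention: k <= 0 disables the filter
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(ranked[:k])
    kept_mass = sum(probs[i] for i in keep)
    return [probs[i] / kept_mass if i in keep else 0.0
            for i in range(len(probs))]

# Hypothetical distribution over 8 tokens, most probable first.
probs = [0.47, 0.26, 0.12, 0.06, 0.04, 0.03, 0.01, 0.01]
print([round(p, 3) for p in top_k_filter(probs, 3)])
```

With `k = 3`, only the first three entries stay non-zero, and they are rescaled so that they sum to 1.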
Strategy 4 — Top-p (nucleus) sampling
ℹ️Top-p (nucleus) sampling selects the smallest set of tokens whose cumulative probability is at least p (typically 0.9–0.95), zeros out the rest, and renormalizes. Unlike top-k, the nucleus size is adaptive: when the model is confident and a single token already carries at least p of the probability mass (for example, 0.96 with p = 0.95), the nucleus has 1 token; when the model is uncertain, the nucleus may have 50. This makes top-p robust across different confidence levels.
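A minimal sketch of the nucleus rule, showing how the number of kept tokens changes with the shape of the distribution. The function name and both example distributions are assumptions made for illustration:

```python
def top_p_filter(probs, p):
    """Keep the smallest top-ranked set with cumulative probability >= p."""
    if p >= 1.0:
        return list(probs)  # assumed convention: p = 1.0 disables the filter
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = set(), 0.0
    for i in ranked:
        keep.add(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    # Renormalize over the kept tokens only.
    return [probs[i] / cumulative if i in keep else 0.0
            for i in range(len(probs))]

uncertain = [0.47, 0.26, 0.12, 0.06, 0.04, 0.03, 0.01, 0.01]  # hypothetical
confident = [0.96, 0.02, 0.01, 0.01]                          # hypothetical

print(sum(1 for x in top_p_filter(uncertain, 0.9) if x > 0))   # 4
print(sum(1 for x in top_p_filter(confident, 0.95) if x > 0))  # 1
```

The same threshold keeps four tokens in the flatter distribution and one in the peaked distribution, which is the adaptivity that a fixed k lacks.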
Strategy 5 — Beam search
ℹ️Beam search maintains the k best partial sequences at each step, where k is the beam width. Each sequence is expanded with every possible next token, and only the k highest-scoring candidates survive. The winner is the sequence with the highest total log-probability. Beam search is deterministic and popular in machine translation (width 4–8). It tends to produce poor results for open-ended generation: the output is flat, repetitive, and lacks diversity.
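The expand-and-prune loop can be sketched on a toy model. Everything here is hypothetical: the `TOY` table stands in for a real language model, `"</s>"` is an assumed end-of-sequence marker, and no length normalization is applied:

```python
import math

EOS = "</s>"  # assumed end-of-sequence marker

# Hypothetical toy "model": next-token probabilities for each prefix.
TOY = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("b",): {"z": 0.9, "w": 0.1},
}

def toy_next_log_probs(tokens):
    """Return log-probabilities of the next token; unknown prefixes end."""
    dist = TOY.get(tokens, {EOS: 1.0})
    return {tok: math.log(p) for tok, p in dist.items()}

def beam_search(next_log_probs, beam_width, max_len):
    beams = [((), 0.0)]  # each beam: (tokens, total log-probability)
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            if tokens and tokens[-1] == EOS:
                candidates.append((tokens, score))  # finished: carry over
                continue
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        # Keep only the beam_width highest-scoring sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(t and t[-1] == EOS for t, _ in beams):
            break
    return beams[0]

print(beam_search(toy_next_log_probs, 1, 3)[0])  # ('a', 'x', '</s>')
print(beam_search(toy_next_log_probs, 2, 3)[0])  # ('b', 'z', '</s>')
```

With width 1 the search reduces to greedy decoding and commits to "a" (total probability 0.30). With width 2 it keeps "b" alive long enough to find the higher-scoring sequence "b z" (0.36).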
Side-by-side comparison
Five strategies — choose by task type
| Strategy | Deterministic | Parameter | Best for | Weak point |
|---|---|---|---|---|
| Greedy | Yes | — | Fast prototyping, exact tasks | Repetitive, local optimum |
| Temperature | No | T (0.1–2.0) | Creative writing, diversity | High T → incoherence |
| Top-k | No | k (20–100) | Controlled tail cutoff | k does not adapt to confidence |
| Top-p | No | p (0.9–0.95) | Adaptive nucleus, chat models | Interacts poorly with high T |
| Beam search | Yes | width (4–8) | Translation, structured output | Boring open-ended text |
💡Practical recipe for chat models: temperature = 0.7–0.8 combined with top-p = 0.9–0.95. OpenAI's API documentation generally recommends altering temperature or top-p but not both (change one, leave the other at its default), but a moderate combination often works well in practice. For code generation: temperature = 0.2 with top-p = 0.95, or simply greedy decoding.
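To make the recipe concrete, here is a hedged sketch of a single sampling step that applies temperature, then top-k, then top-p, and finally draws a token. The function `sample_next`, its defaults, and the logits are illustrative assumptions, not the API of any particular library:

```python
import math
import random

def sample_next(logits, temperature=0.8, top_k=0, top_p=0.95, rng=random):
    """Sample one token from a {token: logit} dict.

    Order of filters: temperature, then top-k (0 = off), then top-p (1.0 = off).
    """
    if temperature <= 0:
        raise ValueError("temperature must be > 0")
    tokens = list(logits)
    # 1. Temperature-scaled softmax (numerically stable).
    scaled = [logits[t] / temperature for t in tokens]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    ranked = sorted(zip(tokens, (e / total for e in exps)),
                    key=lambda pair: pair[1], reverse=True)
    # 2. Top-k: keep the k most probable tokens.
    if top_k > 0:
        ranked = ranked[:top_k]
    # 3. Top-p: smallest prefix whose renormalized mass reaches top_p.
    mass = sum(p for _, p in ranked)
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p / mass
        if cumulative >= top_p:
            break
    # random.choices accepts unnormalized weights.
    names = [tok for tok, _ in kept]
    weights = [p for _, p in kept]
    return rng.choices(names, weights=weights, k=1)[0]

# Hypothetical logits: logs of an example distribution over 8 tokens.
logits = {tok: math.log(p) for tok, p in {
    "roof": 0.47, "sofa": 0.26, "floor": 0.12, "window": 0.06,
    "table": 0.04, "bed": 0.03, "water": 0.01, "yard": 0.01}.items()}

rng = random.Random(0)  # seeded only to make the demo repeatable
print([sample_next(logits, temperature=0.8, top_p=0.95, rng=rng)
       for _ in range(5)])
```

Setting `top_k=1` makes the step equivalent to greedy decoding, which is one way to see the strategies as points on the same dial.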
Key takeaways
Sampling strategy is a dial between determinism and diversity. No single strategy is best for all tasks — the right choice depends on whether you need exact, creative, or structured output.
Greedy = argmax at each step. Deterministic, fast, repetitive
Temperature T divides logits before softmax: T < 1 sharpens, T > 1 flattens
Top-k cuts the tail to the k most probable tokens — good but not adaptive
Top-p (nucleus) keeps the smallest set summing to p — adaptive to model confidence
Beam search = k best sequences in parallel — excellent for translation, poor for chat
Practical default: temperature 0.7–0.8 + top-p 0.9–0.95