A language model learns to compute P(next token | all previous tokens). That is enough to model whole sequences because, by the multiplication (chain) rule of probability, the probability of a sequence equals the product of the conditional probabilities of each token given everything before it. Models built on this principle are called autoregressive because each output feeds back as input for the next step.
Formally: P(x_1, …, x_n) = ∏_{t=1}^{n} P(x_t | x_1, …, x_{t-1})
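As a small numeric check of this factorization, the sketch below multiplies conditional probabilities token by token (summing logs to avoid underflow). The `BIGRAM` table, its tokens, and `sequence_log_prob` are hypothetical and chosen only for illustration; the toy model conditions on the previous token alone, whereas a real language model conditions on the full prefix.

```python
import math

# Hypothetical toy model: the next-token distribution depends only on the
# previous token (a bigram table). A real model conditions on the full prefix.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"</s>": 1.0},
    "dog": {"</s>": 1.0},
}

def sequence_log_prob(tokens):
    """Sum log P(token_t | previous token) over the sequence."""
    log_p = 0.0
    prev = "<s>"  # start-of-sequence marker
    for tok in tokens:
        log_p += math.log(BIGRAM[prev][tok])
        prev = tok
    return log_p

seq = ["the", "cat", "</s>"]
prob = math.exp(sequence_log_prob(seq))
print(prob)  # product 0.6 * 0.5 * 1.0, up to floating-point rounding
```

Because every row of the table sums to 1, the probabilities of all complete sequences also sum to 1, which is what makes the product a valid distribution over sequences.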
Softmax: σ(x_i) = e^{x_i} / ∑_j e^{x_j}. Every output is positive, and the outputs sum to 1. Because of this normalization, raising one logit, even slightly, takes probability away from every other token. The temperature parameter T divides all logits before softmax: low T sharpens the distribution, high T flattens it.
As T → 0 the distribution approaches winner-takes-all; as T → ∞ it approaches uniform.
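The effect of temperature can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name `softmax`, its `temperature` argument, and the example logits are assumptions made for this sketch. Subtracting the maximum before exponentiating is a standard stability trick that leaves the result unchanged.

```python
import math

def softmax(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical example logits
for T in (0.1, 1.0, 10.0):
    print(T, [round(p, 3) for p in softmax(logits, T)])
```

At low T almost all of the mass lands on the largest logit, and at high T the three probabilities move toward one third each.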
def generate(model, prompt_tokens, max_new=50):
    # softmax, sample, and EOS_TOKEN are assumed to be defined elsewhere
    tokens = list(prompt_tokens)      # context
    for _ in range(max_new):
        logits = model(tokens)        # [vocab_size]
        probs = softmax(logits)       # sum = 1.0
        next_token = sample(probs)    # pick one
        tokens.append(next_token)     # context grows
        if next_token == EOS_TOKEN:   # stop signal
            break
    return tokens
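To run the generate loop above end to end, the missing pieces have to be supplied. The following is one possible self-contained sketch under stated assumptions: `toy_model`, `VOCAB_SIZE`, the value of `EOS_TOKEN`, and the `seed` argument are all hypothetical stand-ins, and `toy_model` is a hand-written rule rather than a trained network. It expects a non-empty prompt.

```python
import math
import random

EOS_TOKEN = 0    # hypothetical id for the end-of-sequence token
VOCAB_SIZE = 5   # hypothetical toy vocabulary size

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample(probs, rng):
    """Draw one token id with probability given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def toy_model(tokens):
    """Stand-in for a trained network: favors (last token + 1) mod VOCAB_SIZE."""
    logits = [0.0] * VOCAB_SIZE
    logits[(tokens[-1] + 1) % VOCAB_SIZE] = 3.0
    return logits

def generate(model, prompt_tokens, max_new=50, seed=0):
    rng = random.Random(seed)         # seeded so runs are repeatable
    tokens = list(prompt_tokens)      # context
    for _ in range(max_new):
        probs = softmax(model(tokens))
        next_token = sample(probs, rng)
        tokens.append(next_token)     # context grows
        if next_token == EOS_TOKEN:   # stop signal
            break
    return tokens

out = generate(toy_model, [1, 2], max_new=20)
print(out)
```

Passing a seeded `random.Random` into `sample` is a design choice for reproducibility; with the same seed the same continuation is produced, while a different seed generally gives a different one.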
Each bar shows the probability the model assigned to that candidate token at that position. The green bar is the one that was sampled and appended to the context.