Autoregressive models: predicting the next token

ℹ️Token — the smallest unit of text that an LLM processes. A token is usually neither a single letter nor a whole word but something in between. For example, a BPE (byte-pair encoding) tokenizer might split the word "unbelievable" into three tokens: `un` + `believ` + `able`; the exact split depends on the tokenizer's learned vocabulary. Modern LLMs typically have vocabularies of roughly 32 000–128 000 tokens, and some are larger. Rule of thumb for English text: 1 token ≈ 4 characters, 100 tokens ≈ 75 words.
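To make the sub-word split concrete, here is a minimal sketch of how BPE applies an ordered list of merges to one word. The merge table `TOY_MERGES` is hypothetical and hand-written so that this one word comes out as `un` + `believ` + `able`; a real tokenizer learns its merges from a large corpus and may split the word differently.

```python
def bpe_encode(word, merges):
    """Apply an ordered list of BPE merges to a single word."""
    # Earlier merges have higher priority (lower rank).
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)  # start from individual characters
    while len(symbols) > 1:
        pairs = list(zip(symbols, symbols[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no learned merge applies any more
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])  # fuse the pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


# Hypothetical merge table, invented for this one word (not from a real tokenizer).
TOY_MERGES = [
    ("u", "n"), ("l", "e"), ("b", "le"), ("a", "ble"),
    ("b", "e"), ("l", "i"), ("be", "li"), ("e", "v"), ("beli", "ev"),
]

tokens = bpe_encode("unbelievable", TOY_MERGES)
print(tokens)  # ['un', 'believ', 'able']
```

With an empty merge table the same function falls back to single characters, which is why BPE can always encode unseen words.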

The chain rule: one conditional at a time

A language model learns to compute P(next token | all previous tokens). This is exactly the quantity the chain rule of probability (also called the multiplication rule) requires: the probability of a sequence equals the product of the conditional probabilities of each token given everything before it. Models built on this principle are called autoregressive because each output token is fed back as input for the next step.

Formally:

$$P(t_1, t_2, \ldots, t_n) = \prod_{i=1}^{n} P(t_i \mid t_1, \ldots, t_{i-1})$$

Each generation step evaluates one of these conditional distributions, which takes one forward pass through the model. To generate a 200-token response, the model therefore runs 200 forward passes.
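As a worked example of the chain rule, the sketch below scores a three-token sequence. The table `COND` is a hypothetical stand-in for a trained model, with probabilities invented for illustration. The sum is taken in log space, a common way to avoid underflow when many small probabilities are multiplied.

```python
import math

# Hypothetical conditional distributions P(next | context), invented for illustration.
COND = {
    (): {"the": 0.5, "a": 0.5},
    ("the",): {"cat": 0.4, "dog": 0.6},
    ("the", "cat"): {"sat": 0.25, "ran": 0.75},
}


def sequence_log_prob(tokens, cond):
    """Sum log P(t_i | t_1..t_{i-1}) over the sequence (chain rule in log space)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        context = tuple(tokens[:i])  # everything before position i
        total += math.log(cond[context][tok])
    return total


log_p = sequence_log_prob(["the", "cat", "sat"], COND)
p = math.exp(log_p)  # 0.5 * 0.4 * 0.25 = 0.05, up to floating-point rounding
print(round(p, 6))
```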

One generation step: the full pipeline

Each time the model picks the next token, it runs the same pipeline: Context → Transformer → Logits → Softmax → Probabilities → Sample → New token.

ℹ️Logits — the raw numbers output by the model's last layer, one per vocabulary token. Logits are not probabilities: they can be any real number, they do not sum to 1, they can be negative. The larger the logit, the more the model "prefers" that token. The softmax function converts logits into a proper probability distribution.

From logits to probabilities: softmax

Softmax: $\sigma(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$. Every output is positive, and all outputs sum to 1. Because every output shares the same denominator, raising one logit, even slightly, takes probability away from all the other tokens. The temperature parameter T divides all logits before softmax: low T (below 1) sharpens the distribution, high T (above 1) flattens it.
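The effect of temperature is easy to check numerically. The following is a minimal sketch in plain Python; the logit values are arbitrary, and subtracting the maximum before exponentiating is a standard stability trick that does not change the result.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, dividing by the temperature first."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max so exp() cannot overflow
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, -1.0]  # arbitrary example values; note the negative logit
for t in (0.5, 1.0, 2.0):
    probs = softmax(logits, t)
    print(t, [round(p, 3) for p in probs])
```

The probability of the top token is highest at T = 0.5 and lowest at T = 2.0, while the outputs sum to 1 at every temperature.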

The generation loop in code

```python
# model, softmax, sample and EOS_TOKEN are assumed to be defined elsewhere.
def generate(model, prompt_tokens, max_new=50):
    tokens = list(prompt_tokens)        # context: a copy of the prompt
    for _ in range(max_new):
        logits = model(tokens)          # one forward pass -> [vocab_size] scores
        probs = softmax(logits)         # non-negative, sum = 1.0
        next_token = sample(probs)      # pick one token id
        tokens.append(next_token)       # context grows by one token
        if next_token == EOS_TOKEN:     # stop signal
            break
    return tokens
```
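The loop above leaves `model`, `softmax`, `sample` and `EOS_TOKEN` undefined. The sketch below fills them in with toy stand-ins so the loop can actually be run. Everything here is an assumption made for illustration: `toy_model` is a hand-written scoring rule rather than a neural network, the vocabulary has five ids, and the `seed` parameter is added only to make runs repeatable.

```python
import math
import random

EOS_TOKEN = 0   # hypothetical id of the end-of-sequence token
VOCAB_SIZE = 5  # toy vocabulary: ids 0..4


def toy_model(tokens):
    """Stand-in for a real network: returns one logit per vocabulary id."""
    last = tokens[-1] if tokens else 0
    logits = [0.0] * VOCAB_SIZE
    logits[(last + 1) % VOCAB_SIZE] = 2.0   # prefer the id after the last one
    logits[EOS_TOKEN] += 0.3 * len(tokens)  # EOS gets likelier as context grows
    return logits


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample(probs, rng):
    """Draw one token id according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate(model, prompt_tokens, max_new=50, seed=0):
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new):
        probs = softmax(model(tokens))  # one forward pass per new token
        next_token = sample(probs, rng)
        tokens.append(next_token)
        if next_token == EOS_TOKEN:
            break
    return tokens


output = generate(toy_model, [1, 2], max_new=20)
print(output)
```

Replacing `sample` with an argmax over `probs` would turn this into greedy decoding, which always picks the single most likely token.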

Watch the context grow: token by token

Interactive chart: at each position, the bars show the probabilities the model assigned to the candidate tokens. The green bar marks the token that was sampled and appended to the context.

ℹ️Context window — the maximum number of tokens the model can "see" in one forward pass. GPT-2: 1 024 tokens. GPT-4 Turbo: 128 000. Claude 3: up to 200 000. Everything beyond the window is invisible to the model. Extending context is expensive — transformer attention complexity is O(n²) in sequence length.
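Two consequences of the window can be sketched in a few lines. `fit_to_window` shows one simple policy for an over-long history, keeping only the most recent tokens; this is an assumption for illustration, and real systems may use other strategies. `attention_pairs` counts the n × n scores in one full self-attention matrix, which is where the O(n²) cost comes from.

```python
def fit_to_window(tokens, window):
    """Keep only the most recent `window` tokens; older ones become invisible."""
    if window < 1:
        raise ValueError("window must be at least 1")
    return list(tokens[-window:])


def attention_pairs(n):
    """Number of query-key scores in one full self-attention matrix."""
    return n * n


history = list(range(1500))            # 1 500 token ids, oldest first
visible = fit_to_window(history, 1024)
print(len(visible), visible[0])        # 1024 476

# Going from a 1 024-token to a 128 000-token window is 125x more tokens
# but 125 * 125 = 15 625x more attention scores.
ratio = attention_pairs(128_000) // attention_pairs(1_024)
print(ratio)                           # 15625
```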

Key takeaways

An autoregressive language model generates text by repeatedly predicting a probability distribution over the next token, given all previous tokens, and then choosing one token from it. Each generation step is one complete forward pass through the model; there is no shortcut.

1 token ≈ 4 characters of English text. The BPE algorithm builds a vocabulary of sub-word units, typically 32 000–128 000 of them

The model predicts P(tᵢ | t₁…tᵢ₋₁) — the chain rule turns this into the probability of any sequence

Pipeline: Context → Transformer → Logits → Softmax → Probabilities → Sample → New token

Logits are raw scores, not probabilities. Softmax converts them: all ≥ 0, sum = 1

Generation is sequential — one forward pass per token. A 200-token reply costs 200 passes