Encoder vs decoder: BERT-style vs GPT-style models

BERT vs GPT: one Transformer, two training objectives

Self-attention can connect any positions — the mask decides what is allowed. In BERT-style encoders the attention mask is (almost) all-ones: for MLM the model sees left and right of [MASK] — bidirectional context. In GPT-style decoders, token t must not attend to future tokens: a causal mask keeps only the lower triangle. Use the widget: hover BERT tokens for the curves; on the GPT side press Next token and switch to Attention mask.

Content is available with subscription.

Get full access to all courses on the platform for one year with a single payment.

Unlike other platforms that charge per course, here you get everything for one price, and after one year of use there will be no automatic charge for the following year.