BERT vs GPT: one Transformer, two training objectives
Self-attention can connect any positions — the mask decides what is allowed. In BERT-style encoders the attention mask is (almost) all-ones: for MLM the model sees left and right of [MASK] — bidirectional context. In GPT-style decoders, token t must not attend to future tokens: a causal mask keeps only the lower triangle. Use the widget: hover BERT tokens for the curves; on the GPT side press Next token and switch to Attention mask.
Content is available with subscription.
Get full access to all courses on the platform for one year with a single payment.
▼
Unlike other platforms that charge per course, here you get everything for one price, and after one year of use there will be no automatic charge for the following year.