Quantization and faster inference: GPTQ, AWQ, speculative decoding
Serving a 70B-parameter model in FP16 needs roughly 140 GB of GPU memory for the weights alone (70 billion parameters at 2 bytes each), before counting the KV cache, activations, or batching. Quantization maps high-precision weights to fewer bits (INT8, INT4, or even lower), so the same model fits on cheaper hardware and can run faster: less data has to move from memory, and tensor cores can use low-precision integer math. This lesson walks through two post-training quantization (PTQ) methods used in production, GPTQ and AWQ, and then speculative decoding, a complementary technique that speeds up generation without changing the weights.
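The memory arithmetic and the basic idea of mapping weights to fewer bits can be sketched in a few lines. This is a minimal illustration, assuming plain round-to-nearest symmetric quantization with a single scale per tensor; it is not GPTQ or AWQ, which choose the quantized values more carefully. The function names `weight_memory_gb`, `quantize_symmetric`, and `dequantize` are hypothetical and exist only for this example.

```python
def weight_memory_gb(n_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def quantize_symmetric(weights, bits=8):
    """Round-to-nearest symmetric quantization with one scale per tensor.

    Assumption for illustration: real PTQ methods typically use
    per-channel or per-group scales rather than one per tensor.
    """
    qmax = 2 ** (bits - 1) - 1  # 127 for INT8, 7 for INT4
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round each weight to the nearest integer level and clamp to the range.
    q = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer levels back to approximate floating-point weights."""
    return [v * scale for v in q]


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"70B weights at {bits}-bit: {weight_memory_gb(70e9, bits):.0f} GB")

    w = [0.5, -1.0, 0.25, 0.0]
    q, scale = quantize_symmetric(w, bits=8)
    w_hat = dequantize(q, scale)
    print("integer levels:", q)
    print("max abs error:", max(abs(a - b) for a, b in zip(w, w_hat)))
```

With round-to-nearest, the reconstruction error of each weight is at most half a quantization step (`scale / 2`), so fewer bits means a larger step and a larger worst-case error. Reducing the accuracy impact of that error is the problem that methods such as GPTQ and AWQ address.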