Optimizers: how a neural network reaches the minimum
All optimizers solve the same problem: update weights so loss goes down. The difference is how they use the gradient. SGD is direct: a step opposite to the gradient. Momentum adds inertia: it remembers past steps. Adam is adaptive: it scales the step per parameter from gradient history. That adaptivity is why Adam is the default for transformers.
Content is available with subscription.
Get full access to all courses on the platform for one year with a single payment.
▼
Unlike other platforms that charge per course, here you get everything for one price, and after one year of use there will be no automatic charge for the following year.