CLIP: contrastive learning on image–text pairs

Before CLIP, image classifiers could only answer the question 'which of these 1000 categories is this?'. They needed thousands of labelled examples per category and could not handle concepts outside that fixed label set. CLIP changed this with a single insight: instead of predicting a fixed set of labels, train a model to judge whether an image and a piece of text describe the same thing. Trained on 400 million image–text pairs scraped from the internet, CLIP learned a shared visual–language embedding space in which an image and its description lie close together.
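To make the idea of "matching pairs lie close together" concrete, here is a minimal NumPy sketch of a symmetric contrastive loss over a batch of image and text embeddings. It is an illustration under stated assumptions, not CLIP's actual training code: the function names, the fixed temperature of 0.07, and the random vectors standing in for encoder outputs are all hypothetical choices made for this example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb and row i of text_emb are assumed to be a matching pair.
    # temperature=0.07 is an assumed constant for this sketch.
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature            # (batch, batch) similarity matrix
    idx = np.arange(logits.shape[0])
    # Image-to-text: each image should pick its own caption among all captions.
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    # Text-to-image: each caption should pick its own image among all images.
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Random vectors stand in for the outputs of hypothetical image and text encoders.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 8))
aligned_images = text_emb + 0.01 * rng.normal(size=(4, 8))  # near their captions
shuffled_images = aligned_images[::-1]                       # paired with wrong captions

loss_aligned = contrastive_loss(aligned_images, text_emb)
loss_shuffled = contrastive_loss(shuffled_images, text_emb)
print(loss_aligned < loss_shuffled)
```

The loss is low when each image embedding sits next to its own caption and high when the pairing is scrambled, which is the pressure that shapes a shared embedding space during training.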
