Vision–language models: LLaVA, GPT-4V-style architectures

CLIP showed that images and text can live in the same embedding space. Vision-language models (VLMs) take this further: they give a language model the ability to *understand* images. Beyond matching images to captions, a VLM can answer detailed questions, reason about spatial relationships, read text in images, and describe complex scenes. LLaVA makes the high-level recipe explicit: a visual encoder (CLIP-based in LLaVA), a projection layer that maps visual features into the language model's embedding space, and a language model. GPT-4V and Gemini are closed models whose internals are not fully published, but they are commonly described along broadly similar lines.
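The encoder, projection, language model pipeline can be sketched in a few lines. This is a minimal illustration under stated assumptions, not any model's actual implementation: random vectors stand in for the visual encoder's patch features and for the text token embeddings, the projection is a single linear layer (as in the original LLaVA; later versions use a small MLP), and all names and dimensions (`linear_project`, `build_lm_input`, `D_VISION`, `D_MODEL`) are hypothetical. Placing the visual tokens before the text tokens is also a simplification.

```python
import random


def linear_project(features, weight, bias):
    """Map each visual feature vector (length d_vision) to length d_model.

    weight has shape d_model x d_vision (one row per output dimension).
    """
    return [
        [sum(f * w for f, w in zip(vec, row)) + b for row, b in zip(weight, bias)]
        for vec in features
    ]


def build_lm_input(patch_features, text_embeddings, weight, bias):
    """Project visual features and prepend them to the text embeddings.

    The result is one sequence of d_model-sized vectors that a language
    model could consume as if every entry were a token embedding.
    """
    visual_tokens = linear_project(patch_features, weight, bias)
    return visual_tokens + text_embeddings


# Toy dimensions, chosen only for illustration.
D_VISION, D_MODEL = 8, 4
NUM_PATCHES, NUM_TEXT_TOKENS = 6, 3

rng = random.Random(0)

# Stand-in for the visual encoder output: one feature vector per image patch.
patch_features = [
    [rng.gauss(0, 1) for _ in range(D_VISION)] for _ in range(NUM_PATCHES)
]
# Stand-in for the language model's embeddings of the text prompt.
text_embeddings = [
    [rng.gauss(0, 1) for _ in range(D_MODEL)] for _ in range(NUM_TEXT_TOKENS)
]
# Projection parameters (these would be learned; here they are random).
weight = [[rng.gauss(0, 0.1) for _ in range(D_VISION)] for _ in range(D_MODEL)]
bias = [0.0] * D_MODEL

sequence = build_lm_input(patch_features, text_embeddings, weight, bias)
print(len(sequence), len(sequence[0]))  # 9 4
```

The point of the projection is only to make the shapes and spaces line up: after it, the six visual vectors have the same width as the three text embeddings, so the language model sees a single sequence of nine positions.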
