Multimodal agents: vision, text and action
A text-only agent can reason and call APIs — but it cannot see. A multimodal agent adds a visual perception layer: it can look at a screenshot, a diagram, a product image, or a live camera feed, and use that information to decide its next action. This is the foundation of computer-use agents, document understanding systems, and robots that navigate physical environments.
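The perceive-then-decide flow described above can be sketched in a few lines. This is a toy illustration only, not a real agent: `describe_image` and `decide` are hypothetical stand-ins for a vision model and a policy, and the `Observation` and `Action` types are assumptions made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    text: str     # the instruction or textual context
    image: bytes  # raw bytes of a screenshot, photo, or camera frame


@dataclass
class Action:
    name: str
    argument: str = ""


def describe_image(image: bytes) -> str:
    """Hypothetical perception layer: turn an image into a short description.

    A real system would call a vision-language model here; this stub just
    looks for a marker in the bytes so the example runs on its own.
    """
    return "login form visible" if b"login" in image else "unknown screen"


def decide(instruction: str, caption: str) -> Action:
    """Hypothetical policy: choose the next action from text plus visual summary."""
    if "log in" in instruction and "login form" in caption:
        return Action("click", "login_button")
    return Action("wait")


def step(obs: Observation) -> Action:
    """One agent step: perceive the image, then decide using both modalities."""
    caption = describe_image(obs.image)
    return decide(obs.text, caption)


if __name__ == "__main__":
    print(step(Observation("log in to the site", b"<login page>")))
```

The point of the split is that the text-only part (`decide`) stays unchanged when the perception layer is swapped, for example from screenshots to camera frames.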