5 minute read

“Giving eyes to the brain: How Agents see the world.”

1. Introduction: The Gap Between Text and Light

For the first 20 days of this journey, our agents have been blind. They lived in a text-only universe. They could read “The button is red,” but they could not see a red button.

In the real world, information is visual. A chart in a PDF, a “Submit” button on a React website, a blinking LED on a server rack. To build truly autonomous agents that can operate computers or robots, we must bridge the gap between Pixels (Light) and Tokens (Meaning).

This transition is not just about attaching a camera. It requires a fundamental shift in how we process data. Text is discrete (Token 1, Token 2). Images are continuous and massive (1920x1080 pixels x 3 channels ~ 6 million integers). A naive LLM cannot digest 6 million numbers. We need a way to Compress visual reality into semantic vectors.

In this post, we will build the foundation of Vision Agents: The shift from CNNs to Vision Transformers (ViT) and the revolution of CLIP (Contrastive Language-Image Pre-training).


2. From Pixels to Patches: The Architecture

2.1 The Old World: Convolutional Neural Networks (CNNs)

For a decade (2012-2021), Computer Vision was dominated by CNNs (ResNet, EfficientNet).

  • Mechanism: A sliding window (kernel) scans the image, detecting edges, then textures, then shapes, then objects.
  • Limitation: CNNs are Local. A pixel at the top-left only “knows” about its neighbors. It takes many layers for the top-left to influence the bottom-right, which made it hard to capture long-range relationships (“The cat is looking at the dog”).

2.2 The New World: Vision Transformers (ViT)

In 2020, Google changed the game. They asked: “Can we treat an image like a sentence?”

The Process:

  1. Patching: Take an image (e.g., 224x224). Cut it into a grid of squares (16x16 pixels).
    • $224 / 16 = 14$. So we get a $14 \times 14$ grid = 196 patches.
  2. Flattening: Each 16x16 patch is just a bag of pixels (256 pixels x 3 colors = 768 numbers). We flatten this into a 1D vector.
  3. Projection: We project this vector into an embedding space.
    • Analogy: This patch is now a “Word”.
  4. Transformers: We feed these 196 “Visual Words” into a standard Transformer Encoder (Self-Attention).
    • Result: Every patch attends to every other patch. The top-left patch can influence the bottom-right patch from the very first layer.

Why this matters for Agents: ViTs let us fuse Vision and Text into a Single Embedding Space. To the transformer, an image patch and a text token look the same: each is just an embedding vector.
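
To make the patch arithmetic concrete, here is a minimal PyTorch sketch of steps 1–4 with random, untrained weights. Real ViTs also prepend a [CLS] token and add position embeddings, which are omitted here.

```python
import torch
import torch.nn as nn

image = torch.rand(1, 3, 224, 224)        # (batch, channels, height, width)
patch_size, embed_dim = 16, 768           # 16x16 patches, ViT-Base hidden size

# 1. Patching + 2. Flattening: 14 x 14 = 196 patches,
#    each flattened to 16 * 16 * 3 = 768 raw pixel values.
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, 196, -1)   # (1, 196, 768)

# 3. Projection: a learned linear layer maps each flattened patch into the
#    transformer's embedding space -- each patch becomes a "visual word".
projection = nn.Linear(patch_size * patch_size * 3, embed_dim)
visual_tokens = projection(patches)       # (1, 196, 768)

# 4. Transformer: one standard encoder layer; every patch attends to every other patch.
encoder = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=12, batch_first=True)
print(encoder(visual_tokens).shape)       # torch.Size([1, 196, 768])
```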


3. CLIP: The Rosetta Stone

The most important model for Vision Agents isn’t an object detector; it’s CLIP (Contrastive Language-Image Pre-training) by OpenAI.

3.1 The Problem

Before CLIP, models were trained on specific labels.

  • “Is this a cat?” -> Yes/No.
  • “Is this a ‘Submit’ button?” -> I don’t know, I was only trained on cats.

This “Closed Set” vocabulary meant agents couldn’t handle the open web.

3.2 The Solution

OpenAI scraped 400 million (Image, Text) pairs from the internet. They trained two encoders:

  1. Image Encoder (ViT): Turns image -> Vector.
  2. Text Encoder (Transformer): Turns text -> Vector.

The Objective: Maximize the cosine similarity between the correct pair (Dog Image, “Photo of a dog”) and minimize it for incorrect pairs (Dog Image, “Photo of a cat”).
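
In code, the objective looks roughly like this. This is a minimal sketch that assumes we already have a batch of image and text vectors from the two encoders; the real CLIP uses a learned temperature, fixed here for illustration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss for a batch where row i of image_vecs
    and row i of text_vecs come from the same (image, caption) pair."""
    # Normalize so the dot product equals cosine similarity.
    image_vecs = F.normalize(image_vecs, dim=-1)
    text_vecs = F.normalize(text_vecs, dim=-1)

    # (batch, batch) similarity matrix: the diagonal holds the correct pairs.
    logits = image_vecs @ text_vecs.t() / temperature

    # Each image must "pick out" its own caption, and vice versa.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_images = F.cross_entropy(logits, targets)       # image -> text
    loss_texts = F.cross_entropy(logits.t(), targets)    # text -> image
    return (loss_images + loss_texts) / 2

# Example with random stand-in embeddings (batch of 8, dimension 512):
print(clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)))
```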

3.3 The Capabilities

This created a Zero-Shot vision system.

  • Agent usage: You can show the agent a screenshot and provide three text options: ["Login Button", "Signup Link", "Background"].
  • Process: Encode the image crop. Encode the 3 strings. Compute the Cosine Similarity (normalized Dot Product) between the image vector and each text vector, as sketched below.
  • Result: The model knows it’s a “Login Button” even if it has never seen that specific website before.
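
A minimal sketch of that process using OpenAI’s open-source clip package. The model name is the real ViT-B/32 checkpoint, but the crop filename is a placeholder for a UI element cut out of a screenshot beforehand.

```python
import clip                      # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)   # CLIP with a ViT image encoder

# Placeholder crop of one UI element, cut out of a screenshot beforehand.
crop = preprocess(Image.open("button_crop.png")).unsqueeze(0).to(device)
labels = ["Login Button", "Signup Link", "Background"]
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    image_vec = model.encode_image(crop)                   # (1, 512)
    text_vecs = model.encode_text(text)                    # (3, 512)

    # Normalize so the dot product is cosine similarity.
    image_vec = image_vec / image_vec.norm(dim=-1, keepdim=True)
    text_vecs = text_vecs / text_vecs.norm(dim=-1, keepdim=True)
    scores = (image_vec @ text_vecs.T).squeeze(0)

print(labels[scores.argmax().item()])                      # e.g. "Login Button"
```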

CLIP gives agents “Common Sense” vision.


4. Visual Prompting and Grounding

Having an embedding is nice, but agents need to Act. They need coordinates. “Click at (x=50, y=200)”. Standard CLIP doesn’t give coordinates; it just gives a global “vibe” of the image.

4.1 Grounding (Object Detection)

Grounding is the task of mapping a noun (“Cat”) to a Bounding Box [x1, y1, x2, y2]. Models like Grounding DINO or OWL-ViT are Open-Vocabulary Detectors.

  • Prompt: “Find all ‘Submit Buttons’ in this image.”
  • Output: [[10, 10, 50, 30], [200, 200, 250, 230]].
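
As a rough sketch, here is how such a query could look with the OWL-ViT checkpoint on Hugging Face transformers. The screenshot path and text prompt are placeholders, and the name of the post-processing helper can differ between transformers versions.

```python
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("screenshot.png").convert("RGB")   # placeholder screenshot
texts = [["a submit button"]]                         # free-form text query

inputs = processor(text=texts, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into boxes in pixel coordinates of the original image.
target_sizes = torch.tensor([image.size[::-1]])       # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.1, target_sizes=target_sizes
)[0]

for box, score in zip(results["boxes"], results["scores"]):
    print([round(v) for v in box.tolist()], round(score.item(), 2))
    # e.g. [10, 10, 50, 30] 0.42
```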

4.2 Visual Prompting (SAM - Segment Anything Model)

Sometimes boxes aren’t enough. We need the exact pixels (e.g., to select a specific region in Photoshop). SAM takes a visual prompt, either a “Point Prompt” (a click) or a bounding box, and outputs a “Mask” (the exact outline of the object).

Agent Workflow:

  1. User: “Remove the background from the car.”
  2. Agent (Grounding DINO): Finds bounding box of “car”.
  3. Agent (SAM): Uses the box to segment the pixels of the car.
  4. Agent (Image Tool): Deletes pixels outside the mask.
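
Here is a sketch of steps 3 and 4 using the official segment_anything package, assuming the bounding box already came from the grounding step. The checkpoint path, image path, and box coordinates are placeholders.

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# Load SAM (the checkpoint file is downloaded separately; the path is a placeholder).
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")
predictor = SamPredictor(sam)

image = np.array(Image.open("car_photo.png").convert("RGB"))  # placeholder image
predictor.set_image(image)

# Step 3: the box produced by the grounding step, in [x1, y1, x2, y2] pixel coordinates.
box = np.array([120, 80, 620, 400])
masks, _, _ = predictor.predict(box=box[None, :], multimask_output=False)
mask = masks[0]                                               # boolean array (H, W)

# Step 4: delete the pixels outside the mask by writing them as transparent.
alpha = (mask * 255).astype(np.uint8)
Image.fromarray(np.dstack([image, alpha])).save("car_no_background.png")
```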

5. Challenges in Vision

Vision is heavy.

  1. Resolution limits: Most models resize images to 224x224 or 512x512.
    • Issue: UI elements on a website are often tiny. A 12px “X” close button disappears when you shrink a 1920x1080 screenshot to 224x224.
    • Solution: Tiling. Cut the screenshot into 9 tiles, process each tile, and stitch the results back together (see the sketch after this list).
  2. Latency: Processing an image takes 10x more compute than text. Real-time vision agents (e.g., playing a video game) require massive GPUs.
  3. Hallucination: Vision models can see things that aren’t there, especially when reading text (OCR errors).
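
A minimal sketch of the tiling idea for a 1920x1080 screenshot cut into a 3x3 grid. Each tile would then be run through the vision model separately, and any detected coordinates shifted back by the tile’s offset. The screenshot filename is a placeholder.

```python
from PIL import Image

def tile_screenshot(path: str, rows: int = 3, cols: int = 3):
    """Cut a screenshot into rows x cols tiles, keeping each tile's pixel offset."""
    screenshot = Image.open(path)
    width, height = screenshot.size
    tile_w, tile_h = width // cols, height // rows

    tiles = []
    for row in range(rows):
        for col in range(cols):
            left, top = col * tile_w, row * tile_h
            tile = screenshot.crop((left, top, left + tile_w, top + tile_h))
            # Keep the offset so a detection at (x, y) inside the tile maps
            # back to (x + left, y + top) on the full screenshot.
            tiles.append((tile, (left, top)))
    return tiles

tiles = tile_screenshot("screenshot.png")   # 9 tiles for the default 3x3 grid
```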

6. Summary

To build a Vision Agent, we don’t train a CNN from scratch. We compose pre-trained foundational blocks:

  • ViT: To process patches.
  • CLIP: To understand semantics (Zero-Shot).
  • Grounding DINO: To find coordinates.
  • SAM: To find precise shapes.

This is the “Hardware” of the visual brain. In the next post, we will look at how Multimodal LLMs (GPT-4V, Gemini Pro Vision) integrate this directly into the reasoning engine.