7 minute read

“Waiting 10 seconds for a thoughtful answer is okay. Waiting 10 seconds for a blank screen is broken.”

TL;DR

Streaming transforms the agent UX from “wait and hope” to progressive disclosure. Server-Sent Events (SSE) are the standard for LLM streaming – simpler than WebSockets and friendlier to firewalls and load balancers. The core engineering challenge is leakage control: building a state machine that streams final answer tokens while hiding raw tool calls and intermediate reasoning. In production, edge buffering from Nginx or Cloudflare silently kills streaming unless explicitly disabled. Visual status updates (“Searching Google…”) reduce perceived latency even when actual latency is high. For a deeper look at real-time pipeline architecture including backpressure and the Actor Model, see Real-Time Agent Pipelines.


1. Introduction

Human conversation is streamed: we start processing the first word of a sentence before the speaker finishes the paragraph. Early LLM applications instead waited for the full generation to complete (until the stop token) before sending back a single JSON response.

  • Old Way: Request -> Wait 15s -> Show 500 words. (The user assumes the app crashed.)
  • New Way: Request -> Wait 0.5s -> Show “The”… “quick”… “brown”…

For AI Agents, streaming is harder: the model's output interleaves "Thought Steps" (internal monologue the user shouldn't see) with the "Final Answer" the user should see.


2. Core Concepts: Protocols

How do we push data to the browser?

  1. Short Polling: Client asks “Done?” every 1s. (Inefficient).
  2. WebSockets: Bi-directional, full-duplex TCP. Good for real-time gaming, overkill for chat. Hard to load balance (stateful connections).
  3. Server-Sent Events (SSE): The standard for LLMs.
    • One-way HTTP connection.
    • Server keeps the socket open and pushes data: ... chunks (see the wire format below).
    • Simple to implement behind standard Load Balancers.
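
Concretely, an SSE response is plain text on a kept-open HTTP connection: each event is one or more field lines terminated by a blank line. A sketch of what the agent events from Section 5 look like on the wire:

HTTP/1.1 200 OK
Content-Type: text/event-stream

data: {"type": "status", "content": "Searching Google..."}

data: {"type": "token", "content": "The "}

data: [DONE]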

3. Architecture Patterns: The Stream Transformer

We need an architecture that transforms raw LLM tokens into structured Agent Events.

[LLM (OpenAI)]
 | (Stream of Tokens)
 v
[Agent Parser Logic]
 | Detects: "Action: Search" -> WAIT
 | Detects: "Observation: 42" -> WAIT
 | Detects: "Final Answer: The..." -> STREAM
 v
[Frontend (React)]

The key challenge: Leakage. We don’t want to stream the raw JSON braces of a tool call to the user. We only want to stream the “final_answer”.
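
To make "structured Agent Events" concrete, here is one possible event shape. This is a sketch, not a fixed protocol; the type strings are illustrative and match the JSON payloads used in the examples below.

from dataclasses import dataclass
from enum import Enum

class EventType(str, Enum):
    STATUS = "status"  # hidden tool activity, e.g. "Searching Google..."
    TOKEN = "token"    # one visible final-answer token
    DONE = "done"      # stream finished

@dataclass
class AgentEvent:
    type: EventType
    content: str = ""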


4. Implementation Approaches

4.1 The Generator Pattern (Python)

Python's generators (yield) are a perfect fit: the parser consumes the raw token stream and selectively re-emits structured events. Below is a corrected sketch using the modern OpenAI SDK; the model name and message format are illustrative.

import json
from openai import AsyncOpenAI

client = AsyncOpenAI()

def event(event_type, content):
    # Serialize one structured AgentEvent as a JSON string.
    return json.dumps({"type": event_type, "content": content})

async def agent_stream(prompt):
    # 1. Start LLM Stream (the legacy openai.ChatCompletion API is deprecated)
    stream = await client.chat.completions.create(
        model="gpt-4o",  # illustrative; any chat model with streaming works
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )

    buffer = ""
    in_tool_mode = False

    async for chunk in stream:
        token = chunk.choices[0].delta.content
        if token is None:  # role and finish-reason chunks carry no text
            continue
        buffer += token

        # 2. Detect Tool Usage: once the tag appears, stop forwarding
        # tokens to the user and surface a status event instead.
        if not in_tool_mode and "<tool>" in buffer:
            in_tool_mode = True
            yield event("status", "Thinking...")

        if not in_tool_mode:
            # Caveat: a plain substring check can still leak a partial
            # "<tool" prefix; production parsers hold back a small tail.
            yield event("token", token)

4.2 The Client Consumer (React)

Using fetch with a ReadableStream.

const response = await fetch('/api/agent');
const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = '';

while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  buffer += decoder.decode(value, { stream: true });

  // Network chunks can split an event in half, so only handle complete
  // "data: {...}\n\n" frames and keep the trailing remainder buffered.
  const frames = buffer.split('\n\n');
  buffer = frames.pop(); // possibly-partial final frame
  frames.forEach(handleSSE);
}

5. Code Examples: FastAPI with SSE

Here is a robust backend implementation emitting the strict SSE wire format.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import json
import asyncio

app = FastAPI()

async def event_generator():
    """
    Yields data in strict SSE format:
        data: {"key": "value"}\n\n
    """
    # Phase 1: Planning
    yield f"data: {json.dumps({'type': 'status', 'content': 'Searching Google...'})}\n\n"
    await asyncio.sleep(1)  # Fake tool latency

    # Phase 2: Streaming Answer
    sentence = "The speed of light is 299,792 km/s."
    for word in sentence.split():
        yield f"data: {json.dumps({'type': 'token', 'content': word + ' '})}\n\n"
        await asyncio.sleep(0.1)

    # Phase 3: Done
    yield "data: [DONE]\n\n"

@app.get("/stream")
async def stream_endpoint():
    return StreamingResponse(event_generator(), media_type="text/event-stream")
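
To sanity-check the stream end-to-end, here is a minimal Python consumer. It is a sketch assuming the httpx library and the app running on localhost:8000.

import asyncio
import json

import httpx

async def consume():
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("GET", "http://localhost:8000/stream") as resp:
            async for line in resp.aiter_lines():
                if not line.startswith("data: "):
                    continue  # skip blank separators between frames
                payload = line[len("data: "):]
                if payload == "[DONE]":
                    break
                print(json.loads(payload))  # {'type': 'status'|'token', ...}

asyncio.run(consume())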

6. Production Considerations

6.1 Buffering at the Edge

Nginx, Cloudflare, and AWS ALB love to buffer responses, for example to compress them (gzip) more efficiently. They may wait until 1KB of data has accumulated before forwarding anything to the user, which turns your token stream back into one delayed blob. Fixes (see the sketch after this list):

  • Set header X-Accel-Buffering: no (Nginx).
  • Set header Cache-Control: no-cache.
  • Disable Gzip for /stream endpoints.
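
The first two fixes can be applied from the application side as response headers. A minimal sketch extending the FastAPI endpoint above:

from fastapi.responses import StreamingResponse

@app.get("/stream")
async def stream_endpoint():
    return StreamingResponse(
        event_generator(),
        media_type="text/event-stream",
        headers={
            "X-Accel-Buffering": "no",    # tell Nginx not to buffer this response
            "Cache-Control": "no-cache",  # prevent intermediary caching
        },
    )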

6.2 Timeouts

Many proxies and load balancers close idle HTTP connections after 60s by default, while an agent might take 2 minutes between visible events. Fix: Configure your Load Balancer's idle_timeout to 300s+ for streaming paths, and keep the connection warm with SSE heartbeats (see the sketch below).
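
A common companion fix is a heartbeat wrapper: if the underlying generator stays silent too long, emit an SSE comment line, which compliant clients ignore. A minimal sketch; wrap event_generator() from Section 5 with it before passing it to StreamingResponse:

import asyncio

async def with_heartbeat(source, interval=15.0):
    # Wrap an async generator of SSE strings so idle-timeout proxies
    # always see traffic; ":" lines are SSE comments, invisible to clients.
    it = source.__aiter__()
    pending = asyncio.ensure_future(it.__anext__())
    while True:
        done, _ = await asyncio.wait({pending}, timeout=interval)
        if not done:
            yield ": ping\n\n"  # heartbeat; EventSource silently drops it
            continue
        try:
            yield pending.result()
        except StopAsyncIteration:
            break
        pending = asyncio.ensure_future(it.__anext__())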


7. Common Pitfalls

  1. JSON Truncation: Trying to parse a partial JSON object with json.loads() while it is still streaming.
    • Fix: Use a streaming JSON parser (such as the json-stream library) or only parse complete, delimiter-terminated frames (see the sketch after this list).
  2. Flash of Unstyled Content: Streaming tokens causes the layout to shift violently (Cumulative Layout Shift).
    • Fix: Set a minimum height for the chat container.
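
Here is a minimal sketch of the frame-buffering fix for pitfall 1, assuming the data: ... SSE format from Section 5:

import json

def parse_complete_frames(buffer, chunk):
    # Never json.loads() a partial payload: accumulate raw text, split on
    # the "\n\n" frame terminator, and parse only the complete frames.
    buffer += chunk
    events = []
    while "\n\n" in buffer:
        frame, buffer = buffer.split("\n\n", 1)
        for line in frame.splitlines():
            if line.startswith("data: ") and line != "data: [DONE]":
                events.append(json.loads(line[len("data: "):]))
    return events, buffer  # remainder stays buffered for the next chunk

# Usage: events, buffer = parse_complete_frames(buffer, incoming_chunk)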

8. Best Practices: “Skeleton Loaders” for Thoughts

Don’t just stream text. Stream Status Updates. Users love to see the “Brain”:

  • [Status: Reading PDF...]
  • [Status: calculating...]
  • [Text: The answer is 42]

This “Transparency” reduces perceived latency.
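
On the backend, this can be as simple as bracketing every tool call with status events. A sketch; emit and the tool awaitable are placeholders for your own plumbing:

async def call_tool_with_status(emit, name, tool):
    # emit: async callback that enqueues an SSE event for the frontend
    # tool: any awaitable that does the real work (e.g. a PDF reader)
    await emit({"type": "status", "content": f"{name}..."})
    result = await tool
    await emit({"type": "status", "content": f"Finished {name}"})
    return result  # observations stay internal; only statuses reach the user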


9. Connections to Other Topics

This connects to Speech Model Export from the Speech series.

  • Both deal with Streaming Latency.
  • In Speech, we show partial words (he -> hel -> hello).
  • In Agents, we show partial thoughts.
  • The UX challenge is identical: “Stability vs Speed”.

10. Real-World Examples

  • Perplexity.ai: The gold standard. They show the “Sources” appearing one by one, then the answer streams.
  • Vercel AI SDK: A library that standardizes the “Stream Data Protocol”, making it easy to hook Next.js to OpenAI streams.

11. Future Directions

  • Generative UI: Streaming not just text but React Components. The agent streams <WeatherWidget temp="72" /> and the browser renders the widget instantly.
  • Duplex Speech: Streaming Audio In -> Streaming Audio Out (OpenAI GPT-4o), with no text intermediate.

12. Key Takeaways

  1. SSE > WebSockets: For 99% of Chat Agents, SSE is simpler and friendlier to firewalls.
  2. Edge Buffering is the Enemy: If streaming isn’t working, check your Nginx config.
  3. Perceived Latency Matters: Aim for < 200ms TTFT (Time To First Token).
  4. Leakage Control: Build a state machine to hide raw “Tool Calls” from the end user.

Next in the series: Dependency Graphs for Agents – how to model agent plans as DAGs for parallel execution.


FAQ

Should I use WebSockets or SSE for streaming AI agents?

For most chat-style AI agents, Server-Sent Events (SSE) is the better choice. SSE is simpler to implement, works with standard load balancers, and is firewall-friendly. WebSockets are bidirectional and better suited for real-time gaming or collaborative editing, but they add unnecessary complexity for one-way LLM token streaming.

What is leakage control in streaming agents?

Leakage control prevents raw internal agent data – like JSON tool call syntax, intermediate reasoning steps, or observation results – from being streamed directly to the user. You build a state machine that detects tool usage in the token stream and only forwards final answer tokens to the frontend.

Why does my SSE streaming not work in production?

Edge proxies like Nginx, Cloudflare, and AWS ALB buffer responses by default, waiting until 1KB+ of data accumulates before sending. Fix this by setting X-Accel-Buffering: no for Nginx, Cache-Control: no-cache headers, and disabling gzip on streaming endpoints.

What is Time to First Token (TTFT) and why does it matter?

TTFT is the delay between the user sending a request and seeing the first token of the response. For streaming agents, aim for under 200ms TTFT. Users perceive blank screens as broken – even a brief status update like “Thinking…” makes the experience feel responsive while the agent processes.


Originally published at: arunbaby.com/ai-agents/0048-streaming-real-time-agents

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch