
“Building a single-agent chatbot is a logic problem. Building a multi-agent, multi-modal system that orchestrates across Voice, Video, SMS, and Email is a distributed systems problem where the failure modes are non-deterministic and the success metrics are subjective.”

TL;DR

Evaluating multi-agent, multi-modal AI systems demands a fundamentally different approach than single-agent accuracy metrics. The core requirement is unified traceability – every interaction across voice, video, SMS, and email must share a single OpenTelemetry Trace ID. Modality-specific judges handle voice latency (TTFT under 600ms), visual grounding, and persona-channel consistency. Agent-to-agent handoffs are verified through entity loss rate and semantic drift scoring. Shadow mode pipelines run parallel evaluations on production traffic without impacting users. For foundational evaluation concepts, see Agent Evaluation Frameworks and Agent Benchmarking Deep Dive.

[Figure: A broadcast monitoring wall with four screens, each displaying a different signal type]

1. Introduction: The Observability Crisis in Agentic Systems

In 2025, the “Agentic” paradigm shift reached its peak complexity. We moved from isolated RAG pipelines to Multi-Agent Systems (MAS) where a “Manager Agent” might delegate a task to a “Voice Agent” for an outbound call, which then hands off a summary to an “Email Agent,” while a “Video Agent” simultaneously monitors the user’s reaction over a live video feed.

The traditional “Accuracy” metric is dead for these systems. If the Voice Agent has a 200ms lag, or the Video Agent misinterprets a micro-expression, the entire chain of reasoning collapses. To build these systems at scale, you need more than better prompts; you need a Traceable Evaluation Architecture.

This guide outlines the blueprint for creating a complete internal evaluation setup that provides 360-degree observability and traceability for multi-modal, multi-channel agents.


2. The Unified Traceability Architecture

The first rule of MAS evaluation: Unified Span Correlation. Every interaction, regardless of the channel (Phone, Video, SMS), must share a single Trace ID.

2.1 The Multi-Modal Span Protocol

Using OpenTelemetry (OTEL) as the foundation, we extend the standard span to include modality-specific metadata.

Architecture Diagram: Multi-Agent Handoff & State Synchronization

      [ USER INPUT ]
            |
     +------v-------+
     | Orchestrator | <--- [ TraceID: 8ac2 ] ---+
     +------+-------+                           |
            |                                   |
      +-----+-------+------------+              |   [ Global State Store ]
      |             |            |              |   (Redis / Postgres)
[ Voice Agt ] [ Video Agt ] [ Data Agt ]        |
      |             |            |              |
      +----(Handoff Spans: 8ac2-01, 8ac2-02)----+
  • Traceability: Every handoff (e.g., from Voice to Email) creates a “Handoff Span” that logs the transfer of state and the reasoning behind the transition.
  • Modality Injection: We inject multimodal data (video frames, audio snippets) directly into the trace metadata, or as linked references to content-addressable blob storage such as S3 (see the instrumentation sketch below).
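
As a concrete illustration, here is a minimal instrumentation sketch using the OpenTelemetry Python SDK. The attribute keys (app.modality, app.blob_ref, app.handoff.*) are conventions of this setup, not an OTEL semantic standard:

# Minimal modality-aware span instrumentation (OpenTelemetry Python SDK).
# The "app.*" attribute keys are illustrative conventions, not OTEL standards.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("multi_modal_evals")

def handle_voice_turn(audio_uri: str) -> None:
    # Root span for the voice turn; every child inherits the same Trace ID.
    with tracer.start_as_current_span("voice_agent.turn") as span:
        span.set_attribute("app.modality", "audio")
        # Heavy payloads (audio, frames) live in blob storage and are
        # referenced by URI instead of being inlined into the span.
        span.set_attribute("app.blob_ref", audio_uri)

        # Handoff span: records the state transfer and the reason for it.
        with tracer.start_as_current_span("handoff.voice_to_email") as handoff:
            handoff.set_attribute("app.handoff.target_agent", "email_agent")
            handoff.set_attribute("app.handoff.reason",
                                  "user requested email confirmation")

handle_voice_turn("s3://traces/audio/8ac2-turn-01.wav")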

2.2 Instrumenting Diverse Channels

Each channel requires a specific shim to bridge the physical world to the OTEL trace:

  • Phone (SIP/RTP): Instrumentation at the Media Server level (e.g., FreeSWITCH/Asterisk) to log “Silence-to-Speech” deltas.
  • Video (WebRTC): Client-side probes that send “Frame Arrival” events vs. “Agent Processing” events via DataChannels.
  • Async (Webhook): Durable execution IDs passed in metadata headers so that a response arriving two hours later is still part of the original trace (see the propagation sketch below).
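
For the async channel, the W3C trace context can ride inside the webhook payload itself, so a callback that arrives hours later still rejoins the original trace. A sketch using OpenTelemetry’s propagation API (the otel_context field name is our own convention):

# Async trace continuity: serialize the traceparent into the outbound
# payload, then re-extract it when the delayed callback arrives.
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("webhook_adapter")

def send_async_request(payload: dict) -> dict:
    # inject() writes traceparent/tracestate headers into the carrier dict.
    carrier: dict = {}
    inject(carrier)
    payload["otel_context"] = carrier
    return payload  # hand off to the email/SMS provider here

def on_webhook_callback(payload: dict) -> None:
    # extract() rebuilds the original context, so this span is parented
    # to the trace that started hours earlier.
    ctx = extract(payload.get("otel_context", {}))
    with tracer.start_as_current_span("email_agent.delivery_callback",
                                      context=ctx) as span:
        span.set_attribute("app.modality", "api")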

3. Evaluation Tools: The Landscape

Choosing the right tool depends on whether you value high-level dashboards or raw, low-level trace control.

| Tool | Best For | Pros | Cons |
|------|----------|------|------|
| LangSmith | Prototyping & RAG | Native LangChain integration; excellent UI; great manual annotation tools | Can be expensive at high volume; proprietary lock-in |
| Arize Phoenix | Open-source OTEL | 100% OTEL compliant; runs locally or in K8s; great for cluster-based analysis | Smaller community than LangSmith; steeper learning curve for the UI |
| Weights & Biases (W&B) | LLM fine-tuning + evals | Industry standard for ML experiments; great visualization for multimodal data | Less focus on real-time agent tracing than the others |
| Langfuse | Production tracing | Lightweight; open source; great for cost/token tracking across many agents | Fewer built-in multimodal visualization modules |
| Custom (ClickHouse) | High scale | Infinite flexibility; lowest cost at scale; full control over SQL-based evals | Significant engineering overhead to build and maintain |

4. Modality-Specific Evaluation Suites

A complete setup requires specialized “Judges” for each interaction channel.

4.1 The Voice & Telephony Suite (The Latency War)

For phone-based agents (e.g., VAPI/Retell), the perceived quality (the “vibes”) is dictated by latency and naturalness.

  • Audio TTFT (Time To First Token): Measured from user silence to first audio buffer. Target: < 600ms.
  • Double-Talk Handling: How does the agent react when the user speaks simultaneously? We measure the Back-off Latency.

Architecture: Voice Evaluator Pipeline

[ Audio Source ] -> [ STT Engine ] -> [ LLM Reasoning ] -> [ TTS Engine ]
       |                  |                |                  |
       +------(Span)------+------(Span)----+-------(Span)-----+
                          |
                  [ Latency Auditor ]
                  (Measures gap between each span)
  • STT Sensitivity (Word Error Rate): Tested against background noise (Street, Coffee Shop, Car) using stress-test audio files.
  • Channel-Specific Stressors: Simulating packet loss (Jitter) and high-latency mobile networks to see if the agent “breaks” or degrades gracefully.
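
A minimal sketch of the Latency Auditor above, assuming each stage’s span is exported with start/end timestamps in milliseconds (the SpanTiming shape is an assumed projection of the trace):

# Latency auditor: compute audio TTFT and per-stage gaps for one turn.
from dataclasses import dataclass

@dataclass
class SpanTiming:
    name: str
    start_ms: float
    end_ms: float

def audit_turn(user_silence_ms: float, spans: list[SpanTiming]) -> dict:
    spans = sorted(spans, key=lambda s: s.start_ms)
    # TTFT: from the user falling silent to the first audio buffer,
    # approximated here as the start of the TTS span.
    tts = next(s for s in spans if s.name == "tts")
    ttft_ms = tts.start_ms - user_silence_ms
    # Gaps between consecutive stages expose queueing/network overhead.
    gaps = {f"{a.name}->{b.name}": b.start_ms - a.end_ms
            for a, b in zip(spans, spans[1:])}
    return {"ttft_ms": ttft_ms, "ttft_ok": ttft_ms < 600.0, "stage_gaps_ms": gaps}

turn = [SpanTiming("stt", 100.0, 280.0),
        SpanTiming("llm", 300.0, 520.0),
        SpanTiming("tts", 530.0, 900.0)]
print(audit_turn(user_silence_ms=80.0, spans=turn))  # ttft_ms=450.0, ok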

4.2 The Video & Avatar Suite (The Grounding War)

  • Visual Grounding: Does the agent’s $(x,y)$ coordinate selection match the visual scene?
  • Gaze & Lip-Sync Metric: Using computer vision models to score the sync between the generated audio and the avatar’s lip movements.
  • Temporal Consistency: Does the agent remember what happened 10 frames ago?
  • Emotional Alignment: Using a “Vision Judge” to compare user facial sentiment with the Agent’s response tone.
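
For the grounding check in particular, a deterministic scorer is often enough before reaching for a VLM judge. A sketch assuming gold labels store bounding boxes as (x_min, y_min, x_max, y_max) tuples, as in video_gold.jsonl:

# Visual grounding check: does the predicted (x, y) selection fall
# inside the hand-labeled bounding box for that frame?
def grounding_hit(pred_xy: tuple[float, float],
                  gold_box: tuple[float, float, float, float]) -> bool:
    x, y = pred_xy
    x_min, y_min, x_max, y_max = gold_box
    return x_min <= x <= x_max and y_min <= y <= y_max

def grounding_accuracy(preds, gold_boxes) -> float:
    hits = sum(grounding_hit(p, g) for p, g in zip(preds, gold_boxes))
    return hits / len(preds) if preds else 0.0

# Two frames, one correct selection -> 0.5
preds = [(120.0, 340.0), (50.0, 60.0)]
boxes = [(100.0, 300.0, 200.0, 400.0), (400.0, 400.0, 500.0, 500.0)]
print(grounding_accuracy(preds, boxes))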

4.3 Async & Multi-Channel Agents (Email, SMS, Slack)

These agents must handle long-latency responses where the “State” might have changed in the real world.

  • State Drift Check: Before the Email Agent sends a 2-hour-later follow-up, does it check if the Voice Agent already resolved the issue in between?
  • Persona-Channel Consistency: Scoring if the email maintains a professional brand voice while the SMS remains concise and action-oriented.
  • Rate-Limit Awareness: Does the SMS Agent handle provider-level throttling without crashing the orchestration loop?
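
A sketch of the State Drift Check, assuming the Global State Store from Section 2 is Redis and each conversation thread exposes a status key (the key layout is an assumed convention):

# Before the Email Agent sends a delayed follow-up, consult the shared
# state store: another agent may have already resolved the issue.
import redis

store = redis.Redis(host="localhost", port=6379, decode_responses=True)

def should_send_followup(thread_id: str) -> bool:
    status = store.get(f"thread:{thread_id}:status")
    # A stale follow-up is suppressed (or rewritten) if the Voice Agent
    # already closed the loop in the meantime.
    return status not in ("resolved", "cancelled")

if should_send_followup("8ac2"):
    pass  # proceed with the SMTP send, attached to the original Trace ID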

4.4 The Multi-Channel Trace Journey

To visualize how evals track a request across modalities, consider the following handoff logic that the evaluation engine must monitor.

Architecture Diagram: Multi-Channel Handoff & State Synchronization

      [ USER INPUT (Phone) ]
           |
     +-----v-----+
     | Voice Agt | <--- [ Span 1: STT+LLM+TTS ]
     +-----+-----+
           |
    { Context Upgrade } --- (User wants confirmation via Email)
           |
     +-----v-----+
     | Email Agt | <--- [ Span 2: SMTP+Tracking ]
     +-----+-----+
           |
    [ Shadow Agent ] <--- [ Span 3: Consistency Check ]

5. Multi-Agent Orchestration Metrics

When multiple agents collaborate, evaluate the Process, not just the Outcome.

5.1 The “Agent-to-Agent Contract” Verification

When Agent A (Voice) sends a summary to Agent B (Email), we must verify the integrity of the data transfer.

  • Entity Loss Rate: Measuring how many entities (Date, Time, Location, Intent) were dropped during the summarization/sharding phase.
  • Semantic Drift Score: Using a cross-encoder to score the semantic similarity between the “State at Agent A” and the “State at Agent B.”
  • Instruction Adherence during Handoff: Ensuring that Agent B follows the specific formatting or tone constraints provided by Agent A.
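
A sketch of the first two checks, using a sentence-transformers cross-encoder for the drift score (the checkpoint name is one example; any STS cross-encoder applies):

# Handoff contract check: entity loss rate plus a cross-encoder
# similarity score between pre- and post-handoff state.
from sentence_transformers import CrossEncoder

def entity_loss_rate(source_entities: set[str], received_text: str) -> float:
    dropped = [e for e in source_entities if e.lower() not in received_text.lower()]
    return len(dropped) / len(source_entities) if source_entities else 0.0

scorer = CrossEncoder("cross-encoder/stsb-roberta-base")

def semantic_drift(state_a: str, state_b: str) -> float:
    # STS cross-encoders return a similarity score; lower means more drift.
    return float(scorer.predict([(state_a, state_b)])[0])

state_a = "Appointment 2025-03-14 at 10:00, Clinic B. Allergic to penicillin."
state_b = "Appointment 2025-03-14 at 10:00 at Clinic B."
print(entity_loss_rate({"2025-03-14", "10:00", "Clinic B", "penicillin"}, state_b))  # 0.25
print(semantic_drift(state_a, state_b))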

5.2 The “Agent Conflict” Score & Deadlock Detection

In multi-agent systems, agents often “disagree” or enter infinite reasoning loops.

  • Re-prompting Loops: How many times did Agent A ask Agent B for the same tool output?
  • Reasoning Circularity: Detecting loops such as Agent A $\rightarrow$ Agent B $\rightarrow$ Agent C $\rightarrow$ Agent A.
  • Tool Contention: Measuring latency caused by multi-agent locking on external APIs or database rows.
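
Two cheap heuristics for these failure modes, assuming the trace can be projected into an ordered list of handoffs and (agent, tool) request pairs:

# Deadlock heuristics: repeated identical tool requests and short-window
# handoff cycles (A -> B -> C -> A) flagged directly from the trace.
from collections import Counter

def repeated_requests(tool_calls: list[tuple[str, str]], threshold: int = 2) -> list:
    # tool_calls: (requesting_agent, tool_signature) pairs from the trace.
    counts = Counter(tool_calls)
    return [call for call, n in counts.items() if n > threshold]

def has_cycle(handoffs: list[str], window: int = 4) -> bool:
    # An agent revisited within a short handoff window is a cheap proxy
    # for reasoning circularity.
    for i, agent in enumerate(handoffs):
        if agent in handoffs[i + 1 : i + window]:
            return True
    return False

print(has_cycle(["voice", "email", "data", "voice"]))  # True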

6. The “Internal Eval” Infrastructure

6.1 Shadow Mode Pipelines

For every production request (or a sampled slice of traffic, e.g., 10%), the system spawns an asynchronous “Eval Request.”

  1. Production Pass: Agent performs the task for the user.
  2. Observer Pass: An independent “Shadow Agent” with the same context attempts the same task.
  3. Critic Pass: An LLM-as-a-Judge compares the two paths. Any significant divergence triggers a “Reasoning Anomaly” alert.
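
A sketch of this three-pass flow with asyncio, keeping the observer and critic passes off the user’s hot path (production_agent, shadow_agent, and llm_judge are placeholders for your own implementations):

# Shadow mode: the user only ever waits on the production pass; the
# shadow and critic passes run as a fire-and-forget background task.
import asyncio

async def llm_judge(prod_out: str, shadow_out: str) -> float:
    # Return a divergence score in [0, 1]; stubbed for the sketch.
    return 0.0 if prod_out == shadow_out else 0.8

async def handle_request(context: str, production_agent, shadow_agent) -> str:
    prod_out = await production_agent(context)  # 1. Production pass

    async def observer_pass() -> None:
        shadow_out = await shadow_agent(context)            # 2. Observer pass
        divergence = await llm_judge(prod_out, shadow_out)  # 3. Critic pass
        if divergence > 0.5:
            print(f"Reasoning Anomaly: divergence={divergence:.2f}")

    asyncio.create_task(observer_pass())
    return prod_out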

6.2 The Gold Standard Dataset (The “Evaluation Moat”)

You must maintain a set of “Gold Traces”: hand-verified interactions across every modality.

Diagram: The Feedback Loop

[ Live Traffic ] --> [ Traces ] --> [ Anomaly Detector ]
                                          |
                                    [ Human Review ] ---+
                                          |             |
                                    [ Fine-tune Set ] <-+
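
Gold traces then double as a regression gate in CI. A sketch assuming each voice_gold.jsonl record carries input, expected_transcript, and max_ttft_ms fields (an assumed schema):

# Gold-set regression: replay hand-verified inputs and check that both
# the transcript and the latency budget still hold.
import json

def run_gold_regression(gold_path: str, run_agent) -> float:
    passed = total = 0
    with open(gold_path) as f:
        for line in f:
            gold = json.loads(line)
            result = run_agent(gold["input"])  # returns transcript + timing
            ok = (result["transcript"] == gold["expected_transcript"]
                  and result["ttft_ms"] <= gold["max_ttft_ms"])
            passed += ok
            total += 1
    return passed / total if total else 0.0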

7. Storage Schema: The ClickHouse Backbone

To handle the volume of multi-agent traces, use a columnar store with time-based partitioning.

CREATE TABLE agent_traces (
    trace_id String,
    span_id String,
    parent_span_id String,
    agent_name LowCardinality(String),
    modality Enum('text', 'audio', 'video', 'api'),
    latency_ms Float64,
    interaction_json String, -- Includes STT/VLM results
    event_timestamp DateTime64(3, 'UTC')
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_timestamp)
ORDER BY (trace_id, event_timestamp);
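
With this schema in place, evals become plain SQL. A sketch using clickhouse-driver to pull p95 latency per agent and modality over the last day (connection details are placeholders):

# Query the trace table for p95 latency per agent and modality.
from clickhouse_driver import Client

client = Client(host="localhost")

rows = client.execute("""
    SELECT
        agent_name,
        modality,
        quantile(0.95)(latency_ms) AS p95_latency_ms
    FROM agent_traces
    WHERE event_timestamp >= now() - INTERVAL 1 DAY
    GROUP BY agent_name, modality
    ORDER BY p95_latency_ms DESC
""")
for agent, modality, p95 in rows:
    print(f"{agent:<16} {modality:<6} p95={p95:.0f}ms")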

8. Case Study: The “Interconnected Clinic”

Imagine a Healthcare Multi-Agent System where:

  • Phone Agent: Books an appointment.
  • Vision Agent: Recognizes the patient’s ID at the clinic.
  • Medical Agent: Summarizes the vocal conversation for the doctor.
  • SMS Agent: Sends follow-up care instructions.

The Diagnostic Failure: The Phone Agent recorded “Allergic to Penicillin.” The SMS Agent sent instructions involving Penicillin-based antibiotics.

The Evaluation Discovery: The Traceability System flagged that the “Medical Agent” truncated the PII/Allergy section of the context during the handoff to the “SMS Agent.” The State Consistency Metric went from 1.0 to 0.0, alerting the engineering team to a context-window truncation bug in the summarization layer.


9. Summary Checklist for Engineering the Setup

  • Unified Traceability: Do all agents (Voice, Video, SMS) share a TraceID?
  • Tool Selection: Have you picked a platform (LangSmith/Phoenix/Custom) that fits your scale?
  • Proactive Shadowing: Are you running parallel evals on 10% of production traffic?
  • Modality Grounding: Is every Video/Voice response grounded in a verifiable trace?
  • Persona Consistency: Does the agent’s persona stay consistent across SMS and Voice, even as the tone adapts to each channel?
  • Handoff Verification: Is there a “Contract Check” between Agent A and Agent B’s data?
  • Latency Monitoring: Is the “vocal stop to audio start” gap consistently under 600ms?

10. Minimal Repo Structure for a Multi-Modal Eval Setup

To implement this architecture, your repository should be structured to handle the separation of concerns between agent execution, channel adaptation, and the evaluation layer.

multi-modal-agent-evals/
├── src/
│   ├── orchestrator/           # Gateway and routing logic
│   │   ├── main.py             # Entry point for the gateway
│   │   └── router.py           # Logic to delegate tasks to agents
│   ├── agents/                 # Specialized agent implementations
│   │   ├── base.py             # Base agent class with telemetry hooks
│   │   ├── voice_agent.py      # Audio processing and telephony integration
│   │   ├── video_agent.py      # Vision-based interaction logic
│   │   └── notification_agent.py # Email and SMS adapters
│   ├── adapters/               # Bridge between channels and OTEL
│   │   ├── sip_adapter.py      # Telephony (SIP/RTP) latency probes
│   │   ├── webrtc_adapter.py   # Video frame arrival collectors
│   │   └── webhook_adapter.py  # Async delivery status monitors
│   ├── observability/          # Telemetry and logging configuration
│   │   ├── otel_setup.py       # OpenTelemetry SDK / Trace ID propagator
│   │   └── exporter.py         # Custom exporter for ClickHouse/Phoenix
│   ├── evals/                  # Evaluation and judging logic
│   │   ├── judges/             # Specialized judge agents
│   │   │   ├── llm_judge.py    # Reasoning & persona consistency auditor
│   │   │   ├── vlm_judge.py    # Visual grounding/temporal logic critic
│   │   │   └── audio_judge.py  # Jitter & phonation-overlap auditor
│   │   ├── shadow_mode.py      # Asynchronous parallel execution logic
│   │   └── metrics.py          # Handoff Accuracy & State Consistency calc
│   └── state/                  # Shared memory and global context
│       └── global_store.py     # Hierarchical context store (Redis/PG)
└── tests/
    └── gold_sets/              # Hand-verified multi-modal snapshots
        ├── voice_gold.jsonl    # Ground-truth audio transcripts + latency
        └── video_gold.jsonl    # Expected bounding boxes + intent

11. Conclusion: The Glass-Box Standard

The era of “Black-Box” agents is over. To build reliable systems that users trust across multi-modal channels, you must invest as much in the Evaluation Factory as you do in the agents themselves.

By implementing a unified tracing protocol and a modular judge architecture, you turn your non-deterministic agent swarm into a predictable, observable, and continuously improving system.

For the voice-specific architecture patterns that complement this evaluation framework, see Voice Agent Architecture.


FAQ

How do you evaluate multi-agent AI systems across different channels?

Use unified span correlation with OpenTelemetry so every interaction across voice, video, SMS, and email shares a single Trace ID. Each channel gets modality-specific judges: an audio latency auditor measures TTFT and double-talk handling for voice, a VLM judge scores visual grounding and temporal consistency for video, and a persona consistency scorer ensures tone alignment across SMS and email. Handoff verification checks between agents measure entity loss rate and semantic drift to catch context truncation bugs.

What metrics matter for voice AI agent evaluation?

The critical metrics for voice agents are Audio TTFT (Time To First Token) targeting under 600ms from user silence to first audio buffer, double-talk handling measured through back-off latency when the user speaks simultaneously, STT Word Error Rate tested against background noise profiles (street, coffee shop, car), and channel-specific stress tests simulating packet loss (jitter) and high-latency mobile networks to verify graceful degradation rather than hard failure.

What is shadow mode evaluation for AI agents?

Shadow mode runs an independent observer agent in parallel with the production agent, processing the same input context. An LLM-as-a-Judge then compares both execution paths, examining reasoning traces and final outputs. Any significant divergence triggers a “Reasoning Anomaly” alert for human review. This approach enables continuous evaluation on live production traffic without impacting the user experience, and anomalous traces feed back into the gold standard dataset for regression testing.

How do you detect data loss during multi-agent handoffs?

Measure the Entity Loss Rate by counting how many critical entities (dates, times, locations, intents, medical information) are dropped during the summarization phase when one agent hands off state to another. Use a cross-encoder model to compute a Semantic Drift Score between the full state at Agent A and the summarized state received by Agent B. Additionally, instruction adherence scoring verifies that formatting, tone constraints, and domain-specific rules specified by the delegating agent are preserved through the transition.


Originally published at: arunbaby.com/ai-agents/0062-multi-agent-multi-modal-evals

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch