
TL;DR: Meta-Harness (Stanford/MIT/KRAFTON, March 2026, arXiv:2603.28052) automates harness optimization: a Claude Code proposer reads raw execution traces from a structured filesystem, proposes new harness candidates, and iterates. On text classification, 48.6% versus 40.9% for ACE — with 4x fewer tokens. On TerminalBench-2, 76.4% versus 74.7% for the hand-engineered baseline. The deciding factor: raw traces, not summaries. Adding text summaries to the feedback gains 0.3 points over scores alone; raw traces gain 15.

(Cover image: a server rack with select storage bays pulled open and glowing amber amid cool blue sealed bays, representing selective trace inspection in Meta-Harness.)


The field has been solving the wrong problem.

Prompt optimization tools — MIPRO, GEPA, ACE, OpenEvolve — share an architecture: collect a score, write a text summary of what went wrong, feed the summary to an LLM, generate a new prompt. The loop assumes the summary contains enough signal to improve on the next try.

Meta-Harness (Yoonho Lee, Roshen Nair, Qizheng Zhang, Kangwook Lee, Omar Khattab, and Chelsea Finn; Stanford University, KRAFTON, and MIT; March 2026) ran the ablation that tests this assumption directly. Adding text summaries of prior failures to scores-only: +0.3 accuracy points. Adding raw execution traces: +15 points. Summaries don’t help. The diagnostic signal lives in the logs.

That single result reframes every conversation about automated optimization. Not “how do we write better summaries?” but “why are we summarizing at all?”

What is a harness, and why does it matter more than you think?

A harness is the code that wraps your model: prompt templates, retrieval logic, context window management, completion-checking loops, output parsing, tool definitions. When you deploy an LLM-based system, you deploy a harness. Model weights determine what the model can do in principle; the harness determines what it does in practice.
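
A minimal sketch makes the boundary concrete: everything below is harness code, not model. The class name, retry logic, and method names are illustrative, not from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class MinimalHarness:
    """Illustrative harness: everything around the model call is harness code."""
    llm: Callable[[str], str]                       # the model itself, treated as a black box
    labels: list[str] = field(default_factory=list)
    max_retries: int = 2

    def retrieve(self, query: str) -> list[str]:
        # Retrieval policy is harness code; a real harness might use TF-IDF or BM25 here.
        return []

    def build_prompt(self, query: str, examples: list[str]) -> str:
        # Prompt assembly is harness code: templates, retrieved examples, label list.
        header = "Valid labels: " + ", ".join(self.labels)
        shots = "\n".join(examples)
        return f"{header}\n\nExamples:\n{shots}\n\nInput: {query}\nLabel:"

    def check(self, output: str) -> bool:
        # Completion checking is harness code: reject outputs outside the label set.
        return output.strip() in self.labels

    def run(self, query: str) -> str:
        output = ""
        for _ in range(self.max_retries + 1):
            prompt = self.build_prompt(query, self.retrieve(query))
            output = self.llm(prompt)
            if self.check(output):
                break
        return output.strip()
```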

The performance ceiling for harness engineering is real. In Meta-Harness’s TerminalBench-2 experiment, the hand-crafted Terminus-KIRA harness scored 74.7% on 89 agentic tasks. Meta-Harness, starting from scratch and searching over candidate code, reached 76.4% — with the same underlying model. Nearly two points from harness design alone.

Current optimization practice splits into two categories. Fine-tune the model: expensive, slow, you lose base generality. Optimize the prompt: DSPy, MIPROv2, and similar frameworks handle this well for shallow tasks. What nobody automates well is the harness code itself — retrieval policies, routing logic, context packing strategy, orchestration decisions. That’s what Meta-Harness targets.

graph LR
    U[User query] --> H[Harness]
    H --> R[Retrieval policy]
    H --> P[Prompt assembly]
    H --> T[Tool definitions]
    R --> M[LLM]
    P --> M
    T --> M
    M --> C[Completion check]
    C -->|Pass| O[Output]
    C -->|Fail| H

Most optimization tools treat the boxes inside Harness as fixed. Meta-Harness treats them as variables.

How does Meta-Harness search for better harnesses?

Meta-Harness stores three artifacts on disk for every candidate it evaluates: the harness source code (a self-contained Python file), the full execution traces (every input, intermediate output, and tool call), and the evaluation score. A proposer agent — Claude Code running on Claude Opus 4.6 — navigates this filesystem, reads what it needs, and proposes new candidates. The loop runs for N iterations; all candidates are kept, and a Pareto frontier is reported at the end.
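
One way those per-candidate artifacts could be laid out on disk. The paper specifies the three artifact types but not this exact layout; the directory structure and file names here are assumptions.

```python
import json
from pathlib import Path

def record_candidate(root: Path, iteration: int, candidate_id: str,
                     harness_source: str, traces: list[dict], score: float) -> Path:
    """Write the three artifacts kept for every evaluated candidate:
    harness source, full execution traces, and the evaluation score."""
    cdir = root / f"iter_{iteration:03d}" / candidate_id
    cdir.mkdir(parents=True, exist_ok=True)

    (cdir / "harness.py").write_text(harness_source)                # self-contained harness code
    (cdir / "score.json").write_text(json.dumps({"score": score}))  # scalar evaluation result

    traces_dir = cdir / "traces"
    traces_dir.mkdir(exist_ok=True)
    for i, trace in enumerate(traces):  # every input, intermediate output, and tool call
        (traces_dir / f"example_{i:04d}.json").write_text(json.dumps(trace, indent=2))

    return cdir
```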

The proposer reads a median of 82 files per iteration (range 69–99), roughly split between prior source code and execution traces. That’s not a compressed context buffer — it’s an archive the agent navigates with standard shell commands. This creates a context budget roughly 3 orders of magnitude larger than conventional text optimizers:

| Method | Tokens per iteration |
| --- | --- |
| OPRO | 0.002M |
| GEPA | 0.008M |
| TextGrad | 0.015M |
| AlphaEvolve | 0.022M |
| TTT-Discover | 0.026M |
| Meta-Harness | Up to 10.0M |

The filesystem design solves a real constraint. Pass prior runs as prompt context and you hit the context limit fast. Summarize them and you lose information. A filesystem lets the proposer choose what to read — grep for failure patterns, diff two candidates, focus on a single problematic trace — bounded by what it decides to load rather than what fits in a fixed window.
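
Roughly what "choose what to read" means in practice. The real proposer uses shell tools like grep and diff directly; this Python equivalent, with an assumed per-example `correct` field in the trace schema, just shows that the proposer loads a bounded slice of the archive rather than all of it.

```python
import json
from pathlib import Path

def failing_examples(candidate_dir: Path, max_read: int = 20) -> list[dict]:
    """Return only the failing traces for one candidate, up to a cap.

    The proposer decides what to load; it never needs the whole archive
    in context. The 'correct' field is an assumed trace schema.
    """
    failures = []
    for trace_file in sorted((candidate_dir / "traces").glob("example_*.json")):
        trace = json.loads(trace_file.read_text())
        if not trace.get("correct", False):
            failures.append(trace)
        if len(failures) >= max_read:
            break
    return failures
```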

Before evaluation, proposed harnesses pass an interface validation gate that checks for required API signatures, filtering broken proposals before they consume evaluation budget. A typical run evaluates ~60 harnesses over 20 iterations and completes in hours of wall-clock time.
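
The paper doesn't publish the gate's code; a minimal version of an interface check might look like the sketch below. The required entrypoint name (`run`) and the dynamic-import approach are assumptions.

```python
import importlib.util
import inspect
from pathlib import Path

REQUIRED_CALLABLE = "run"  # assumed entrypoint; the paper says only "required API signatures"

def passes_interface_gate(harness_path: Path) -> bool:
    """Reject proposals that cannot be loaded and called, before they
    consume any evaluation budget."""
    spec = importlib.util.spec_from_file_location("candidate", harness_path)
    module = importlib.util.module_from_spec(spec)
    try:
        spec.loader.exec_module(module)  # syntax and import errors fail here
    except Exception:
        return False
    fn = getattr(module, REQUIRED_CALLABLE, None)
    if not callable(fn):
        return False
    # The entrypoint must accept at least one positional argument (the task input).
    return len(inspect.signature(fn).parameters) >= 1
```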

flowchart TD
    A[Initialize filesystem] --> B[Proposer reads\ncode + traces + scores\n~82 files / iteration]
    B --> C[Proposes k new\nharness candidates]
    C --> D{Interface\nvalidation gate}
    D -->|Invalid| C
    D -->|Valid| E[Evaluate on\nsearch set]
    E --> F[Write code + traces\n+ scores to filesystem]
    F --> G{Iteration limit?}
    G -->|No| B
    G -->|Yes| H[Report Pareto frontier]

Why do text summaries fail as optimization feedback?

Text summaries fail because they destroy causal signal. When a harness fails on a batch of examples, the failure has a cause: a retrieval policy that misses boundary cases, a prompt ambiguous for one label class, a completion checker that exits too early. A summary captures the observation — “accuracy was 34%” — but discards the mechanism.

The ablation in the Meta-Harness paper makes this concrete:

| Feedback type | Median accuracy |
| --- | --- |
| Scores only | 34.6% |
| Scores + text summaries | 34.9% |
| Full filesystem with traces | 50.0% |

Adding summaries over scores: +0.3 points. Adding raw traces: +15.4 points. The gain from summaries is statistical noise.

Think of it as a debugging problem. A developer with a full stack trace and intermediate variable states can diagnose a bug. A developer with only the error message is guessing. Meta-Harness gives its optimizer the stack trace. Every other framework gives it the error message — then calls the result an “insight.”

This has broader implications for the field. Automated optimization has long optimized for token efficiency: compress feedback, fit more iterations in context, move faster. Meta-Harness bets the other direction — spend more tokens per iteration, but make each proposal actually informed. Given these results, that trade-off deserves scrutiny. The self-reflection and critique framework faces the same tension: internal critique loops that compress too aggressively lose exactly this causal signal.

It also reframes the relationship between scale and search. Meta-Harness is not brute force — it’s search with better information. The performance gap isn’t from more iterations; it’s from the proposer’s ability to reason causally about what changed.

What strategies does the agent actually discover?

The discovered harnesses are not obvious improvements. They’re structural strategies that require understanding a task’s failure modes. None of the baseline methods found them.

Text classification — Label-Primed Query (LPQ) variant: A single LLM call with three components: a label primer listing all valid output classes (preventing hallucinated labels), a coverage block with one example per class (ensuring the model has seen each target), and contrastive pairs at class boundaries (the examples most classifiers trip on). Retrieval uses TF-IDF with query-anchored pairing. Each component addresses a distinct failure mode visible in the execution traces.
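
A minimal sketch of assembling those three components into one prompt. The section wording and function signature are illustrative, and the TF-IDF query-anchored pairing step that selects the examples is omitted.

```python
def build_lpq_prompt(query: str,
                     labels: list[str],
                     coverage: dict[str, str],
                     contrastive_pairs: list[tuple[str, str, str, str]]) -> str:
    """Assemble a label primer, a one-example-per-class coverage block,
    and contrastive boundary pairs into a single classification prompt.

    coverage maps each label to one example text; contrastive_pairs holds
    (text_a, label_a, text_b, label_b) tuples drawn from class boundaries.
    """
    primer = "Valid labels (answer with exactly one):\n" + "\n".join(f"- {lab}" for lab in labels)

    coverage_block = "\n".join(f"[{lab}] {text}" for lab, text in coverage.items())

    boundary_block = "\n".join(
        f"A: {ta} -> {la}\nB: {tb} -> {lb}" for ta, la, tb, lb in contrastive_pairs
    )

    return (
        f"{primer}\n\n"
        f"One example per class:\n{coverage_block}\n\n"
        f"Near-boundary pairs (easy to confuse):\n{boundary_block}\n\n"
        f"Input: {query}\nLabel:"
    )
```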

Math reasoning — 4-route lexical router: On 200 IMO-level problems, the agent builds a lexical router with four tracks: combinatorics, geometry, number theory, and algebra/default. Each track has its own BM25 retrieval policy — combinatorics fetches 20 examples, deduplicates to 8, reranks to 3; geometry uses 1 fixed reference plus 2 BM25 neighbors; number theory fetches 12 with score-and-difficulty reranking. Domain-specific retrieval beats generic BM25 because retrieval failures are domain-specific too.
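
Those per-track policies read like a small configuration table. A sketch of how the routing could be expressed; the keyword lists and field names are illustrative, and only the fetch/dedup/rerank counts come from the paper.

```python
# Per-track retrieval policies expressed as a routing table.
# Keyword lists and field names are illustrative; the counts match the paper.
ROUTES = {
    "combinatorics": {"keywords": ["count", "arrangement", "permutation", "graph"],
                      "fetch": 20, "dedup_to": 8, "rerank_to": 3},
    "geometry":      {"keywords": ["triangle", "circle", "angle", "tangent"],
                      "fixed_refs": 1, "bm25_neighbors": 2},
    "number_theory": {"keywords": ["prime", "divisible", "modulo", "integer"],
                      "fetch": 12, "rerank": "score_and_difficulty"},
    "algebra":       {"keywords": []},  # default track
}

def route(problem: str) -> str:
    """Pick a track by lexical match; fall back to the algebra/default track."""
    text = problem.lower()
    for track, cfg in ROUTES.items():
        if any(kw in text for kw in cfg["keywords"]):
            return track
    return "algebra"
```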

TerminalBench-2 — Environment bootstrapping: The winning harness adds a single step before the agent loop: a compound shell command that collects available programming languages and versions, installed package managers, available memory, and /app directory contents. This eliminates the exploratory turns agents typically spend discovering tool availability, giving the agent full environmental context from the first LLM call.
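
A rough sketch of that bootstrapping step. The specific commands are assumptions chosen to match the categories the paper lists (languages and versions, package managers, memory, /app contents), not the discovered harness verbatim.

```python
import subprocess

# Illustrative compound command: probe languages, package managers, memory,
# and the working directory in one shot, before the agent's first LLM call.
BOOTSTRAP_CMD = (
    "python3 --version 2>&1; node --version 2>&1; go version 2>&1; "
    "command -v pip npm apt-get apk 2>&1; "
    "free -h 2>&1; "
    "ls -la /app 2>&1"
)

def bootstrap_environment() -> str:
    """Collect environment facts up front so the agent does not spend its
    first turns rediscovering them."""
    result = subprocess.run(["bash", "-lc", BOOTSTRAP_CMD],
                            capture_output=True, text=True, timeout=30)
    return result.stdout
```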

The qualitative trace of the TerminalBench-2 search shows something I didn’t expect: systematic causal reasoning over 10 iterations.

  • Iterations 1–2: bundle structural and prompt changes simultaneously; both regress from a 64.4% search-run baseline
  • Iteration 3: inspects prior traces, identifies that prompt changes — not structural changes — caused the regression
  • Iterations 4–6: isolates completion logic as the test variable; each regresses; the proposer concludes control-flow edits are high-risk
  • Iteration 7: shifts to purely additive modification (environment bootstrapping); this becomes the winning candidate
  • Iterations 8–10: explores combining the bootstrapping fix with earlier structural candidates, using the causal reasoning built in iterations 3–6

That multi-step causal chain is not available to a method with scalar feedback or text summaries. It requires reading prior traces to reconstruct what actually happened and what caused it.

For comparison, self-evolving agent architectures use evolutionary search over workflow composition — a related idea, but without the trace access that lets a proposer distinguish correlated from causal improvements.

How does Meta-Harness benchmark against prior optimizers?

The clearest comparison uses test-set averages across three text classification datasets (LawBench with 215 classes, Symptom2Disease with 22 classes, USPTO-50k with 180 classes):

| Method | Avg accuracy (test set) | Context tokens |
| --- | --- | --- |
| Zero-shot | 27.4% | n/a |
| Few-shot (32 examples) | 35.4% | n/a |
| MCE | 40.0% | 28.5K |
| ACE | 40.9% | 50.8K |
| Meta-Harness | 48.6% | 11.4K |

Meta-Harness is 7.7 points above ACE while using 4x fewer inference-time tokens. On out-of-distribution generalization across 9 unseen classification datasets, it scores 73.1% versus ACE’s 70.2% — the +2.9 point gap holds outside the training distribution.

On retrieval-augmented math reasoning, Meta-Harness reaches 38.8% pass@1 versus 37.5% for BM25 and 34.1% without retrieval. Across 5 held-out models — GPT-OSS-20B, GPT-5.4-nano, GPT-5.4-mini, Gemini-3.1-Flash-Lite, and Gemini-3-Flash — the average gain is +4.7 points. The harness was optimized using one model and transfers to four others it never saw.

On TerminalBench-2, Meta-Harness with Opus 4.6 reaches 76.4%, ranked #2 among all Opus 4.6 agents on the leaderboard. With Haiku 4.5, it reaches 37.6%, ranked #1 among all Haiku 4.5 agents — a result that matters if you’re thinking about inference cost in production.

For context on these margins: typical prompt optimization methods report 1–3 point gains on their target benchmarks. The +7.7 on text classification is 2–3x larger. The gap suggests the optimization surface — harness code — has more headroom than the prompt text that DSPy and its peers target.

Where does Meta-Harness fall short?

Four limitations worth understanding before using this in production.

Single proposer, no ablation. All experiments use Claude Code with Opus 4.6 as the proposer. Whether gains hold with a different or weaker model is unstudied. The system’s ceiling is the proposer’s code reasoning ability.

Compute cost. 10M tokens per iteration at Opus rates is not cheap, and the paper provides no cost analysis. For tasks where search budget matters, this needs explicit estimation before committing to the approach. The Hermes Agent skill creation approach shows that runtime-learned strategies can substitute for expensive optimization in some settings — worth considering which regime you’re actually in.

TerminalBench-2 circularity. Search and evaluation use the same benchmark. The paper acknowledges this is standard for discovery tasks, but it’s a confound the field has flagged before.

No weight-harness co-optimization. Meta-Harness optimizes the harness; model weights stay fixed. Joint optimization — adapting both simultaneously — is left as future work. The practical limit: if the model’s base capability is the actual bottleneck, a better harness won’t close the gap.

The biggest open question is transfer to tasks with ambiguous failure modes. Text classification and terminal tasks have legible traces — you can read them and understand what failed. Open-ended generation tasks may not have this property. Whether the filesystem interface remains useful when traces don’t cleanly reveal failure causes remains to be tested.


Frequently asked questions

What is Meta-Harness?

Meta-Harness (arXiv:2603.28052, Stanford/MIT/KRAFTON, March 2026) is an automated harness optimization system that treats harness code — prompts, retrieval logic, orchestration, and output parsing — as the search variable. A Claude Code agent reads raw execution traces and scores from a structured filesystem, proposes new harness candidates, evaluates them, and iterates. It finds harnesses that match or exceed hand-engineered solutions across text classification, math reasoning, and agentic coding benchmarks.

What is a harness in the context of LLMs?

A harness is all the code surrounding an LLM: prompt templates, retrieval policies, context management, tool definitions, completion-checking loops, and output parsing. It’s distinct from model weights. Most optimization frameworks (DSPy, MIPRO, GEPA) target prompt text. Meta-Harness targets the full harness code as a self-contained Python file, including retrieval strategy, routing logic, and orchestration — the parts that typically require manual engineering.

Why does Meta-Harness use raw traces instead of text summaries?

The paper’s ablation shows text summaries add only 0.3 accuracy points over scores alone. Full execution traces add 15 points. Summaries compress out the causal signal: they capture what happened but not why. Full traces preserve intermediate outputs, tool calls, and per-example failure modes — the information a coding agent needs to reason about what changed between attempts and propose targeted fixes rather than random variations.

How does Meta-Harness compare to DSPy?

DSPy optimizes over prompt text and few-shot examples. Meta-Harness optimizes over full harness code including retrieval policy, routing logic, and context management. DSPy’s MIPROv2 uses Bayesian optimization over a compact search space; Meta-Harness uses an agentic loop with up to 10M-token context access to prior runs. The approaches are complementary: DSPy is faster and more accessible; Meta-Harness targets a higher ceiling for complex, multi-step agent tasks.

What are the main benchmark results for Meta-Harness?

On text classification (LawBench 215 classes, Symptom2Disease 22 classes, USPTO-50k 180 classes): 48.6% average test-set accuracy versus 40.9% for ACE, using 4x fewer context tokens. On TerminalBench-2 (89 agentic tasks): 76.4% with Opus 4.6 versus 74.7% for the hand-engineered baseline, ranked #2 among all Opus 4.6 agents. On IMO-level math reasoning: +4.7 points average gain across 5 held-out models compared to BM25 baseline.
