Agent psychometrics: predicting coding agent performance before you run it

TL;DR — You cannot A/B test agents in production — a failed coding agent action means corrupted repos, wrong refactors, or broken builds. Agent Psychometrics (arXiv 2604.00594) borrows behavioral profiling from organizational psychology: structured probe tasks yield a capability profile that predicts real-world performance on SWE-bench Verified and Terminal-Bench. Profile before you deploy, not after you break things.
The cost of learning through failure
When you deploy a new UI button, you A/B test it. When you deploy a new coding agent, you cannot. A wrong button color loses clicks. A wrong agent action corrupts a codebase, sends a broken PR, or deletes a production database migration. The feedback loop for agents is expensive, slow, and sometimes irreversible.
The standard approach — run benchmarks, check aggregate scores, ship — tells you the agent’s average capability. It does not tell you which specific task categories will fail in your production workload. An agent that scores 72% on SWE-bench might ace single-file bug fixes and fail every multi-file refactoring task. If your codebase is mostly multi-file refactoring, that 72% is meaningless.
Agent Psychometrics (arXiv 2604.00594, April 2026) proposes a different approach: profile the agent’s capabilities before deployment using structured probe tasks designed to test specific skills. The output is not a single score but a capability map — what the agent does well, what it does poorly, and where you should either improve it or restrict its scope.
What organizational psychology teaches us about agent evaluation
The insight behind Agent Psychometrics comes from a well-established field. Organizational psychologists have spent decades building structured assessment batteries to predict job performance. The principle: administer tasks that probe specific competencies (communication, problem-solving, domain knowledge), extract a behavioral profile, and use that profile to predict performance in the actual role.
The transfer to AI agents is direct. A coding agent needs specific competencies: understanding codebases, planning multi-step changes, writing correct code, testing its own output, recovering from errors. Each competency can be tested with targeted probes:
| Competency | Probe task | What it reveals |
|---|---|---|
| Code comprehension | “Explain what this function does and identify the bug” | Whether the agent reads code accurately before modifying it |
| Multi-file reasoning | “This change in file A requires updates in files B and C. Find them.” | Whether the agent tracks cross-file dependencies |
| Error recovery | “This test fails after your change. Diagnose and fix without reverting.” | Whether the agent debugs or gives up |
| Scope discipline | “Fix the bug in this function. Do not modify anything else.” | Whether the agent over-engineers or stays focused |
| Test awareness | “Write a change and verify it passes existing tests.” | Whether the agent validates its own work |
A 2-hour probe battery covering these five competencies produces a profile that predicts real-world failure categories. If the agent fails the multi-file reasoning probe, you know not to deploy it on refactoring tasks — regardless of its aggregate benchmark score.
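To make the battery concrete, here is a minimal sketch of how such probes could be represented in code. Everything here is hypothetical scaffolding rather than the paper's implementation: the competency names follow the table above, and the toy keyword scorers stand in for whatever transcript grading your harness actually uses.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Probe:
    competency: str                  # the single capability this probe isolates
    prompt: str                      # the short task given to the agent
    scorer: Callable[[str], float]   # grades the agent's transcript on [0, 1]

# Hypothetical battery: in practice you would want several probes per
# competency and real transcript graders instead of keyword checks.
BATTERY = [
    Probe("code_comprehension",
          "Explain what this function does and identify the bug: ...",
          lambda t: 1.0 if "off-by-one" in t.lower() else 0.0),
    Probe("multi_file_reasoning",
          "This change in file A requires updates elsewhere. List the files.",
          lambda t: 1.0 if {"b.py", "c.py"} <= set(t.lower().split()) else 0.0),
    # ... plus error_recovery, scope_discipline, and test_awareness probes
]
```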
How the prediction pipeline works
Agent Psychometrics validates its approach against SWE-bench Verified and Terminal-Bench with execution feedback. The pipeline:
- Administer probes. Run the agent through a structured set of tasks designed to test individual competencies. Each probe is short (minutes, not hours) and tests one thing.
- Extract the behavioral profile. From the probe results, build a vector of capability scores: code comprehension 0.85, multi-file reasoning 0.42, error recovery 0.71, scope discipline 0.93, test awareness 0.68.
- Map capabilities to task categories. Using historical data from benchmark runs, map which capabilities predict success on which task categories. Multi-file reasoning predicts multi-file bug fixes. Error recovery predicts tasks with complex dependency chains.
- Predict per-category performance. For each task category in your production workload, estimate the agent’s success probability based on its capability profile.
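A sketch of the profile-extraction step, assuming each probe run yields a (competency, score) pair like the battery above produces. Per-competency averaging is the simplest aggregation and not necessarily what the paper uses:

```python
from collections import defaultdict

def extract_profile(results: list[tuple[str, float]]) -> dict[str, float]:
    """Aggregate (competency, score) probe results into a capability vector."""
    by_competency: dict[str, list[float]] = defaultdict(list)
    for competency, score in results:
        by_competency[competency].append(score)
    return {c: sum(s) / len(s) for c, s in by_competency.items()}

profile = extract_profile([
    ("code_comprehension", 0.9), ("code_comprehension", 0.8),
    ("multi_file_reasoning", 0.42), ("error_recovery", 0.71),
    ("scope_discipline", 0.93), ("test_awareness", 0.68),
])
# -> {'code_comprehension': 0.85, 'multi_file_reasoning': 0.42, ...}
```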
```mermaid
graph LR
    A[Structured probe tasks - 2 hours] --> B[Behavioral profile - 5 capability scores]
    B --> C[Capability-to-task mapping]
    C --> D[Per-category performance prediction]
    D --> E{Deploy decision}
    E -->|Strong across categories| F[Deploy with standard monitoring]
    E -->|Weak in specific category| G[Restrict scope OR improve that capability]
    E -->|Weak across categories| H[Do not deploy - iterate on agent]
    style G fill:#ff9800,color:#000
    style H fill:#d32f2f,color:#fff
```
The practical outcome: instead of “this agent scores 72% on SWE-bench,” you get “this agent will likely succeed at single-file bug fixes (87% predicted), will likely fail at multi-file refactoring (34% predicted), and recovers from errors only moderately well (61%).” That prediction is actionable: you can deploy with scope restrictions, invest in improving multi-file reasoning, or switch to a different agent for that task category.
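The mapping step can be sketched as one small logistic model per task category over the capability vector. The weights below are illustrative inventions chosen to reproduce the numbers above, not values from the paper:

```python
import math

# Hypothetical weights, standing in for the capability-to-category mapping
# learned from historical benchmark runs.
CATEGORY_WEIGHTS = {
    "single_file_bugfix":     {"code_comprehension": 3.0, "test_awareness": 1.5, "_bias": -1.67},
    "multi_file_refactoring": {"multi_file_reasoning": 4.5, "scope_discipline": 1.0, "_bias": -3.5},
}

def predict_success(profile: dict[str, float], category: str) -> float:
    """Logistic model: capability profile -> predicted success rate in a category."""
    w = CATEGORY_WEIGHTS[category]
    z = w["_bias"] + sum(wt * profile.get(cap, 0.0)
                         for cap, wt in w.items() if cap != "_bias")
    return 1.0 / (1.0 + math.exp(-z))

profile = {"code_comprehension": 0.85, "multi_file_reasoning": 0.42,
           "error_recovery": 0.71, "scope_discipline": 0.93, "test_awareness": 0.68}
for category in CATEGORY_WEIGHTS:
    print(f"{category}: {predict_success(profile, category):.0%} predicted success")
# single_file_bugfix: 87% predicted success
# multi_file_refactoring: 34% predicted success
```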
What this changes about agent deployment
Agent evaluation today is benchmark-centric. Run SWE-bench, get a score, compare to other agents, deploy the highest scorer. The problem: benchmark scores are aggregated across task categories that may not match your production workload, and a high aggregate score can mask catastrophic failure in specific categories.
Agent Psychometrics shifts evaluation from “how good is this agent overall” to “how good is this agent at the specific things I need it to do.” The implications:
For agent developers: Probe-based profiling identifies which capabilities to invest in. If your agent’s multi-file reasoning is the bottleneck, you know where to focus fine-tuning, prompt improvement, or architectural changes. A targeted 10% improvement in multi-file reasoning might matter more than a 2% improvement across all categories.
For agent deployers: Capability profiles enable informed scope decisions. Rather than deploying an agent with full access and hoping it performs well everywhere, you deploy with permissions matched to capabilities. Strong at code review? Deploy for review. Weak at refactoring? Restrict that scope until the profile improves.
For agent evaluation frameworks: Psychometric profiling complements existing benchmarks rather than replacing them. Use probes for rapid screening (2 hours), then run full benchmarks on the categories the profile predicts will be problematic. This reduces benchmark compute by focusing on the areas that matter.
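On the deployer side, the scope decision can be a small policy gate in the agent harness. A sketch with hypothetical category names and an arbitrary 0.7 threshold:

```python
DEPLOY_THRESHOLD = 0.7  # arbitrary cutoff; set it from your own risk tolerance

def allowed_categories(predictions: dict[str, float]) -> set[str]:
    """Grant the agent only the task categories its profile supports."""
    return {cat for cat, p in predictions.items() if p >= DEPLOY_THRESHOLD}

scope = allowed_categories({
    "code_review": 0.88,
    "single_file_bugfix": 0.87,
    "multi_file_refactoring": 0.34,
})
# -> {'code_review', 'single_file_bugfix'}; refactoring stays out of scope
# until a re-run of the probe battery shows the profile has improved.
```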
Building your own probe battery
You do not need to wait for a standard probe battery. The principle is straightforward: design short tasks that isolate specific agent competencies.
Start with failure analysis. Look at where your agent fails in production or testing. Classify failures into categories (wrong code, missed dependencies, broke existing tests, over-engineered solution, gave up). Each category suggests a competency gap that needs a probe.
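A lightweight starting point, assuming you tag each production failure with a category. The category and competency names here are hypothetical:

```python
from collections import Counter

# Hypothetical incident log: one tagged category per failed agent run.
failures = ["missed_dependency", "broke_tests", "missed_dependency",
            "over_engineered", "missed_dependency"]

# Hypothetical mapping from failure category to the competency to probe.
GAP_TO_COMPETENCY = {
    "missed_dependency": "multi_file_reasoning",
    "broke_tests": "test_awareness",
    "over_engineered": "scope_discipline",
}

for category, count in Counter(failures).most_common():
    print(f"{count}x {category} -> write a {GAP_TO_COMPETENCY[category]} probe")
```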
Design probes that test one thing. A good probe has one right answer (or a narrow set of right answers) and tests one competency. “Fix this bug” tests too many things at once. “Identify which file this bug originates from” tests code comprehension specifically.
Run probes before every model update. When you upgrade the underlying LLM, swap frameworks, or modify prompts, re-run the probe battery and compare profiles before and after. If an update improves aggregate benchmark scores but quietly degrades multi-file reasoning, a realistic failure mode, the probes catch the regression before production does.
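A sketch of that before/after comparison, suitable as a CI gate on model or prompt changes; the 0.05 tolerance is an arbitrary choice:

```python
def profile_regressions(before: dict[str, float], after: dict[str, float],
                        tolerance: float = 0.05) -> dict[str, float]:
    """Competencies whose score dropped by more than `tolerance` between runs."""
    return {c: round(after.get(c, 0.0) - before[c], 2)
            for c in before
            if after.get(c, 0.0) < before[c] - tolerance}

before = {"code_comprehension": 0.85, "multi_file_reasoning": 0.42}
after  = {"code_comprehension": 0.91, "multi_file_reasoning": 0.30}
print(profile_regressions(before, after))  # {'multi_file_reasoning': -0.12}
```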
Calibrate against production outcomes. Over time, compare probe predictions against actual production performance. Which probes predict production failures most accurately? Weight those probes higher. Which probes have no predictive value? Drop them. The probe battery should evolve with your deployment experience.
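Calibration can start as a simple agreement count between each probe's verdict and the production outcome it is supposed to predict. A sketch with invented history data:

```python
def probe_accuracy(history: list[tuple[bool, bool]]) -> float:
    """Fraction of (probe_passed, production_succeeded) pairs that agree."""
    agreements = sum(p == o for p, o in history)
    return agreements / len(history)

# Hypothetical history: each entry pairs a probe verdict with the production
# outcome of a task in the category that probe is supposed to predict.
history = {
    "multi_file_reasoning": [(False, False), (False, False), (True, True), (False, True)],
    "scope_discipline":     [(True, True), (True, False), (False, True), (True, True)],
}
for probe, pairs in history.items():
    print(f"{probe}: {probe_accuracy(pairs):.0%} agreement with production")
# Probes hovering near 50% agreement carry no signal; drop or redesign them.
```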
Key takeaways
- Agent Psychometrics (arXiv 2604.00594) applies behavioral profiling from organizational psychology to predict AI agent performance from structured probe tasks
- A 2-hour probe battery produces per-category performance predictions, not just an aggregate score — identifying where an agent will succeed and where it will fail
- Validated against SWE-bench Verified and Terminal-Bench with execution feedback
- The practical shift: from “is this agent good enough to deploy?” to “what is this agent good enough to deploy for?”
- Probe-based evaluation complements benchmarks: screen quickly with probes, then run targeted full benchmarks on identified weak areas
- Build probes from production failure analysis, test one competency each, re-run on every model or prompt update, and calibrate against production outcomes over time
FAQ
What is Agent Psychometrics? A research framework (arXiv 2604.00594) that applies behavioral profiling to AI agents. Structured probe tasks test specific competencies (code comprehension, multi-file reasoning, error recovery, scope discipline, test awareness). The resulting capability profile predicts per-category performance on real-world benchmarks.
How long does an evaluation take? Approximately 2 hours for a structured probe battery covering five core competencies. This is faster than running full benchmark suites and produces more actionable output — per-category predictions rather than a single aggregate score.
Does this replace SWE-bench? No. Agent Psychometrics is a screening tool. Use it to identify which categories to investigate further, then run targeted benchmarks on the weak areas. Think of probes as triage before the full evaluation.
Can I use this for non-coding agents? The principle generalizes. Design probes that test the specific competencies your agent needs — research accuracy, tool selection, communication clarity, task decomposition — and validate against your production outcomes. The organizational psychology literature provides extensive guidance on designing assessment batteries for different job roles.
Further reading
- Agent evaluation frameworks — the benchmark-based evaluation this complements
- Testing AI agents — unit and integration testing patterns for agents
- Agent benchmarking: a deep dive — comprehensive benchmark analysis
- AI harness engineering — infrastructure for agent measurement