The judgment shift: what engineering looks like when agents write 80% of the code
“The pilot who fights the autopilot crashes faster than the pilot who never learned to fly.”
TL;DR
Karpathy says he went from writing 80% of his code to writing 20%, delegating the rest to agents. Everyone cites this stat. Nobody operationalizes what replaces the code-writing time. The answer is judgment: task decomposition, specification writing, output evaluation, failure triage, and escalation timing. This post breaks down each skill with concrete practices. For how Karpathy’s autoresearch framework applies these principles at scale, see Karpathy’s autoresearch.

What did Karpathy actually say?
In January 2026, Karpathy described a phase change in how he programs. Not a gradual improvement — a discontinuous shift. He said coding agents “barely worked” before late 2025 and now handle the majority of his programming. He coined the term vibe coding — writing software by describing intent rather than implementation, trusting the agent to translate intent into code.
The specific ratio he described: from 80% writing code, 20% reviewing, to 20% writing code, 80% reviewing and directing. The shift happened over weeks, not years. With $211B flowing into AI in 2025 and inference costs for GPT-3.5-level performance dropping 280x since November 2022 (Stanford HAI 2025 AI Index), the economic pressure to adopt this workflow is real. The question is not whether agents will write most code. It is whether engineers will learn to judge the output fast enough.
The broader industry is converging on a similar insight: the unit of comparison is shifting from raw model benchmarks to the full system — model plus orchestration plus tool access. The same logic applies to engineering: the unit of evaluation is not “lines of code written” but “problems solved per unit time.” Agents change the equation.
What does judgment actually mean in practice?
Everyone says “judgment.” Nobody defines it. Here are five concrete skills, ordered by how frequently you use them in a typical agentic engineering day.
Skill 1: Task decomposition. Breaking a problem into pieces an agent can handle. Too large, and the agent loses context or hallucinates a shortcut. Too small, and you spend more time specifying than doing. The calibration takes practice. A useful heuristic: if you can describe the task in one paragraph with clear acceptance criteria, it is the right size.
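Here is that heuristic written down, as a minimal sketch. The task, field names, and criteria are hypothetical examples, not anything Karpathy prescribes:

```python
# One agent-sized task: describable in a paragraph, with explicit
# acceptance criteria. Everything here is a hypothetical illustration.
task = {
    "task_type": "data migration",
    "description": (
        "Write a migration that backfills `user.timezone` from the "
        "last login's IP geolocation; default missing values to UTC."
    ),
    "acceptance": [
        "idempotent: safe to run twice",
        "handles users with no login history",
        "supports a --dry-run flag that prints changes without writing",
    ],
}
```

If you cannot fill in the acceptance list, the task is probably too big to delegate in one piece; split it until you can.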
Skill 2: Specification writing. The agent does what you ask. If your ask is vague, the output is vague. Writing a precise specification before delegating is the highest-leverage time investment in agentic engineering. Include: what the output should look like, what constraints it must satisfy, what edge cases matter. This is not a prompt template. It is an acceptance test written before the work begins.
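What "acceptance test before the work begins" can look like in practice, as a minimal pytest sketch. The `slugify` function and its `textutils` module are hypothetical targets named in the spec; the point is that these tests exist before the agent writes a line:

```python
# Written BEFORE delegating. The agent's job is to make these pass.
# `textutils.slugify` is a hypothetical module/function named in the spec.
from textutils import slugify

def test_output_shape():
    assert slugify("Hello, World!") == "hello-world"

def test_constraints():
    # constraint from the spec: lowercase, hyphen-separated, <= 64 chars
    slug = slugify("A" * 200)
    assert len(slug) <= 64
    assert slug == slug.lower()

def test_edge_cases():
    # edge cases named up front, not discovered in review
    assert slugify("") == ""
    assert slugify("émigré café") == "emigre-cafe"
```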
Skill 3: Output evaluation. Reviewing agent-generated code with the right level of scrutiny. Not line-by-line (too slow) and not rubber-stamping (too dangerous). The skill is knowing which parts of the output to trust and which to verify. Trust formatting. Trust boilerplate. Verify logic, verify edge case handling, verify that the code actually does what the specification asked for.
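One way to make the trust/verify split mechanical is to write it down and run every agent diff through it. A minimal sketch; the categories are illustrative, not a standard taxonomy:

```python
# Trust/verify tiers for reviewing agent output. Adapt to your codebase.
TRUST = {"formatting", "boilerplate", "import ordering"}
VERIFY = {"core logic", "edge case handling", "spec conformance"}

def review_queue(diff_summary: dict[str, str]) -> list[str]:
    """Given {change category: short note}, return what needs human eyes."""
    return [
        f"VERIFY: {category}: {note}"
        for category, note in diff_summary.items()
        if category in VERIFY
    ]

print(review_queue({
    "formatting": "reflowed long lines",
    "core logic": "rewrote pagination cursor math",
}))  # ['VERIFY: core logic: rewrote pagination cursor math']
```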
Skill 4: Failure triage. When the agent’s output is wrong, diagnosing whether the failure is recoverable (unclear spec, easy to redirect) or fundamental (the agent cannot do this task). Recoverable failures get a revised specification. Fundamental failures get taken back manually. The cost of misdiagnosing — retrying a fundamentally broken approach — is the biggest time sink in agentic engineering.
Skill 5: Escalation timing. Knowing when to take back the wheel. Two retries with refined specifications is reasonable. Five retries on the same problem is a signal. The agent is not going to get it right on attempt six. Take it back, do it yourself, and file it as a task type the agent cannot handle yet. Build your personal map of agent capabilities over time.
```mermaid
graph TD
    A[Problem arrives] --> B[Decompose into<br/>agent-sized tasks]
    B --> C[Write specification<br/>with acceptance criteria]
    C --> D[Delegate to agent]
    D --> E{Output meets spec?}
    E -->|Yes| F[Ship it]
    E -->|No| G{Recoverable failure?}
    G -->|Yes| H[Revise specification]
    H --> D
    G -->|No| I[Take it back manually]
    I --> J[Update capability map]
```
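The same loop as code, as a minimal sketch. The harness hooks (`delegate`, `meets_spec`, and the rest) are hypothetical callables standing in for whatever agent stack you actually run:

```python
from typing import Any, Callable

MAX_RETRIES = 2  # two refined retries is reasonable; five is a signal

def run_task(
    spec: dict,
    delegate: Callable[[dict], Any],          # hand a spec to the agent
    meets_spec: Callable[[Any, dict], bool],  # run the acceptance tests
    is_recoverable: Callable[[Any, dict], bool],
    refine: Callable[[dict, Any], dict],      # revise spec from the failure
    take_back: Callable[[dict], Any],         # do it yourself
    capability_map: dict[str, list[bool]],
) -> Any:
    for _ in range(1 + MAX_RETRIES):
        output = delegate(spec)
        if meets_spec(output, spec):
            capability_map.setdefault(spec["task_type"], []).append(True)
            return output  # ship it
        if not is_recoverable(output, spec):
            break  # fundamental failure: retrying will not fix it
        spec = refine(spec, output)  # recoverable: revise and retry
    # escalate: file this as a task type the agent cannot handle yet
    capability_map.setdefault(spec["task_type"], []).append(False)
    return take_back(spec)
```

Note the asymmetry: success and failure both update the capability map, but only recoverable failures re-enter the loop.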
What does a senior engineer’s day look like now?
The old loop: read requirements, design, code, test, review, deploy. The new loop: read requirements, decompose, specify, delegate, evaluate, iterate or ship. The second loop is faster on most tasks and slower on novel problems the agent has not seen patterns for.
A realistic day: morning standup, then 2-3 hours delegating well-specified tasks to agents — one generating a data migration script, another writing test coverage for an existing module, a third drafting API documentation. Between delegations, review the prior batch of agent outputs. Flag the migration script’s edge case handling for manual review. Accept the test coverage as-is. Revise the documentation spec because the agent missed the audience context.
Afternoon: tackle the architectural decision the agent cannot make — which service owns this data, where the abstraction boundary goes, whether to accept technical debt now or invest in the clean solution. These are judgment calls that require understanding the codebase, the team, the business constraints. No agent handles this yet.
The ratio is not fixed. Some days are 90% delegation. Some are 90% manual. The skill is matching the workflow to the task, not forcing every task into the agent-delegated pattern. For how automated optimization pipelines apply similar judgment at the harness level, see Meta-Harness.
How do you build these skills?
Start with the easiest judgment: output evaluation on tasks you already know how to do manually. Delegate something you could write yourself in 30 minutes. The agent produces output in 2 minutes. Spend 5 minutes evaluating. Was it correct? Was it close? Where did it fail? After 10 repetitions, you have a calibrated model of what the agent does well and where it breaks.
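Tracking those 10 repetitions can be as simple as a tally. A minimal sketch of the capability map this builds; the task types are made up:

```python
from collections import defaultdict

# Log each delegation's outcome; read off success rates per task type.
outcomes: dict[str, list[bool]] = defaultdict(list)

def record(task_type: str, accepted: bool) -> None:
    outcomes[task_type].append(accepted)

def success_rates() -> dict[str, float]:
    return {t: sum(flags) / len(flags) for t, flags in outcomes.items()}

# After ten or so repetitions per task type, the rates are your
# calibrated model of what the agent does well and where it breaks.
record("test coverage", True)
record("data migration", False)
record("data migration", True)
print(success_rates())  # {'test coverage': 1.0, 'data migration': 0.5}
```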
Then move to specification writing. Take a task you planned to do manually and write the specification instead. Time yourself. If writing the spec took longer than doing the task, the task is too small for delegation. If the spec took 5 minutes and the agent produced correct output from it, you just found a 6x productivity multiplier on that task type.
The judgment skills compound. Better specifications produce better agent output. Better output evaluation catches errors earlier. Better failure triage avoids wasted retry loops. After a month of deliberate practice, most engineers report the shift Karpathy described — not incremental improvement, but a phase change in how work gets done.
Key takeaways
- The shift is from production to judgment. Writing code matters less. Evaluating code matters more. The five skills: decompose, specify, evaluate, triage, escalate.
- Specification is the highest-leverage skill. A precise specification before delegation produces better output than three rounds of vague prompting and revision.
- Build your capability map. Track which task types agents handle reliably and which they botch. Your map will differ from others’ because it depends on your domain, your codebase, and your agent stack.
- Match the workflow to the task. Some problems are 90% delegatable. Some are 90% manual. Forcing everything into one pattern wastes time both ways.
- Judgment compounds with practice. Deliberate practice on all five skills produces the phase change Karpathy describes within weeks, not months.
Further reading
- Karpathy’s autoresearch — automated experiment pipelines applying judgment at scale
- Meta-Harness — LLM optimization through raw execution trace analysis
- AI agents that actually make money — commercial viability framework for agent investments
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch