AI agents that actually make money: the commercial viability framework
The question is never “can you build it?”
Any team with a capable engineer and access to a good model can build an agent in a week. The demo will impress. The Slack channel will celebrate. The project will get a budget.
Then, six months later, it quietly dies. Not because the model was too slow or the tools were unreliable — but because nobody ran the math on whether this task was actually worth automating.
According to Gartner’s June 2025 analysis, over 40% of agentic AI projects are on track to be scrapped by 2027. The cause is not technical failure. It’s that teams skip the economics.
This post is a framework for not skipping the economics. It gives you a three-axis scoring system you can apply to any agent idea in about ten minutes. I’ll also map 12 real deployed agent archetypes against it, so you can see where the commercial wins actually cluster.
TL;DR
Most agent projects fail commercially not because the technology breaks but because teams pick the wrong use cases. Three axes determine whether an agent creates or destroys value: task automation leverage (time freed × loaded cost), failure tolerance (cost of a wrong action), and supervision ratio (human time per agent action). This post builds that framework and maps 12 real deployed archetypes against it. See What are AI agents for the foundations this analysis assumes.

Why most agent projects fail commercially
The short answer: teams optimize for “technically possible” instead of “commercially justified.” Three failure modes account for most dead agent projects.
The first failure mode is wrong use case selection — automating tasks where the economics never worked. A task that takes a $20/hour data entry clerk 4 minutes to complete has $1.33 of loaded cost per task. Building and maintaining an agent for that probably costs more than just paying the clerk.
The second failure mode is misjudged failure tolerance. Agents make mistakes. The question is what those mistakes cost. An agent that occasionally misroutes a customer support ticket is cheap to fix. An agent that occasionally approves the wrong loan application is not. Teams that use the same confidence threshold across both types of tasks end up either over-supervising low-stakes agents (destroying the margin) or under-supervising high-stakes ones (creating liability).
The third failure mode is what I call the supervision illusion — projecting zero ongoing human oversight into ROI models. In pilots, one engineer watches everything closely. In production, that engineer is watching thousands of actions with partial attention. The hidden cost of ongoing review, prompt maintenance, and edge-case triage is consistently underestimated — enterprise AI cost guides from 2025 put ongoing human oversight at 30-50% of first-year total cost of ownership.
Gartner’s data on this is sobering: 40% of projects scrapped by 2027. But it’s not the 40% that’s interesting — it’s that the projects that survive share a common pattern. They were selected by economics, not by “this would be cool.”
The three-axis evaluation framework
Three numbers determine whether any agent use case will generate value: task automation leverage, failure tolerance, and supervision ratio. Score each axis before writing a line of code.
Axis 1: Task automation leverage
Task automation leverage = (time per task in hours) × (loaded hourly cost) × (task volume per month).
This is the total monthly value at stake. It tells you the ceiling — the maximum possible savings if the agent handles everything perfectly. If the ceiling is low, stop here.
Example: a document extraction task that takes a paralegal 15 minutes per contract, at a $120/hour loaded cost, with 400 contracts per month = $12,000/month in leverage. Worth pursuing. A task that takes a $25/hour clerk 3 minutes, 1,000 times a month = $1,250/month. Much harder to justify build and maintenance costs.
A useful rule of thumb: leverage needs to exceed $10,000/month for a project to recover its costs within 12 months at typical enterprise build and operating costs.
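The arithmetic is simple enough to sanity-check in a few lines. Here is a minimal sketch in Python; the function name is illustrative and the figures are the worked examples from above, not a library:

```python
def monthly_leverage(minutes_per_task: float,
                     loaded_hourly_cost: float,
                     tasks_per_month: int) -> float:
    """Total monthly value at stake if the agent handled every task perfectly."""
    return (minutes_per_task / 60) * loaded_hourly_cost * tasks_per_month

# Paralegal contract review: 15 min/contract at $120/hr, 400 contracts/month
print(monthly_leverage(15, 120, 400))   # 12000.0 -> worth pursuing
# Data entry: 3 min/task at $25/hr, 1,000 tasks/month
print(monthly_leverage(3, 25, 1000))    # 1250.0 -> hard to justify an agent
```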
Axis 2: Failure tolerance
Failure tolerance is the inverse of the cost of a wrong agent action. High failure tolerance means wrong outputs are cheap to fix. Low failure tolerance means wrong outputs are expensive or irreversible.
Rate it on three tiers:
- High tolerance: Wrong action costs < $100 to fix or is purely cosmetic. Customer support misrouting, content draft errors, lead scoring mistakes.
- Medium tolerance: Wrong action costs $100-$10,000 to fix, with moderate reputational risk. Contract clause extraction errors, invoice discrepancy misses, scheduling conflicts.
- Low tolerance: Wrong action costs > $10,000 or is irreversible. Financial transaction execution, medical treatment decisions, legal filings.
Low-tolerance use cases are not impossible, but they require supervision ratios close to 1.0 (roughly one human-minute per action, which in practice means a human reviewing every output), and that often destroys the economics unless the leverage is enormous.
Axis 3: Supervision ratio
Supervision ratio = human-minutes per agent action, ongoing in production.
This is the number teams most often set to zero in their projections and then find is actually 0.2 or 0.3 in production. At 0.3, an agent handling 1,000 tasks per day requires 300 minutes (5 hours) of human attention per day — someone’s entire morning — for review, escalation handling, and prompt tuning.
A sustainable agent typically has a supervision ratio below 0.1. Below 0.05 is excellent.
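To see why the ratio bites at scale, here is the same arithmetic as a sketch (illustrative function, not from any framework):

```python
def daily_supervision_hours(supervision_ratio: float, tasks_per_day: int) -> float:
    """Supervision ratio is human-minutes per agent action; convert to hours/day."""
    return supervision_ratio * tasks_per_day / 60

print(daily_supervision_hours(0.30, 1000))  # 5.0  -> someone's entire morning
print(daily_supervision_hours(0.05, 1000))  # ~0.83 -> sustainable
```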
Here is the framework as a scoring matrix:
COMMERCIAL VIABILITY SCORING MATRIX
=====================================
Axis 1: Task Automation Leverage (monthly)
> $50k/month → 3 pts
$10k-$50k/month → 2 pts
$1k-$10k/month → 1 pt
< $1k/month → 0 pts (stop here)
Axis 2: Failure Tolerance
High → 3 pts
Medium → 2 pts
Low → 1 pt
Catastrophic → 0 pts (human-in-loop required; build decision support, not an autonomous agent)
Axis 3: Supervision Ratio (achievable in production)
< 0.05 → 3 pts
0.05–0.10 → 2 pts
0.10–0.30 → 1 pt
> 0.30 → 0 pts (supervision cost eliminates margin)
Total Score:
7–9 pts → Strong commercial case, build with confidence
5–6 pts → Marginal; validate with a narrow pilot before full build
3–4 pts → Weak; reconsider the use case or scope
0–2 pts → Stop. The economics don't work.
The framework is intentionally blunt. A 2-point score isn’t a project to “refine” — it’s a project to kill.
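The matrix is mechanical enough to encode directly. A minimal sketch, taking the thresholds above at face value (the function and its signature are illustrative, not a published tool):

```python
def viability_score(monthly_leverage: float,
                    failure_tolerance: str,    # "high" | "medium" | "low" | "catastrophic"
                    supervision_ratio: float) -> tuple[int, str]:
    """Score an agent idea against the three-axis matrix above."""
    if monthly_leverage < 1_000:
        return 0, "Stop. The economics don't work."
    leverage_pts = 3 if monthly_leverage > 50_000 else 2 if monthly_leverage >= 10_000 else 1
    tolerance_pts = {"high": 3, "medium": 2, "low": 1, "catastrophic": 0}[failure_tolerance]
    if supervision_ratio > 0.30:
        supervision_pts = 0
    elif supervision_ratio > 0.10:
        supervision_pts = 1
    elif supervision_ratio >= 0.05:
        supervision_pts = 2
    else:
        supervision_pts = 3
    total = leverage_pts + tolerance_pts + supervision_pts
    verdict = ("Strong commercial case" if total >= 7 else
               "Marginal; validate with a narrow pilot" if total >= 5 else
               "Weak; reconsider the use case" if total >= 3 else
               "Stop. The economics don't work.")
    return total, verdict

# Customer support deflection, roughly as the table below scores it:
print(viability_score(60_000, "high", 0.04))  # (9, 'Strong commercial case')
```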
Mapping 12 agent archetypes to the matrix
The 12 most commercially deployed agent archetypes cluster into three groups: high-confidence wins, conditional plays, and traps that keep getting funded despite poor economics.
For each archetype, I’ve scored the three axes based on real deployment data where available.
| Archetype | Leverage | Failure Tolerance | Supervision Ratio | Total | Verdict |
|---|---|---|---|---|---|
| Customer support deflection | High ($50k+) | High | Low (0.04–0.08) | 8–9 | Strong |
| Invoice / receipt extraction | Medium–High | Medium | Low (0.05–0.10) | 6–8 | Strong |
| Sales lead qualification | High | High | Low–Medium (0.05–0.15) | 6–8 | Strong |
| Code review assistance | Medium | Medium | Medium (0.10–0.20) | 5–7 | Conditional |
| Contract clause extraction | High | Medium | Medium (0.10–0.20) | 6–7 | Conditional |
| Loan application pre-screening | High | Low–Medium | Medium–High (0.20–0.40) | 4–6 | Conditional |
| Meeting notes / action items | Low–Medium | High | Low (0.02–0.05) | 4–6 | Conditional |
| HR policy Q&A | Low | High | Low (0.02–0.05) | 3–5 | Marginal |
| Content first drafts | Medium | High | Medium (0.10–0.25) | 5–6 | Conditional |
| Compliance monitoring | High | Low | High (0.30–0.50) | 3–5 | Trap |
| Medical triage routing | High | Catastrophic | Very High | 1–3 | Trap |
| Financial transaction execution | Very High | Catastrophic | Very High | 1–3 | Trap |
The three strong wins in detail
Customer support deflection is the clearest commercial win in the agent space right now. Klarna’s support agent handled 2.3 million conversations, cut resolution time from 11 minutes to under 2 minutes, and drove an estimated $40 million in profit improvement (CX Dive, 2024). Intercom’s Fin agent reports an average 51% autonomous resolution rate across its customer base. The economics work because support tickets are high-volume, the loaded cost of human agents is real, and an incorrectly handled ticket costs almost nothing to fix: route it to a human.
Invoice and document extraction has an underappreciated ROI profile. GravityStack sped up credit agreement review by over 200% using agentic AI for a global bank: turnaround dropped from three-to-five days to under 24 hours (Vellum.ai case study, 2025). More broadly, companies using AP automation cut invoice processing costs by 50-80% and reduce processing time by 58-75%, per Parseur’s 2025 global trends analysis. The supervision ratio stays low because the output format is deterministic (structured fields) and errors are caught before any downstream commitment.
Sales lead qualification delivers 25-30% increases in sales productivity and 15-40% improvements in conversion rates, according to Persana AI’s 2025 analysis of enterprise deployments. One fintech startup increased qualified leads by 215%. The failure tolerance is high — a misqualified lead costs a 15-minute sales call, not a contract dispute.
The three traps in detail
Compliance monitoring, medical triage, and financial execution are traps because the leverage is real but the failure tolerance and required supervision ratios make the economics collapse. These use cases attract teams because the dollar value at stake is enormous — but that dollar value is also the risk if the agent gets it wrong. A compliance miss can trigger regulatory penalties. A triage error has patient safety implications. A financial execution error is often irreversible.
The pattern with traps is that teams model the leverage (big number) but not the failure cost (bigger number). The correct play for these archetypes is usually human-in-the-loop with agent assistance, not autonomous agent execution — which is a fundamentally different product.
The supervision ratio trap: why agents fail after launch, not before
The supervision ratio almost always rises after launch. Teams that don’t model this correctly find themselves maintaining an expensive system that saves less time than they projected — and the problem compounds as scale increases.
Here is the timeline I see repeatedly:
In the pilot phase (months 1-2), one engineer handles 50-100 agent tasks per day with close monitoring. The supervision ratio looks excellent — they’re watching every output, but it’s fast. The ROI calculation looks great.
In the initial production phase (months 3-6), volume scales to 1,000 tasks per day. The same engineer is now responsible for reviewing agent outputs at 20x the volume. They spot-check. Edge cases slip through. The first escalations arrive from customers or downstream systems. Prompt changes get made reactively. The supervision ratio quietly climbs from 0.05 to 0.20.
In the scaling phase (months 7-12), either the team hires dedicated agent reviewers (which shows up in cost as a line item nobody projected) or the agent quality degrades visibly (which shows up in customer complaints or downstream errors).
Enterprise deployments typically require between 0.5 and 3 full-time employees for ongoing oversight depending on complexity — those oversight roles are real labor cost that most ROI models never include. Most teams also spend 15-20 hours per month on prompt maintenance per deployed agent, according to enterprise AI cost guides from 2025.
The fix is to model supervision cost explicitly from the start:
True monthly agent cost =
LLM inference cost
+ infrastructure
+ (supervision_ratio × task_volume × loaded_hourly_cost)
+ amortized build cost
+ prompt maintenance (~15-20 hours/month per deployed agent)
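As a sketch, with the hypothetical dollar figures in the comments standing in for your own numbers:

```python
def true_monthly_cost(inference_cost: float,
                      infrastructure: float,
                      supervision_ratio: float,       # human-minutes per agent action
                      task_volume: int,               # tasks per month
                      loaded_hourly_cost: float,
                      amortized_build: float,
                      maintenance_hours: float = 17.5) -> float:  # midpoint of 15-20 hrs/month
    """Every term of the formula above, in one place."""
    supervision = (supervision_ratio * task_volume / 60) * loaded_hourly_cost
    maintenance = maintenance_hours * loaded_hourly_cost
    return inference_cost + infrastructure + supervision + amortized_build + maintenance

# Hypothetical mid-scale deployment: 20,000 tasks/month, 0.2 ratio, $80/hr loaded cost.
# The supervision term alone is (0.2 * 20000 / 60) * 80 = ~$5,333/month.
print(round(true_monthly_cost(1_500, 400, 0.2, 20_000, 80, 2_000)))  # 10633
```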
For cost management at the infrastructure level, model routing and circuit breakers help. But those reduce inference cost, not supervision cost. Supervision cost is a function of how well the agent handles edge cases — which is an eval problem, not a cost-routing problem.
The teams that succeed long-term obsess over reducing their supervision ratio. They invest in evals that surface edge cases before they hit production (see Agent Evaluation Frameworks). They build output schemas so agents produce structured outputs that downstream systems can validate automatically. They track supervision ratio as a first-class metric, not an afterthought.
How to evaluate your next agent idea in 10 minutes
Three questions, asked in order. If any answer is “no” or “I don’t know,” stop and find out before writing code.
This is the quick-test version of the three-axis framework. It works for initial conversations, not for final go/no-go decisions — but it kills bad ideas fast.
Question 1: What is the loaded cost of the human doing this task today?
Not the salary. The loaded cost: salary + benefits + overhead, divided by productive hours. A $90k/year knowledge worker costs roughly $75-90/hour fully loaded. Multiply by time per task and task volume. If the monthly leverage number is below $5,000, ask whether a simpler automation (a script, a form, a better tool) would work before an agent.
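A back-of-envelope loaded-cost helper, with the overhead multiplier and productive-hours defaults as assumptions you should replace with your own:

```python
def loaded_hourly_cost(annual_salary: float,
                       overhead_multiplier: float = 1.3,      # benefits + overhead (assumption)
                       productive_hours: int = 1_500) -> float:  # hours/year actually on task
    """Fully loaded hourly cost of the human doing the task today."""
    return annual_salary * overhead_multiplier / productive_hours

print(round(loaded_hourly_cost(90_000)))  # 78 -> inside the $75-90/hour range above
```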
Question 2: What happens when the agent gets it wrong — and how often will that happen?
Think through the worst-case wrong action at realistic volume. If the agent handles 10,000 tasks per month with a 2% error rate, that’s 200 wrong actions monthly. What does each cost to fix? If the answer is “we’d need to audit everything” — that’s a supervision ratio problem. If the answer is “it’s annoying but harmless” — that’s high failure tolerance.
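The volume math is worth writing out, because 2% sounds small until it is 200 incidents a month. The per-error fix cost here is hypothetical:

```python
tasks_per_month = 10_000
error_rate = 0.02            # 2% wrong actions
cost_to_fix = 25             # hypothetical dollars per wrong action
wrong_actions = tasks_per_month * error_rate     # 200 per month
print(wrong_actions * cost_to_fix)               # 5000.0 dollars/month just on fixes
```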
Question 3: What does 12-month ongoing supervision look like?
Name the person. Describe their weekly routine for monitoring this agent. If you can’t describe it concretely, you don’t have an operating model. A concrete operating model includes: who reviews edge cases, how often prompts get updated, how escalations route, and what a “breaking change” from the model provider triggers.
Here is the decision flow:
flowchart TD
A[New agent idea] --> B{Monthly leverage > $10k?}
B -- No --> C[Kill or simplify]
B -- Yes --> D{Failure tolerance: High or Medium?}
D -- Low/Catastrophic --> E[Human-in-loop model only]
D -- High/Medium --> F{Supervision ratio < 0.15 achievable?}
F -- No --> G[Redesign task scope or add output constraints]
F -- Yes --> H[Strong commercial case]
G --> I{After redesign: ratio < 0.15?}
I -- No --> C
I -- Yes --> H
H --> J[Build with eval + supervision cost modeled]
The branch to “Human-in-loop model only” is not a failure — it’s a different product. Agent assistance for low-tolerance tasks is often a very good product. But it should be built and priced as a decision-support tool, not an autonomous agent, and its ROI model looks different.
A note on the “redesign task scope” branch: many agent ideas that initially score poorly can be restructured to score well. The invoice extraction use case started as “process all invoices autonomously” (low tolerance, high supervision) and became commercially successful when scoped to “extract structured fields from standard invoices, flag non-standard formats for human review.” That scoping changed the failure tolerance from Medium to High and cut the supervision ratio by 60%.
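For completeness, the decision flow above reduces to a few guard clauses. A sketch using the flowchart’s own $10k and 0.15 thresholds (names are illustrative):

```python
def ten_minute_check(monthly_leverage: float,
                     failure_tolerance: str,
                     achievable_supervision_ratio: float) -> str:
    """The 10-minute economics check, following the flowchart above."""
    if monthly_leverage <= 10_000:
        return "Kill or simplify"
    if failure_tolerance in ("low", "catastrophic"):
        return "Human-in-loop model only (a different product)"
    if achievable_supervision_ratio >= 0.15:
        return "Redesign task scope or add output constraints, then re-check"
    return "Strong case: build with eval + supervision cost modeled"
```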
What this means in practice
- Task automation leverage sets the ceiling. If the math on time × loaded cost × volume doesn’t work, no amount of clever engineering fixes it.
- Failure tolerance determines the required supervision ratio floor. You can’t get supervision below 1.0 for catastrophic-failure tasks unless the agent is acting in an advisory role only.
- Supervision ratio is the number that kills projects after launch. Model it explicitly, including prompt maintenance time (15-20 hours/month per deployed agent) and edge-case review time.
- The three archetypes with the strongest commercial track records share a common structure: high volume, high human loaded cost, and recoverable failure mode. Customer support, document extraction, sales qualification.
- The traps share the opposite structure: enormous leverage that attracts investment, but failure costs that make autonomous operation impossible at any realistic accuracy level.
- Every agent idea should survive a 10-minute economics check before receiving engineering time.
For the deployment decisions that follow from a positive economics check, see Agent Deployment Patterns.
FAQ
How do you calculate the ROI of an AI agent before building it?
Multiply the time saved per task by the loaded cost of the human doing that task, then subtract the agent’s per-task cost — LLM calls plus infrastructure plus amortized build cost. If the margin is positive and the task volume is high enough to recover build cost within 12 months, the ROI case is strong. The harder variable to estimate is supervision ratio: how many human-minutes you still spend per agent action after deployment. Most teams set this to zero in projections and find it’s 0.1-0.3 in practice.
What is task automation leverage for AI agents?
Task automation leverage is the product of time freed per task and the loaded hourly cost of the person doing that task, multiplied by monthly volume. A task that takes a $120/hour paralegal 15 minutes per document, processed 400 times monthly, has $12,000 in monthly leverage. Tasks with low leverage — high volume but very low loaded cost — rarely justify the build and maintenance cost of a production agent, and simpler automations are usually more appropriate.
Why do AI agents fail after launch rather than before?
The supervision ratio rises after launch. In a pilot, one engineer monitors every agent action closely. In production, that same engineer’s attention spreads across thousands of daily actions. Teams that never model ongoing review cost — catching edge cases, handling escalations, maintaining prompts — find supervision eats the margin they projected. Gartner predicts over 40% of agentic AI projects will be scrapped by 2027, with operational cost rather than technical failure as the primary cause.
Which agent use cases have the best commercial track record?
Customer support deflection, document extraction (invoices, contracts, forms), and sales qualification have the strongest commercial records. Klarna’s support agent drove an estimated $40 million in profit improvement and cut resolution time from 11 minutes to under 2 minutes. GravityStack sped up credit agreement review by over 200%. These archetypes share three properties: high task volume, high human loaded cost, and low catastrophic failure risk. A wrong action costs a small fix, not an irreversible outcome.
What is supervision ratio and why does it matter for agent economics?
Supervision ratio is the human-minutes spent per agent action in production — reviewing outputs, catching edge cases, handling escalations, maintaining prompts. An agent with a 0.3 supervision ratio requires 5 hours of human attention per 1,000 daily tasks. Most teams model zero ongoing supervision in their ROI projections. Real production deployments typically run 0.1 to 0.3, which means 1-3 people spending meaningful time each day on a mid-scale deployment. Sustainable agents target below 0.1.
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch