Hybrid agentic workflows: when to use an LLM node vs a code node
“If you can write a unit test for it, it should not be an LLM call.”
TL;DR
Every agentic workflow mixes reasoning (flexible, slow, expensive) with execution (rigid, fast, cheap). Most teams default to LLM nodes for everything — paying 3-5x in latency and cost for tasks that have deterministic solutions. HyEvo’s research shows evolved hybrid graphs achieve up to 19x cost reduction on code generation tasks. This post derives a decision framework: code when the transformation is unambiguous, LLM when the input space is too large, hybrid supervisor when you need both. For the static workflow patterns this builds on, see agent workflow patterns.

Why do teams over-use LLM nodes?
Because LLM nodes are easy to write and hard to get wrong at demo time.
Need to extract a date from text? An LLM call handles it in one prompt. Need to validate JSON? An LLM can check it. Need to route a request to the right handler? Ask the LLM. Each of these works reliably in development with a few dozen test cases.
The problem surfaces at production scale. That date extraction runs 50,000 times per day. The JSON validation sits in a hot path called on every API request. The router handles every incoming message. At those volumes, the gap between an LLM node and a code node becomes painful:
| Operation | Code node | LLM node | Ratio |
|---|---|---|---|
| JSON validation | <1ms, ~$0 | 200-500ms, ~$0.01 | up to 500x latency |
| Date extraction | <1ms (regex) | 300-800ms, ~$0.02 | up to 800x latency |
| Format conversion | <1ms | 200-400ms, ~$0.01 | up to 400x latency |
| Intent classification | N/A (too many intents) | 200-500ms, ~$0.02 | LLM needed |
| Free-text summarization | N/A | 500-2000ms, ~$0.05 | LLM needed |
The bottom two rows are where LLM nodes earn their cost. The top three are where code nodes should replace them. The pattern: if the transformation has a finite, enumerable input-output mapping, code wins. If the input space is open-ended or the task requires language understanding, the LLM is necessary.
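To make the top rows concrete, here is the date-extraction step as a code node. This is a minimal sketch that assumes ISO-formatted dates; real inputs would need more patterns, but every pattern stays unit-testable:

```python
import re
from datetime import date

# Deterministic date extractor for ISO-formatted dates. The single pattern
# is an illustrative assumption, not a universal parser.
ISO_DATE = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")

def extract_date(text: str) -> date | None:
    """Return the first ISO date found in `text`, or None."""
    m = ISO_DATE.search(text)
    if m is None:
        return None
    try:
        return date(*map(int, m.groups()))
    except ValueError:  # "2024-13-45" matches the pattern but is not a date
        return None

assert extract_date("Invoice issued 2024-03-15, net 30") == date(2024, 3, 15)
assert extract_date("no date here") is None
```

It runs in well under a millisecond, costs nothing per call, and its behavior is fully pinned down by its tests.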
How do you decide which node type a step needs?
Three questions, in order.
1. Can I write a unit test that covers 95%+ of inputs?
If yes → code node. JSON schema validation, regex-based extraction, format conversion, arithmetic, API response parsing, error code routing — these are all testable with deterministic assertions. An LLM adds nothing here except cost and latency.
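As a sketch of what "testable with deterministic assertions" means in practice, here is a JSON-validation code node using the `jsonschema` package; the schema itself is a placeholder for your real contract:

```python
from jsonschema import ValidationError, validate

# Illustrative schema -- substitute your own contract.
ORDER_SCHEMA = {
    "type": "object",
    "properties": {
        "order_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
    "required": ["order_id", "amount"],
}

def validate_order(payload: dict) -> bool:
    """Deterministic check: the payload either matches the schema or it doesn't."""
    try:
        validate(instance=payload, schema=ORDER_SCHEMA)
        return True
    except ValidationError:
        return False

# The unit test *is* the spec -- no prompt engineering required.
assert validate_order({"order_id": "A-1", "amount": 9.99})
assert not validate_order({"order_id": "A-1", "amount": -5})
```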
2. Is the input space too large to enumerate?
If yes → LLM node. Free-text classification where users can say the same thing a thousand different ways. Summarization of arbitrary documents. Deciding which of 50 tools to call based on a natural language request. These tasks have combinatorial input spaces that rules cannot cover.
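For contrast, here is a sketch of an LLM classification node using the OpenAI SDK. The model name and label set are assumptions, and the final guard matters because the model can reply off-list:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

INTENTS = ["billing", "technical", "account", "feedback", "other"]

def classify_intent(message: str) -> str:
    """Open-ended input, closed label set: a job for an LLM node."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # model choice is an assumption
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "Classify the user message into one of: "
                           f"{', '.join(INTENTS)}. Reply with the label only.",
            },
            {"role": "user", "content": message},
        ],
    )
    label = resp.choices[0].message.content.strip().lower()
    return label if label in INTENTS else "other"  # guard against off-list replies
```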
3. Does the step need both reasoning and execution?
If yes → hybrid pattern. The LLM decides what to do; code does it. An LLM classifies the user’s intent, then a code node routes to the correct handler. An LLM plans which database queries to run, then a code node executes them and validates results.
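In code, the hybrid split looks roughly like this: the LLM picks a label, deterministic code acts on it. `classify_intent` is the LLM node sketched above; the handlers and `generate_custom_response` are hypothetical stand-ins for real logic:

```python
def handle_billing(message: str) -> str:
    return "billing: fetch invoice, format response"

def handle_technical(message: str) -> str:
    return "technical: search knowledge base, retrieve docs"

HANDLERS = {"billing": handle_billing, "technical": handle_technical}

def route(message: str) -> str:
    intent = classify_intent(message)    # LLM node: flexible decision
    handler = HANDLERS.get(intent)       # code node: deterministic dispatch
    if handler is None:
        return generate_custom_response(message)  # hypothetical LLM generation node
    return handler(message)              # code node: fast, cheap execution
```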
```mermaid
graph TD
  A[New workflow step] --> B{Can I write<br/>unit tests for 95%<br/>of inputs?}
  B -->|Yes| C[Code node]
  B -->|No| D{Does it need<br/>language understanding<br/>or generation?}
  D -->|Yes| E[LLM node]
  D -->|No| F[Probably code node<br/>with edge-case fallback]
  E --> G{Does execution<br/>need to be<br/>deterministic?}
  G -->|Yes| H[Hybrid: LLM decides,<br/>code executes]
  G -->|No| E
```
What does the hybrid supervisor pattern look like?
The most common production pattern concentrates LLM calls at decision points and code at execution points.
```mermaid
graph LR
  A[User input] --> B[LLM Router<br/>Classify intent]
  B -->|billing| C[Code: fetch invoice<br/>format response]
  B -->|technical| D[Code: search KB<br/>retrieve docs]
  B -->|complex| E[LLM: generate<br/>custom response]
  C --> F[Code: validate output<br/>apply template]
  D --> F
  E --> F
  F --> G[Response]
```
The LLM node (Router) handles the one step that requires flexibility: classifying ambiguous user intent. Everything downstream is code: database queries, template formatting, output validation. Every request still pays for one short router call, but if 70% of requests route to billing or technical (the code paths), only the remaining 30% incur the heavier LLM generation step. Cost drops accordingly.
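A back-of-envelope model makes that proportionality concrete. The per-call prices below are illustrative assumptions, not provider quotes:

```python
# Cost model for the supervisor pattern above (all prices are assumptions).
requests_per_day = 50_000
router_cost = 0.002       # short classification call, paid on every request
generation_cost = 0.02    # heavier generation call, paid only on the "complex" path
complex_fraction = 0.30

pure_llm = requests_per_day * generation_cost  # baseline: every request generated
hybrid = requests_per_day * (router_cost + complex_fraction * generation_cost)

print(f"pure LLM: ${pure_llm:,.0f}/day, hybrid: ${hybrid:,.0f}/day")
# pure LLM: $1,000/day, hybrid: $400/day -- 2.5x cheaper at these prices
```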
A real document processing pipeline I have seen in production illustrates the cost difference:
Before (pure LLM): Extract fields → Validate schema → Normalize dates → Classify document type → Generate summary. Five LLM calls per document. At 10,000 documents per day: ~$1,500/month in API costs, P95 latency 8 seconds.
After (hybrid): Classify document type (LLM) → Extract fields (code, regex + rules) → Validate schema (code, JSON schema) → Normalize dates (code, dateutil) → Generate summary (LLM). Two LLM calls per document. Cost: ~$600/month. P95 latency: 3 seconds. Same accuracy on the extraction and validation steps — they were deterministic to begin with.
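The hybrid pipeline has roughly this shape. This is a sketch: `classify_document`, `extract_fields`, `validate_fields`, and `summarize` are hypothetical helpers standing in for the real nodes:

```python
from dateutil import parser as dateparse

def process_document(text: str) -> dict:
    doc_type = classify_document(text)       # LLM call 1: open-ended classification
    fields = extract_fields(text, doc_type)  # code: regex + per-type rules
    validate_fields(fields, doc_type)        # code: JSON schema, raises on mismatch
    for key in ("issued_date", "due_date"):  # code: deterministic normalization
        if key in fields:
            fields[key] = dateparse.parse(fields[key]).date().isoformat()
    fields["summary"] = summarize(text)      # LLM call 2: language generation
    return fields
```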
What did HyEvo discover about optimal hybrid graphs?
HyEvo (arXiv 2603.19639) automates this design decision. Rather than a human deciding which nodes should be LLM vs code, HyEvo uses evolutionary search to discover optimal graphs.
The system represents workflows as directed acyclic graphs with typed nodes. LLM nodes have a backbone model, instructions, and temperature. Code nodes have synthesized source code with typed I/O signatures. Evolution generates new graph topologies, evaluates them on task performance and efficiency, and selects for the best designs.
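In code, that representation might look roughly like this. The field names follow the paper's description, but the structure is my sketch, not HyEvo's actual API:

```python
from dataclasses import dataclass, field

@dataclass
class LLMNode:
    name: str
    model: str          # backbone model
    instructions: str   # prompt / role description
    temperature: float

@dataclass
class CodeNode:
    name: str
    source: str         # synthesized source code
    input_type: str     # typed I/O signature
    output_type: str

@dataclass
class WorkflowGraph:
    nodes: dict[str, LLMNode | CodeNode] = field(default_factory=dict)
    edges: list[tuple[str, str]] = field(default_factory=list)  # DAG: (src, dst)
```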
The reflect-then-generate mechanism is where the intelligence lives. A meta-agent examines why a workflow failed — was the LLM node hallucinating on a task that code could handle deterministically? Was the code node too rigid for inputs it had never seen? — and proposes structural changes. Replace this LLM node with code. Add a fallback LLM node for edge cases this code node cannot handle. Merge these two sequential LLM nodes into one.
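The structural edits the reflect step proposes can be pictured as mutations over the graph sketch above. This is a loose illustration of two of the operators named in the text, not the paper's implementation:

```python
def replace_with_code(graph: WorkflowGraph, node_name: str, source: str) -> None:
    """Swap an LLM node for a synthesized code node in the same graph position."""
    graph.nodes[node_name] = CodeNode(
        name=node_name,
        source=source,
        input_type="str",   # placeholder signature
        output_type="str",
    )

def add_llm_fallback(graph: WorkflowGraph, code_node: str, fallback: LLMNode) -> None:
    """Attach an LLM node downstream of a code node to catch edge-case inputs."""
    graph.nodes[fallback.name] = fallback
    graph.edges.append((code_node, fallback.name))
```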
The results: up to 19x cost reduction and up to 16x latency reduction on code generation tasks compared to AFlow (the prior state of the art in automated workflow optimization), with comparable or better accuracy across GSM8K (93.36%), MATH (53.91%), and HumanEval (93.89%).
The 19x number deserves scrutiny. It likely comes from AFlow using more LLM calls than necessary — the same over-reliance on LLM nodes that human-designed workflows exhibit. HyEvo finds the code-node replacements that humans miss because they optimize for development speed, not runtime cost.
For more on how these evolved architectures relate to the broader self-evolution trend, see self-evolving agent architectures.
What are the risks of aggressive code-node replacement?
Two failure modes to watch for.
Brittleness at the edges. A code node that handles 95% of inputs fails silently on the other 5%. An LLM node handles the edges gracefully, even if slowly. The mitigation: add an LLM fallback. When the code node encounters an input it cannot process (parsing failure, confidence below threshold, unrecognized format), route it to an LLM node. You then pay LLM cost and latency on roughly 5% of requests instead of 100%.
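A minimal sketch of that fallback shape; the invoice format and the `llm_extract_total` LLM node are hypothetical:

```python
import re

def extract_total(invoice_text: str) -> float:
    # ~95% case: deterministic, fast, free
    m = re.search(r"total[:\s]*\$?([\d,]+\.\d{2})", invoice_text, re.IGNORECASE)
    if m:
        return float(m.group(1).replace(",", ""))
    # ~5% edge case: unrecognized format, fall back to the flexible path
    return llm_extract_total(invoice_text)  # hypothetical LLM fallback node
```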
Maintenance burden. Code nodes require traditional software maintenance — bug fixes, input format changes, dependency updates. LLM nodes adapt to format changes automatically (within limits). If your input schema changes monthly, the maintenance cost of code nodes may exceed the inference cost savings. Stable interfaces favor code nodes. Volatile interfaces favor LLM nodes.
Key takeaways
- If you can unit-test it, use code. JSON validation, format conversion, date extraction, API parsing — none of these need an LLM.
- If the input space is open-ended, use an LLM. Free-text classification, summarization, tool selection — these require language understanding.
- The hybrid supervisor pattern concentrates cost. LLM for decisions, code for execution. Most production workflows should use LLM nodes for 20-40% of steps, code for the rest.
- HyEvo proves 19x savings are achievable. Evolved hybrid graphs replace unnecessary LLM nodes with code, cutting cost and latency dramatically.
- Add LLM fallbacks to code nodes. Handle the 95% case with code, the 5% edge case with an LLM. Best of both worlds.
Further reading
- Self-evolving agent architectures — how HyEvo and SAGE automatically optimize workflow design
- Agent workflow patterns — the foundational patterns for static workflows
- Cost management for agents — FinOps for LLM-based agent systems
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch