Architecting Conversational AI Systems
“A voice assistant is more than a speech recognizer attached to a search engine. It is a stateful entity that must navigate the social nuances of human turn-taking and intent.”
TL;DR
Conversational AI systems mark the transition from single-shot voice commands to multi-turn, stateful interactions. The five core modules – ASR, NLU, dialog state tracking, dialog policy, and TTS – must together operate within a tight 700ms latency budget to feel natural. Streaming orchestration is the key enabler: NLU starts while the user is still speaking, and TTS streams audio before generation completes. Advanced features like barge-in handling (with acoustic echo cancellation), contextual ASR boosting (biasing the language model based on dialog state), and coreference resolution separate production systems from demos. For the voice agent security dimension, see the security for voicebots post. For the underlying speech-to-speech architecture, see speech-to-speech agents.

1. Introduction: Beyond Single-Shot Commands
The first generation of voice interfaces (2014-2018) defined the era of Single-Shot Commands:
- “Hey Google, turn on the lights.”
- “Alexa, what’s the difference between a sloth and a koala?”
The technical architecture was simple: Wake Word → ASR → NLU → Action → Forget. The system had no memory. Every interaction was “Day 0.”
Conversational AI (2025+) is the transition to Multi-turn Interaction. It requires the system to maintain a Dialog State across minutes or even hours of conversation. It must handle interruptions (Barge-In), corrections (“No, I meant Boston, UK”), and anaphora (“Send it to him”).
In this post, we architect a full-stack Conversational AI system. We will not look at high-level API wrappers; we will dissect the Dialog Manager, the handling of Full-Duplex Audio, and the Latency Budgeting required to make a machine feel human.
2. The Functional Blueprint: The Five Core Modules
- ASR (Acoustic-to-Text): Translating vibrations into a list of candidate sentences.
- NLU (Intent & Entity): Decoding the user’s meaning and parameters.
- DST (Dialog State Tracking): The “Short-term Memory” of the current flow.
- DP (Dialog Policy): The “Decision Maker” that decides the next system action.
- NLG & TTS (Text-to-Speech): Generating a natural response and speaking it.
3. High-Level Architecture: The Latency Budget
In a voice conversation, a delay of > 700ms feels “awkward.” A delay of > 2000ms is a system failure.
| Module | Latency Budget |
|---|---|
| ASR | 200ms |
| NLU & DST | 100ms |
| External API (e.g., Weather) | 300ms |
| TTS (First chunk) | 100ms |
| Total | 700ms |
To achieve this, we use Streaming Orchestration. NLU starts processing while the user is still speaking. TTS starts speaking while the audio is still being generated.
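The pipelining idea can be sketched with a toy `asyncio` loop. The ASR partials here are simulated (no real model involved), and the two-word trigger for starting NLU is an arbitrary placeholder; the point is only that NLU kicks off mid-utterance instead of waiting for the final transcript:

```python
import asyncio

async def asr_stream(audio_chunks):
    """Yield growing partial transcripts as audio arrives (simulated ASR)."""
    partial = []
    for chunk in audio_chunks:
        await asyncio.sleep(0)  # stand-in for per-chunk decode latency
        partial.append(chunk)
        yield " ".join(partial)

async def run_turn(audio_chunks):
    """Consume ASR partials; start NLU before the user finishes speaking."""
    nlu_started_at = None
    final = ""
    chunk_index = 0
    async for transcript in asr_stream(audio_chunks):
        chunk_index += 1
        final = transcript
        # Arbitrary trigger: once we have two words, NLU can start in parallel.
        if nlu_started_at is None and len(transcript.split()) >= 2:
            nlu_started_at = chunk_index
    return final, nlu_started_at

final, started = asyncio.run(run_turn(["book", "a", "flight", "to", "boston"]))
print(final)    # "book a flight to boston"
print(started)  # 2 – NLU began at the second chunk, mid-utterance
```

In a real system the NLU call would run as a concurrent task that is re-fed (or cancelled and restarted) as partials stabilize; the sketch only shows where in the stream that work begins.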
4. Implementation: The Dialog State Machine
In complex domains like flight booking, simple “intent matching” fails. We use a Frame-based State Machine.
```python
class DialogManager:
    def __init__(self):
        self.state = "IDLE"
        self.slots = {"destination": None, "date": None}

    def process_turn(self, nlu_result):
        # 1. Update slots from NLU entities
        for entity in nlu_result.entities:
            if entity.type in self.slots:
                self.slots[entity.type] = entity.value

        # 2. Transition logic (the state machine)
        if not self.slots["destination"]:
            self.state = "AWAIT_DESTINATION"
            return "Where are you flying to?", self.state
        if not self.slots["date"]:
            self.state = "AWAIT_DATE"
            return f"Understood. When do you want to fly to {self.slots['destination']}?", self.state
        self.state = "FINISHED"
        return "Everything is set. Booking now.", self.state
```
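To make the slot-filling behavior concrete, here is a self-contained version of the same frame-based manager, with stub `Entity`/`NLUResult` types (hypothetical stand-ins for a real NLU output), driven through a three-turn booking exchange:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Entity:
    type: str
    value: str

@dataclass
class NLUResult:
    entities: List[Entity]

class DialogManager:
    def __init__(self):
        self.state = "IDLE"
        self.slots = {"destination": None, "date": None}

    def process_turn(self, nlu_result):
        for entity in nlu_result.entities:
            if entity.type in self.slots:
                self.slots[entity.type] = entity.value
        if not self.slots["destination"]:
            self.state = "AWAIT_DESTINATION"
            return "Where are you flying to?", self.state
        if not self.slots["date"]:
            self.state = "AWAIT_DATE"
            return f"Understood. When do you want to fly to {self.slots['destination']}?", self.state
        self.state = "FINISHED"
        return "Everything is set. Booking now.", self.state

dm = DialogManager()
# Turn 1: "Book me a flight" -> no slot-filling entities yet
reply, state = dm.process_turn(NLUResult(entities=[]))
print(state)  # AWAIT_DESTINATION
# Turn 2: "To Boston"
reply, state = dm.process_turn(NLUResult(entities=[Entity("destination", "Boston")]))
print(state)  # AWAIT_DATE
# Turn 3: "Next Friday"
reply, state = dm.process_turn(NLUResult(entities=[Entity("date", "next Friday")]))
print(state)  # FINISHED
```

Note that the user never has to provide everything in one utterance: each turn fills whatever slots the NLU extracted, and the state machine asks only for what is still missing.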
5. Advanced: Handling “Speech Interruptions” (Barge-in)
How does a system know when to stop talking if the user interrupts?
- Full-Duplex Audio: The system is always “Listening” while “Speaking.”
- Acoustic Echo Cancellation (AEC): The system must “subtract” its own voice from the incoming audio stream.
- Interrupt Latency: If interrupt detection is slow, the system keeps talking over the user and feels less intelligent.
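A minimal illustration of AEC-gated barge-in detection: subtract a scaled copy of the system's own playback from the mic signal, then flag a barge-in if meaningful energy remains. Real AEC uses adaptive filters (e.g., NLMS) rather than a fixed gain, and the gain and threshold values here are placeholders:

```python
import math

def rms(samples):
    """Root-mean-square energy of a frame."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def detect_barge_in(mic_frame, playback_frame, echo_gain=0.9, threshold=0.05):
    """Naive AEC: subtract a scaled copy of what the system is playing,
    then flag a barge-in if meaningful energy remains (the user's voice)."""
    residual = [m - echo_gain * p for m, p in zip(mic_frame, playback_frame)]
    return rms(residual) > threshold

# System speaking, user silent: the mic picks up only an attenuated echo.
playback = [0.5, -0.5, 0.5, -0.5]
mic_echo_only = [0.45, -0.45, 0.45, -0.45]
print(detect_barge_in(mic_echo_only, playback))  # False – no barge-in

# User speaks over the system: extra energy survives cancellation.
mic_with_user = [0.9, -0.1, 0.8, -0.2]
print(detect_barge_in(mic_with_user, playback))  # True – stop TTS playback
```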
6. Real-time Implementation: Contextual ASR
The ASR model shouldn’t just be a general model. It should be State-Aware.
- If the Dialog Manager is in the `AWAIT_DATE` state, the ASR’s Language Model should be weighted toward numbers and months.
- This “Contextual Boosting” reduces the WER significantly for ambiguous words.
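One way to picture this is as state-aware rescoring of the ASR's n-best list, where words relevant to the current dialog state receive a score bonus. The scores, boost table, and `rescore` helper below are illustrative, not any real ASR vendor's API:

```python
# Boost vocabularies keyed by dialog state (illustrative values).
CONTEXT_BOOSTS = {
    "AWAIT_DATE": {"january": 2.0, "friday": 2.0, "tomorrow": 2.0, "fourth": 2.0},
    "AWAIT_DESTINATION": {"boston": 2.0, "austin": 2.0, "london": 2.0},
}

def rescore(nbest, state):
    """Pick the best hypothesis after adding state-dependent word bonuses."""
    boosts = CONTEXT_BOOSTS.get(state, {})
    def score(hyp):
        text, base_logprob = hyp
        bonus = sum(boosts.get(word, 0.0) for word in text.split())
        return base_logprob + bonus
    return max(nbest, key=score)

# Acoustically, "forth" and "fourth" are near-identical; dialog state breaks the tie.
nbest = [("back and forth", -4.0), ("the fourth", -4.2)]
print(rescore(nbest, "AWAIT_DATE")[0])  # "the fourth"
print(rescore(nbest, "IDLE")[0])        # "back and forth" – no boost applies
```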
7. Comparative Analysis: Rule-based vs. LLM-based Dialog
| Metric | Rule-based (Rasa/Dialogflow) | LLM-based (GPT-4 / Claude) |
|---|---|---|
| Control | Absolute | Probabilistic |
| Consistency | High | Medium |
| Complexity | High (Human setup) | Low (Zero-shot) |
| Best For | Banking / Booking | General Chat / Therapy |
8. Failure Modes in Conversational AI
- Dialog State Collapse: The user changes their mind multiple times, and the state machine gets stuck.
- Mitigation: Implement a “Reset State” command and confidence threshold.
- Entity Confusion: Ambiguity in multi-entity sentences.
- Vocal Sarcasm: Misinterpreting positive words used sarcastically.
9. Real-World Case Study: Amazon Alexa’s Contextual Reasoning
Alexa uses a specialized component for Short-term Goal Tracking.
- If you ask “How tall is the Eiffel Tower?” and then “When was it built?”, Alexa performs Coreference Resolution to map “It” back to “Eiffel Tower” from previous turns.
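A deliberately naive recency-based sketch of this kind of coreference resolution follows; production systems use trained coreference models over the full dialog history, and `EntityMemory` is a hypothetical helper, not Alexa's actual component:

```python
PRONOUNS = {"it", "they", "he", "she", "him", "her", "them"}

class EntityMemory:
    """Track entities mentioned in earlier turns; resolve pronouns by recency."""
    def __init__(self):
        self.recent = []  # most recently mentioned entity last

    def observe(self, entities):
        self.recent.extend(entities)

    def resolve(self, utterance):
        words = set(utterance.lower().replace("?", "").split())
        if PRONOUNS & words and self.recent:
            return self.recent[-1]  # recency heuristic: last-mentioned entity
        return None

memory = EntityMemory()
memory.observe(["Eiffel Tower"])             # from "How tall is the Eiffel Tower?"
print(memory.resolve("When was it built?"))  # Eiffel Tower
```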
10. Key Takeaways
- State is Soul: A conversation is a trajectory through a state space.
- Latency is the UI: A slow voice system is a broken voice system.
- Hybridization wins: Use LLMs for general chat but hard state machines for transactions.
- Feedback Loops: Use data from failed turns to refine your state transitions.
FAQ
What is the latency budget for a conversational AI system?
A voice conversation feels awkward above 700ms delay and broken above 2000ms. The typical budget allocates 200ms for ASR, 100ms for NLU and dialog state tracking, 300ms for external API calls, and 100ms for TTS first chunk. Streaming orchestration is essential – NLU starts processing while the user is still speaking, and TTS starts before audio generation completes.
How does a dialog state machine work for voice assistants?
A frame-based dialog state machine maintains slots (like destination and date for a flight booking) that are filled progressively from NLU entity extraction. The state transitions determine what question to ask next based on which slots are missing, enabling multi-turn conversations that feel natural rather than requiring the user to provide everything in one utterance.
What is barge-in and how do conversational AI systems handle it?
Barge-in is when a user interrupts the system while it is speaking. Full-duplex audio systems listen continuously while speaking, using Acoustic Echo Cancellation (AEC) to subtract the system’s own voice from the incoming audio stream. Fast interrupt detection is critical for making the system feel intelligent and responsive.
When should you use rule-based versus LLM-based dialog management?
Rule-based systems (Rasa, Dialogflow) offer absolute control and high consistency, making them ideal for banking and booking flows where predictability matters. LLM-based systems (GPT-4, Claude) offer low setup complexity and zero-shot capabilities, better suited for general chat and open-domain conversations. Hybrid approaches combine hard state machines for transactions with LLMs for general interaction.
Originally published at: arunbaby.com/speech-tech/0058-conversational-ai-system
Want to work together?
I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.
Get in touch