Designing a Global-Scale Conversational AI: The Chatbot System Design
“Building a chatbot that responds is easy. Building a conversational system that remembers, reasons, and scales to millions of concurrent users without melting your GPU cluster is an engineering feat of the highest order.”
TL;DR
A production chatbot at 200M DAU requires three architectural layers: ingress (API gateway with WebSocket/SSE support), orchestration (context management, RAG retrieval, prompt construction), and intelligence (GPU inference with vLLM and PagedAttention). The math is staggering: 10 billion daily requests, 300K peak RPS, and 5TB of new text per day. PagedAttention enables efficient KV cache management, speculative decoding with a small drafter model roughly doubles throughput, and a hierarchical memory system from VRAM to vector databases provides both session context and long-term recall. For the optimization techniques that power the inference layer, see the inference optimization guide and the advanced caching strategies that reduce GPU load.

1. Introduction: Beyond the Text Box
When we talk about a “Chatbot” today, we aren’t talking about the rule-based decision trees of 2015. We are talking about Large Language Model (LLM) powered entities that function as interfaces to human knowledge. From an engineering perspective, a system like ChatGPT or Claude is not just a model; it is a complex, stateful, multi-modal distributed system.
At the intersection of Google’s search infrastructure and Meta’s social graph lies the challenge of the modern Chatbot: it must be as fast as a search engine, as personal as a private message, and as reliable as a banking transaction.
In this deep dive, we will architect a chatbot system capable of handling 200 million daily active users (DAU), supporting real-time streaming, long-term memory, and Retrieval-Augmented Generation (RAG). This is the “Principal Engineer’s” blueprint for the most important software paradigm shift of the decade.
2. Problem Statement & Requirements
To design a system, we must first define the bounds of the sandbox.
2.1 Functional Requirements
- Low-Latency Conversational Interface: Users expect “streaming” responses (Time To First Token < 500ms).
- Stateful Conversations: The system must maintain context over multiple turns (Chat History).
- Multi-Modal Support: Ability to process text, images, and documents (PDFs) within the conversation.
- Retrieval-Augmented Generation (RAG): The chatbot must be able to “search” for facts outside its training data to provide accurate, up-to-date information.
- Multi-Platform Access: Web, mobile, and API access with consistent state.
- Human-like Personality & Safety: Guardrails against toxic content, PII leakage, and hallucinations.
2.2 Non-Functional Requirements
- High Availability: 99.9% uptime (global availability).
- Scalability: Handling 100k+ requests per second (RPS) during peak times.
- Consistency: Chat history must be consistent across devices.
- Cost Efficiency: Managing the high cost of GPU inference through optimizations.
- Data Security: Strict adherence to GDPR/CCPA, ensuring user data privacy.
2.3 Back-of-the-Envelope Calculations (The Reality Check)
- Users: 200M DAU.
- Average Sessions per User: 5 sessions/day.
- Average Turns per Session: 10 turns.
- Total Daily Inference Requests: 200M * 5 * 10 = 10 Billion requests/day.
- Requests Per Second (Average): 10B / 86400 ≈ 115,000 RPS.
- Peak RPS: ~2.5x average ≈ 300,000 RPS.
- Storage: 10B turns/day * 500 characters/turn ≈ 5 Terabytes/day of text history.
- Compute: A single H100-class GPU running vLLM might handle ~50-100 concurrent streams depending on model size (a 70B model such as Llama-3 70B needs multiple GPUs per replica). We need thousands of GPUs.
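These estimates can be reproduced in a few lines; the constants are exactly the assumptions listed above:

```python
# Back-of-the-envelope capacity model using the assumptions above.
DAU = 200_000_000          # daily active users
SESSIONS_PER_USER = 5
TURNS_PER_SESSION = 10
CHARS_PER_TURN = 500       # ~1 byte per character
PEAK_FACTOR = 2.5
SECONDS_PER_DAY = 86_400

daily_requests = DAU * SESSIONS_PER_USER * TURNS_PER_SESSION
avg_rps = daily_requests / SECONDS_PER_DAY
peak_rps = avg_rps * PEAK_FACTOR
daily_text_tb = daily_requests * CHARS_PER_TURN / 1e12

print(f"Daily requests: {daily_requests / 1e9:.0f}B")
print(f"Average RPS:    {avg_rps:,.0f}")
print(f"Peak RPS:       {peak_rps:,.0f}")
print(f"New text/day:   {daily_text_tb:.1f} TB")
```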
3. High-Level Architecture: The Nervous System of AI
The architecture is split into three primary layers: the **Ingress Layer**, the **Orchestration Layer**, and the **Intelligence Layer**. This separation ensures that scaling the user interface doesn't require scaling the expensive GPU resources proportionally.
### 3.1 The Global Blueprint
```text
[ USER CLIENTS ]
(Web, iOS, Android, Desktop)
|
v
[ EDGE NETWORK ] <----------------> [ GLOBAL LOAD BALANCER ]
(Cloudflare/Akamai) (Anycast IP, TLS Term)
|
v
[ API GATEWAY / BORDER ] <--------> [ AUTH & RATE LIMITER ]
(Kong / Envoy / App Mesh) (JWT, OAuth2, Redis Quotas)
|
+-----------------------+
| |
v v
[ ORCHESTRATION LAYER ] <-------> [ CHAT ORCHESTRATOR ] [ AGENT WORKERS ]
(Go / Rust Microservices) (Context Mgmt, RAG) (Tool Use, Python)
| |
+-----------+-----------+
|
v
[ STORAGE & STATE ] <-----------+-----------+-------+-------+-----------+
| | | |
[ CACHE ] [ HISTORY DB ] [ VECTOR DB ] [ BLOB STORE ]
(Redis) (Cassandra) (Pinecone) (S3/GCS)
| | | |
+-----------+-------+-------+-----------+
|
v
[ INTELLIGENCE LAYER ] <------------------> [ INFERENCE ENGINE ]
(GPU Clusters / K8s) (vLLM / TensorRT-LLM)
|
[ LLM / VFM MODELS ]
(Llama 3, GPT-4o, etc.)
```
3.2 Component Roles in Detail
- API Gateway: Handles request routing, SSL termination, and provides a unified interface. For a chatbot, this gateway must support long-lived connections (WebSockets) and streaming responses (SSE).
- Chat Orchestrator: This is the stateful (or semi-stateful) layer that coordinates between history databases, vector databases, and the inference engine. It is responsible for prompt engineering: transforming the raw user input into a rich, context-aware prompt.
- Inference Engine: The actual GPU-accelerated service. We use specialized engines like vLLM because they implement advanced memory management (PagedAttention) which is the difference between serving 10 users per GPU or 1,000.
- Vector Database: Stores high-dimensional representations (embeddings) of structured and unstructured knowledge. When a user asks a specific question, we query this database to find relevant documents to “ground” the model’s response.
- History Database: A high-write-throughput database like Cassandra or ScyllaDB. Every turn in every conversation is a record. At scale, this database grows by petabytes.
4. Requirement Deep-Dive: Scaling the “Unscalable”
4.1 Traffic Analysis (The Principal Engineer’s Spreadsheet)
Let’s look at the numbers again. If we have 200 million daily active users (DAU), and they average 50 messages a day (typical for power users), we are looking at 10 billion messages.
- Inbound Bandwidth: 10B * 1KB/message ≈ 10 TB/day.
- Outbound Bandwidth: LLM responses are longer. 10B * 4KB/response ≈ 40 TB/day.
- Write Operations: 20B writes/day (1 for user, 1 for bot) ≈ 230k writes per second.
- Storage (7-year retention): 50 TB/day * 2500 days ≈ 125 Petabytes.
This is not a “simple” database problem. This requires a sharded, multi-regional database architecture where we partition by session_id or user_id.
5. Component Deep-Dives: Engineering the Invisible
5.1 The Orchestrator: Context Management & The Sliding Window
An LLM has a finite “context window” (e.g., 128k tokens for GPT-4). However, as a conversation grows, you can’t just keep appending messages. Eventually, you run out of space, and the cost per request explodes (since you pay per token processed).
Strategies for Context Truncation:
- First-In-First-Out (FIFO): Drop the oldest messages. Simple but loses the “start” of the conversation which often contains instructions.
- Summarization Cache: When the context reaches 75% of the limit, trigger an async job to summarize the oldest 50% into a concise paragraph. Replace those 50% tokens with the summary.
- Selective Memory (Entity-Based): Extract key entities (User’s name, preferences, discussed topics) and store them in a structural “User Profile” in the History DB. Inject this profile into every prompt.
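A minimal sketch of the summarization-cache strategy above. `summarize` is a placeholder for an LLM call, and the 4-characters-per-token heuristic is a rough assumption, not a real tokenizer:

```python
# Sketch of context-window management: trigger summarization of the
# oldest half of the history once usage crosses a threshold.
from typing import Callable, List, Tuple

MAX_TOKENS = 4096
SUMMARIZE_AT = 0.75  # trigger when context reaches 75% of the limit

def count_tokens(text: str) -> int:
    # Crude stand-in: ~4 characters per token for English text.
    return max(1, len(text) // 4)

def manage_context(
    messages: List[Tuple[str, str]],                      # (role, content)
    summarize: Callable[[List[Tuple[str, str]]], str],    # stand-in LLM call
) -> List[Tuple[str, str]]:
    total = sum(count_tokens(c) for _, c in messages)
    if total < MAX_TOKENS * SUMMARIZE_AT:
        return messages
    # Summarize the oldest 50% of messages into one synthetic turn.
    cut = len(messages) // 2
    summary = summarize(messages[:cut])
    return [("system", f"Summary of earlier conversation: {summary}")] + messages[cut:]
```

In production the `summarize` call runs asynchronously so the user never waits on it.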
5.2 The LLM Serving Engine: The Math of KV Caching
Why do we need specialized engines? To understand this, we must understand the KV Cache. During LLM generation, the model predicts one token at a time. To predict the $N$-th token, it needs the activations of all previous $N-1$ tokens. Instead of recomputing these every time (which would be $O(N^2)$), we store the results in a Key-Value (KV) cache.
The Memory Problem: A Llama-3 70B model with a 128k context window:
- KV cache size per token: ~1 MB (depending on precision and layer count).
- Max context for one user: 128,000 tokens * 1MB ≈ 128 GB.
- A single A100 GPU has only 80GB of VRAM.
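The per-token arithmetic can be made explicit with the standard KV-cache size formula. A sketch assuming a Llama-3-70B-like shape (80 layers, head dimension 128, fp16); the per-token cost depends heavily on whether the model uses full multi-head attention (MHA) or grouped-query attention (GQA), which is why quoted figures like the ~1 MB above vary:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

# Llama-3-70B-like shape: 80 layers, head_dim 128, fp16 (2 bytes).
mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # full multi-head
gqa = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)   # grouped-query

print(f"MHA: {mha / 2**20:.2f} MiB/token -> {128_000 * mha / 2**30:.0f} GiB at 128k context")
print(f"GQA: {gqa / 2**20:.2f} MiB/token -> {128_000 * gqa / 2**30:.0f} GiB at 128k context")
```

The ~1 MB/token figure quoted above sits between these two layouts.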
The Solution: PagedAttention. Just like operating systems use paging to handle fragmented memory, vLLM uses PagedAttention: it breaks the KV cache into small fixed-size blocks.
- Near-Zero Fragmentation: Memory is allocated block by block, only when needed.
- Shared Memory: If multiple users are asking questions about the same document (e.g., a shared corporate policy), the KV cache for that document can be shared across multiple requests, saving gigabytes of VRAM.
5.3 Advanced Retrieval-Augmented Generation (RAG)
Standard RAG (retrieve and dump) often fails because the retrieved context is too noisy or poorly formatted.
The “Agentic RAG” Pipeline:
- Query Decomposition: If a user asks “Compare the Q3 earnings of Apple and Tesla,” the system splits this into two sub-queries: “Apple Q3 earnings” and “Tesla Q3 earnings.”
- Hybrid Search: Combining Vector Search (semantic similarity) with BM25 (keyword matching). This ensures that if a user searches for a specific part number, they find it exactly.
- HyDE (Hypothetical Document Embeddings): The system first asks an LLM to “write a fake answer.” It then uses that fake answer to search the vector database. This works because the fake answer is closer in “semantic space” to the real answer than the user’s question is.
- Reranking: After getting the top 100 documents from the Vector DB, we use a slower but more accurate Cross-Encoder model to re-score them and take the top 5.
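The hybrid-search and reranking steps can be sketched as follows. `rerank` is a stand-in for a cross-encoder scorer, and Reciprocal Rank Fusion (RRF) is one common (not the only) way to fuse the vector and BM25 rankings:

```python
# Sketch of hybrid retrieval: fuse vector-search and BM25 rankings with
# Reciprocal Rank Fusion (RRF), then rerank the fused candidates.
from typing import Callable, Dict, List

def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """RRF: score(d) = sum over rankings of 1 / (k + rank_of_d)."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def hybrid_search(
    vector_hits: List[str],    # ids from semantic search, best first
    keyword_hits: List[str],   # ids from BM25, best first
    rerank: Callable[[List[str]], List[str]],  # stand-in for a cross-encoder
    top_k: int = 5,
) -> List[str]:
    fused = rrf_fuse([vector_hits, keyword_hits])
    return rerank(fused)[:top_k]
```

A document that appears in both rankings (like a part number matched by BM25 and semantically) accumulates score from both lists and floats to the top.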
6. Implementation: The Scalable Streaming Orchestrator
Below is a production-grade Python implementation of an asynchronous orchestrator. It handles RAG, Context Management, and Streaming.
```python
import asyncio
import json
import logging
import time
from typing import AsyncGenerator, List

import aiohttp
import redis.asyncio as redis
from motor.motor_asyncio import AsyncIOMotorClient
from pydantic import BaseModel

# Set up logging for observability
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class Message(BaseModel):
    role: str
    content: str


class ChatConfig:
    MAX_CONTEXT_TOKENS = 4096
    RECOVERY_THRESHOLD = 0.8  # Summarize after 80% usage
    EMBEDDING_MODEL = "text-embedding-3-small"
    LLM_ENDPOINT = "http://inference-internal.ai.corp/v1/chat/completions"


class AdvancedOrchestrator:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.redis = redis.from_url("redis://localhost:6379/0", decode_responses=True)
        self.db = AsyncIOMotorClient("mongodb://localhost:27017").ai_platform
        self.history_key = f"messages:{session_id}"

    async def _get_recent_history(self) -> List[Message]:
        """Fetches and reconstructs message history with caching."""
        cached = await self.redis.lrange(self.history_key, 0, -1)
        if cached:
            return [Message(**json.loads(m)) for m in cached]
        # Fallback to persistent storage
        cursor = self.db.history.find({"session_id": self.session_id}).sort("ts", 1)
        history = [Message(role=d["role"], content=d["content"]) for d in await cursor.to_list(None)]
        # Rehydrate cache (RPUSH preserves chronological order; LPUSH would reverse it)
        if history:
            await self.redis.rpush(self.history_key, *[json.dumps(m.dict()) for m in history])
        return history

    async def _retrieve_context(self, query: str) -> str:
        """Semantic RAG step: Vector DB lookup."""
        # In a real system, this would call Pinecone/Milvus:
        # 1. Generate embedding
        # 2. Vector search
        # 3. Filter by metadata (namespace/user_id)
        return "Reference: The system uses PagedAttention for KV cache efficiency."

    async def stream_chat(self, user_input: str) -> AsyncGenerator[str, None]:
        """The main loop: Retrieve -> Augment -> Generate -> Persist."""
        # 1. Parallel history & context loading
        history, context_docs = await asyncio.gather(
            self._get_recent_history(),
            self._retrieve_context(user_input),
        )
        # 2. Construct the system prompt with RAG grounding
        system_prompt = f"You are a helpful assistant. Use this context: {context_docs}"
        messages = [{"role": "system", "content": system_prompt}]
        messages.extend([m.dict() for m in history])
        messages.append({"role": "user", "content": user_input})

        # 3. Streaming request to the inference cluster
        full_response = ""
        async with aiohttp.ClientSession() as session:
            payload = {
                "model": "llama-3-70b-pro",
                "messages": messages,
                "stream": True,
                "temperature": 0.2,
            }
            async with session.post(ChatConfig.LLM_ENDPOINT, json=payload) as resp:
                if resp.status != 200:
                    yield "Error: Service Unavailable"
                    return
                async for line in resp.content:
                    if not line:
                        continue
                    decoded_line = line.decode("utf-8").strip()
                    if decoded_line.startswith("data: "):
                        if decoded_line == "data: [DONE]":
                            break
                        try:
                            chunk = json.loads(decoded_line[6:])
                            token = chunk["choices"][0]["delta"].get("content", "")
                            full_response += token
                            yield token
                        except Exception as e:
                            logger.error(f"Parse error: {e}")

        # 4. Asynchronous state persistence (off the hot path)
        asyncio.create_task(self._persist_turn(user_input, full_response))

    async def _persist_turn(self, user_text: str, bot_text: str):
        """Heavy lifting for DB writes moved out of the hot path."""
        ts = time.time()
        turns = [
            {"session_id": self.session_id, "role": "user", "content": user_text, "ts": ts},
            {"session_id": self.session_id, "role": "assistant", "content": bot_text, "ts": ts + 0.1},
        ]
        # Write to Mongo
        await self.db.history.insert_many(turns)
        # Update Redis cache; keep only the last 25 turns (50 items)
        await self.redis.rpush(self.history_key, json.dumps(turns[0]), json.dumps(turns[1]))
        await self.redis.ltrim(self.history_key, -50, -1)
```
7. Scaling & Optimization: The Billion Dollar Infrastructure
7.1 Model Serving Strategies: Throughput vs Latency
In a massive system, you have different traffic types. We use Route Optimization to handle this:
| Traffic Type | Example | Optimization Strategy |
|---|---|---|
| B2C Chat | Daily user banter | High Throughput: Quantized (FP8/INT4) models, high batch sizes (256+). |
| Enterprise RAG | Legal document analysis | High Precision: Full FP16 precision, Speculative Decoding for speed. |
| API/Automation | Systematic data extraction | Batch Processing: Lower priority queues, utilized on “Spot Instances.” |
7.2 The Speculative Decoding Architecture
To achieve sub-50ms token latency on high-quality models, we use a “Draft-Verify” loop.
- Draft: A tiny 1B parameter model (the Drafter) guesses the next 5 tokens.
- Verify: The 70B model (the Oracle) checks all 5 tokens in a single parallel step.
- Result: If 4 of the 5 draft tokens are accepted, the large model has produced several tokens’ worth of output in a single forward pass instead of several sequential ones. Hardware throughput increases by roughly 1.8x to 2.5x.
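The draft-verify loop can be sketched abstractly. `draft` and `verify` are stand-ins for the small and large models; in the real algorithm the verifier also emits one corrected token whenever a guess is rejected, which is omitted here for brevity:

```python
# Toy draft-verify loop illustrating speculative decoding. The drafter
# guesses k tokens cheaply; the verifier scores them in ONE forward pass
# and we keep the longest accepted prefix.
from typing import Callable, List

def speculative_step(
    prefix: List[str],
    draft: Callable[[List[str], int], List[str]],   # small model: guess k tokens
    verify: Callable[[List[str], List[str]], int],  # big model: accepted-prefix length
    k: int = 5,
) -> List[str]:
    guesses = draft(prefix, k)
    accepted = verify(prefix, guesses)  # one parallel forward pass
    return prefix + guesses[:accepted]
```

When acceptance rates are high (drafter agrees with the big model most of the time), each big-model pass yields several tokens instead of one.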
7.3 Multi-Modal Data Flow: Handling Images & PDFs
When a user uploads an image (“What’s in this picture?”), the system doesn’t just send pixels to the LLM.
- Vision Encoder: A model like CLIP or a ViT (Vision Transformer) converts the image into a 1024-dimension vector.
- Projection Layer: A small bridge network maps the image vector into the “Word Embedding” space of the LLM.
- Interleaved Inference: The LLM treats the image as a sequence of “Visual Tokens” similar to text tokens.
8. Database Architecture: Trillion-Row Scalability
For chat history, RDBMS (PostgreSQL/MySQL) will fail due to the sheer volume of writes and the need for global replication.
8.1 The ScyllaDB/Cassandra Schema
We use a Wide Column Store approach.
- Partition Key: `session_id` (ensures all messages for one chat live on the same physical node).
- Clustering Key: `message_id` (a TIMEUUID, sorted by time for efficient range queries).
```sql
CREATE TABLE chat_history (
    session_id UUID,
    message_id TIMEUUID,
    role       TEXT,   -- 'user', 'assistant', 'system'
    content    TEXT,
    metadata   TEXT,   -- JSON-encoded token counts, model versions (Cassandra has no JSONB type)
    PRIMARY KEY (session_id, message_id)
) WITH CLUSTERING ORDER BY (message_id ASC);
```
Global Distribution:
We use Multi-Region Replication. If a user travels from NY to London, the geo-aware load balancer directs them to the `eu-west-1` stack. The data is replicated asynchronously between regions, achieving < 500ms cross-continent sync.
9. Monitoring & Guardrails: Safety at Scale
9.1 The “Jailbreak” Detection Pipeline
We implement a Three-Layer Guardrail:
- Pre-Inference: A regex + FastText model checks for “forbidden words” and dangerous patterns (e.g., prompt injection).
- In-Inference: Token-level stopping. If the model starts generating a string like `curl http://...`, the inference engine kills the stream immediately.
- Post-Inference: A “Moderation API” (like OpenAI’s moderation endpoint) checks the full response before it’s finalized in the DB.
9.2 Observability Dashboard
Principal engineers look at Distributions, not averages.
- P99 TTFT: The experience of the unluckiest 1% of users. If this spikes, your inference queue is saturated.
- Token Convergence: Ratio of generated tokens to prompt tokens. Helps in calculating the ROI of the system.
- Prompt-Cache Hit Rate: % of queries that reused a previously computed KV cache.
10. Cost Analysis: The Unit Economics of AI
| Item | Cost (H100 instance) | Monthly (scaled) |
|---|---|---|
| Model Weights | $0 (Open Source / Llama) | $0 |
| GPU Rental (8x H100) | $24 / hour | $17,280 / node |
| Typical Cluster (100 Nodes) | $2,400 / hour | $1.7M / month |
| Token Cost (Internal) | ~$0.005 / 1k tokens | Depends on usage |
How to reduce this?
- Distillation: Train a smaller model to act as a specialist for common tasks (e.g., “Summarization”).
- Quantization: Moving from 16-bit to 4-bit weights reduces RAM needs by 75% without significant quality loss.
11. Real-World Failure Modes & Mitigation
- The Thundering Herd: 1M users all message at once (e.g., during a major event).
- Mitigation: Adaptive Rate Limiting. Tier users based on subscription status or historical value.
- Context Poisoning: User provides a 50-page PDF that “brainwashes” the model into giving bad advice.
- Mitigation: Isolated RAG environments. The model doesn’t “know” the document is true; it is instructed to treat it as “External provided info.”
- Silent Drift: The model’s answers become shorter and less helpful over time.
- Mitigation: Weekly Golden Dataset benchmarking. Compare production outputs against a set of 1,000 “Perfect Answers.”
12. Future Outlook: The Agentic Web
We are moving away from “Chat” and toward “Task Orchestration.” The next version of this architecture replaces the Chat Orchestrator with an Agent Loop.
- Tools: Browsers, Python interpreters, and API clients.
- Reasoning: ReAct (Reason + Act) patterns where the model thinks, acts, observes, and repeats.
- Long-term Personalization: Storing user “Archetypes” in Knowledge Graphs.
14. Component Deep-Dive: The API Gateway & Connection Management
In a standard web app, HTTP is stateless. In a chatbot, we need Streaming (SSE) or Persistent (WebSockets) connections. This introduces the “Sticky Session” problem at the Load Balancer level.
14.1 Handling 1 Million Concurrent WebSockets
When you have a million users connected via WebSockets, you cannot just reboot a load balancer. Every disconnection triggers a “Thundering Herd” where a million clients try to reconnect simultaneously, potentially DOS-ing your Auth service.
Principal Engineer’s Approach:
- Consistent Hashing: Use a consistent hashing ring at the Load Balancer (e.g., Nginx with `ip_hash` or Envoy with `maglev`). This ensures that if a specific gateway node stays up, the user session stays on that node.
- Connection Bloat: Each open WebSocket takes ~10-50KB of memory on the server. A single 64GB RAM node can theoretically handle ~1M connections, but context-switching and TCP overhead usually cap this at 100k-200k.
- Backpressure: If the LLM serving engine is slow, the Orchestrator must apply backpressure to the Frontend. We use a Token Bucket Algorithm per user session to ensure no one user starves the system.
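A minimal in-process token bucket of the kind described above; `rate` and `capacity` are illustrative knobs, and production versions usually keep the bucket state in Redis so it survives gateway restarts:

```python
# Minimal per-session token bucket for backpressure: each unit of work
# consumes one token; sessions that exceed their refill rate are throttled.
import time

class TokenBucket:
    def __init__(self, rate: float, capacity: float):
        self.rate = rate            # tokens refilled per second
        self.capacity = capacity    # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self, cost: float = 1.0) -> bool:
        # Lazily refill based on elapsed time, then try to spend.
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

The orchestrator calls `allow()` before forwarding each generation request; a `False` maps to an HTTP 429 or a slowed stream.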
14.2 The “Graceful Shutdown” Pattern
To avoid the Thundering Herd during deployments:
- The Gateways stop accepting new connections.
- Existing connections are slowly closed over a 15-minute window (e.g., closing 1% of connections every 10 seconds).
- Clients implement Exponential Backoff with Jitter on reconnections to spread the load.
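The client-side step is commonly implemented as “full jitter” backoff, sketched here (the base and cap values are illustrative):

```python
# "Full jitter" reconnect backoff: sleep a uniform random amount between
# 0 and min(cap, base * 2**attempt). The randomness spreads a million
# reconnecting clients evenly instead of in synchronized waves.
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))

# Client loop (pseudocode): on disconnect, sleep(backoff_delay(n)) then retry.
```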
15. The “Memory” System: From Silicon to Intelligence
Static RAG is often called “Short-term Memory.” But what if a user told you their name in 2023 and expects you to remember it in 2025?
15.1 Hierarchical Memory Architecture
| Memory Tier | Storage Medium | Latency | Purpose |
|---|---|---|---|
| L1: Context | GPU VRAM | < 1ms | Immediate tokens in the current sliding window. |
| L2: Session Cache | Redis | 2-5ms | Full history of the current chat session. |
| L3: Profile Store | DynamoDB / Cassandra | 10-20ms | Long-term user preferences (Extracted via LLM). |
| L4: Knowledge | Vector DB (Pinecone) | 50-100ms | Global facts and uploaded documents. |
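The read path over these tiers can be sketched as a simple fall-through cascade; the tier fetchers here are stand-ins for the real Redis/DynamoDB/Vector DB clients:

```python
# Tiered memory lookup: probe each tier in latency order (L1 -> L4) and
# fall through to the next tier on a miss.
from typing import Callable, List, Optional, Tuple

def tiered_lookup(
    key: str,
    tiers: List[Tuple[str, Callable[[str], Optional[str]]]],  # (name, fetch), fastest first
) -> Tuple[Optional[str], str]:
    """Returns (value, tier_name) from the first tier that holds the key."""
    for name, fetch in tiers:
        value = fetch(key)
        if value is not None:
            return value, name
    return None, "miss"
```

A real implementation would also write the value back into the faster tiers on a hit (read-through caching).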
15.2 MemGPT Concept: The OS for LLMs
Inspired by the MemGPT research, we design our orchestrator to act like an Operating System:
- Virtual Memory: The model can “Swap” context to the database and “Page” it back in when needed.
- Interrupts: The system can trigger a “Recall” function if it detects the user is asking about a past event that is not in the current context.
16. The Evaluation Pipeline: The Secret Sauce of Quality
A model that passes benchmarks (MMLU) might still be a terrible chatbot. We need a Systemic Evaluation pipeline.
16.1 LLM-as-a-Judge
We use a “Gold Standard” model (e.g., GPT-4o) to grade our “Candidate” model (e.g., Llama-3 fine-tune). Rubric-based Grading:
- Faithfulness: Does the answer stay within the provided context? (Score 1-5)
- Relevance: Does it actually answer the user’s question? (Score 1-5)
- Tone: Is the persona consistent? (Score 1-5)
16.2 A/B Testing in AI
Traditional A/B testing measures clicks. AI A/B testing measures User Retention and Session Depth.
- Experiment A: Context window of 8k tokens.
- Experiment B: Context window of 16k tokens.
- Metric: Is the user more likely to ask a follow-up question in Group B? If so, the higher GPU cost of B might be justified by higher engagement.
17. Deployment & Infrastructure: Kubernetes for AI
Deploying AI on K8s requires specialized resource management. Standard HPA (Horizontal Pod Autoscaler) on CPU is insufficient.
17.1 The “Custom Metrics” Autoscaler
We use KEDA (Kubernetes Event-driven Autoscaling) to scale GPU pods based on:
- Request Queue Length: How many users are waiting for a token?
- Inference Throughput: Are our GPUs hitting their FLOP limits?
17.2 Tainting and Tolerations
GPU nodes are expensive. We “Taint” them so that standard web services (like the dashboard) don’t accidentally land on an H100 node.
- Taint: `dedicated=gpu:NoSchedule`
- Toleration: Only the inference engine pods carry the matching toleration, so only they can be scheduled on these nodes.
17.3 Multi-Cluster Fleet Management
OpenAI or Meta don’t run in a single region. They use a Global Traffic Manager (GTM).
- If `us-east-1` is at 90% GPU capacity, incoming requests are routed to `eu-west-1` at the cost of slight network latency (100ms vs 10ms).
- Cross-Region KV Caching: We replicate the first part of common prompts (system prompts) globally so that the KV cache starts “hot” even in a new region.
18. Training & Fine-Tuning: The RLHF Cycle
To get that “ChatGPT feel,” the base model must undergo Alignment.
18.1 Supervised Fine-Tuning (SFT)
We use a high-quality dataset of ~100k “Perfect Conversations.” We use LoRA (Low-Rank Adaptation) to train only 1% of the model’s weights, which is 10x faster and requires 4x less VRAM.
18.2 Reinforcement Learning from Human Feedback (RLHF)
- Ranking: Humans are shown two model outputs and asked to pick the better one.
- Reward Model: A model is trained to predict the human preference.
- PPO/DPO: The chatbot model is trained to maximize the “Reward” from the Reward Model. DPO (Direct Preference Optimization) is often preferred in production now as it is more stable and doesn’t require a separate reward model during the main training loop.
19. Security, Privacy & Ethics: The Fortress Revisited
19.1 Differential Privacy in Training
To prevent the model from memorizing a user’s specific password from the history DB, we add Gaussian Noise to the gradients during fine-tuning. This mathematically guarantees that no single user’s data can be perfectly reconstructed.
19.2 The “Red Teaming” Infrastructure
We hire security researchers to try and “break” the model. We then automate their successful attacks into a Regression Test Suite. Every new model version must pass these “Jailbreak Tests” before deployment.
20. Conclusion: The AI-First OS
Designing a chatbot at scale is the ultimate test of a system designer. It requires the precision of a high-frequency trading system, the scale of a global social network, and the nuance of a philologist.
As the “Principal Engineer” of this system, your goal is not just to build a bot, it is to build the Interface of the Future. A system that doesn’t just respond to text, but understands intent, anticipates needs, and operates with the reliability of a utility.
21. The Tokenizer System: The Alphabet of Machines
Before an LLM can understand text, it must be “tokenized.” A common mistake is thinking tokens = words. In reality, tokens are sub-word units.
21.1 Byte Pair Encoding (BPE)
Most modern chatbots use BPE.
- The Problem: If you use a vocabulary of full words, the vocabulary size becomes infinite. If you use characters, the sequence length becomes too long, exceeding the model’s context window.
- The Solution: Start with characters and iteratively merge the most frequent pairs.
- Example: “low”, “lower”, “newest”, “widest”. BPE might merge “e” and “s” into “es”, then “es” and “t” into “est”.
- Impact on Multi-lingual Performance: If your tokenizer is trained primarily on English, a single Chinese character might take 3-4 tokens, making the model “more expensive” and “dumber” for Chinese users. Principal engineers ensure the tokenizer is trained on a diverse, representative corpus.
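The merge procedure on the example words above can be sketched in a few lines; this is a toy illustration of one BPE merge step, not a production tokenizer:

```python
# Minimal BPE merge step on the classic example corpus: find the most
# frequent adjacent symbol pair and merge it everywhere.
from collections import Counter
from typing import Dict, List, Tuple

def most_frequent_pair(words: Dict[Tuple[str, ...], int]) -> Tuple[str, str]:
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words: Dict[Tuple[str, ...], int], pair: Tuple[str, str]):
    merged = {}
    for symbols, freq in words.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])  # merge e.g. "e"+"s" -> "es"
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

corpus = {tuple("low"): 5, tuple("lower"): 2, tuple("newest"): 6, tuple("widest"): 3}
pair = most_frequent_pair(corpus)   # ("e","s") or ("s","t") -- both occur 9 times here
corpus = merge_pair(corpus, pair)
```

Repeating this loop a few thousand times yields the merge table a real tokenizer ships with.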
22. Streaming Architecture: SSE vs WebSockets
Why does ChatGPT use Server-Sent Events (SSE) instead of WebSockets?
| Feature | Server-Sent Events (SSE) | WebSockets |
|---|---|---|
| Direction | Unidirectional (Server -> Client) | Bidirectional |
| Protocol | Standard HTTP | HTTP Upgrade to a persistent TCP socket |
| Ease of Load Balancing | High (It’s just HTTP) | Medium (Requires sticky sessions) |
| Reconnection | Automatic by the browser | Must be handled in JS |
The Decision: Since a chatbot response is mostly a long stream from the server, SSE is simpler and works better with existing HTTP infrastructure (CDNs, WAFs). We only use WebSockets if we need complex multi-player features (e.g., collaborative editing).
23. Implementing the Guardrail: The “Moderator” Pattern
Below is a Python snippet showing how to implement a safety layer that uses a smaller model to audit the main model’s output in real-time.
```python
import aiohttp
from typing import AsyncGenerator

# Assumed internal endpoint for the safety classifier service.
SAFETY_SERVICE_URL = "http://safety-internal.ai.corp/v1/classify"


class SafetyGuardrail:
    def __init__(self, admin_model: str = "llama-guard-3"):
        self.admin = admin_model

    async def is_safe(self, text: str) -> bool:
        """
        Calls a specialized safety model to classify text.
        Returns False if the text violates safety policies (e.g., violence, PII).
        """
        # Optimized for < 50ms latency
        payload = {"prompt": f"[INST] Is this text safe? {text} [/INST]"}
        async with aiohttp.ClientSession() as session:
            async with session.post(SAFETY_SERVICE_URL, json=payload) as resp:
                result = await resp.json()
                return result["label"] == "safe"

    async def secure_stream(self, generator: AsyncGenerator[str, None]):
        """
        Wraps a token generator. Buffers tokens until a complete
        sentence/phrase is formed, audits it, then yields it.
        """
        buffer = ""
        async for token in generator:
            buffer += token
            # Audit at sentence boundaries (period/newline/question/exclamation)
            if any(char in token for char in (".", "\n", "?", "!")):
                if await self.is_safe(buffer):
                    yield buffer
                    buffer = ""
                else:
                    yield "[CONTENT REJECTED FOR SAFETY]"
                    return
        # Yield any remaining safe buffer
        if buffer and await self.is_safe(buffer):
            yield buffer
```
24. Advanced Vector Search: Product Quantization (PQ)
As your Knowledge Base grows to millions of documents, storing raw 1024-dimensional vectors in RAM becomes impossibly expensive.
The Compression Hack: Product Quantization
- Split: Divide a 1024-dimension vector into 8 sub-vectors of 128 dimensions each.
- Cluster: For each sub-vector space, find 256 “centroids” using K-Means.
- Assign: Replace each 128-dimensional sub-vector with its nearest centroid ID (1 byte).
- Result: Your 1024-dimension vector (originally 4096 bytes) is now just 8 bytes.
- Trade-off: You lose ~1-2% in retrieval precision but save 99% in RAM and gain 10x in search speed.
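The four steps translate directly into code. A sketch of the size arithmetic plus a toy encoder; `codebooks` is assumed to come from an offline K-Means pass (step 2), which is omitted here:

```python
# Product Quantization: compress a 1024-dim fp32 vector (4096 bytes)
# into 8 centroid ids (8 bytes) by encoding each sub-vector separately.
from typing import List, Sequence

DIM, SUBVECTORS = 1024, 8
SUB_DIM = DIM // SUBVECTORS          # 128 dims per sub-vector
raw_bytes = DIM * 4                  # fp32: 4 bytes per dimension
pq_bytes = SUBVECTORS * 1            # one centroid id (0-255) per sub-vector

def pq_encode(vector: Sequence[float],
              codebooks: List[List[Sequence[float]]]) -> List[int]:
    """codebooks[i] holds up to 256 K-Means centroids for sub-space i."""
    codes = []
    for i in range(SUBVECTORS):
        sub = vector[i * SUB_DIM:(i + 1) * SUB_DIM]
        # Assign the sub-vector to its nearest centroid (squared L2 distance).
        dists = [sum((a - b) ** 2 for a, b in zip(sub, c)) for c in codebooks[i]]
        codes.append(dists.index(min(dists)))
    return codes

print(f"{raw_bytes} bytes -> {pq_bytes} bytes ({raw_bytes // pq_bytes}x compression)")
```

Search then compares query sub-vectors against the small centroid tables instead of the raw vectors, which is where the speedup comes from.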
25. Data Governance: The Right to be Forgotten in AI
GDPR Article 17 (Right to Erasure) is a nightmare for AI.
- The Easy Part: Deleting the user’s chat history from the Cassandra database.
- The Hard Part: LLMs can “memorize” data through fine-tuning. If a user’s data was used in SFT (Supervised Fine-Tuning), you cannot “un-train” it easily.
- The Solution:
- Data TTL: We don’t use raw chat data for training until it’s at least 30 days old.
- In-Memory Erasure: If a user deletes their account, we add their `user_id` to a “Negative Bloom Filter” in the Orchestrator. The model is then instructed (via system prompt) to never reference anything from that user’s history, even if it’s stored in embeddings.
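A minimal Bloom filter of the kind the “negative” check relies on; the bit-array size and hash count are illustrative. It answers “definitely not deleted” or “possibly deleted”, and the latter triggers an exact lookup:

```python
# Minimal Bloom filter: k hash positions per item over a fixed bit array.
# No false negatives; rare false positives (tunable via size and hashes).
import hashlib

class BloomFilter:
    def __init__(self, size_bits: int = 1 << 20, hashes: int = 4):
        self.size = size_bits
        self.hashes = hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item: str):
        # Derive k independent positions by salting a cryptographic hash.
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))
```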
26. Case Study: The “Thundering Herd” at a Global AI Launch
When a major AI company launched its mobile app, they saw a 100x spike in traffic within 10 minutes.
What Failed?
- The Auth Service: The sudden burst of JWT validations crashed the internal identity provider.
- The GPU Scheduler: Kubernetes tried to spin up 1,000 GPU nodes. The cloud provider’s API hit its “Rate Limit,” causing nodes to hang in the `Pending` state.
The Fix:
- Circuit Breakers: The API Gateway started serving a “Waitlist” page to non-premium users, protecting the infrastructure for paying customers.
- Warm Pools: They now maintain a “Warm Pool” of 10% extra GPU capacity at all times, avoiding the 5-minute cold-start time of H100 nodes.
27. The Future: Multi-Agent Systems & MoE
The next evolution of ChatGPT is Mixture of Experts (MoE).
- Instead of one giant dense model, each MoE layer contains 8 or 16 “Expert” sub-networks, and only a few of them run per token.
- For a math question, the “Router” sends the query to the Math Expert.
- Result: You get the performance of a model with a very large total parameter count (GPT-4 is rumored to be ~1.7 trillion parameters) with the inference cost of a ~100B-parameter model, because only the routed experts are activated.
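The gating step can be sketched as follows: softmax over per-expert logits, keep the top-k, renormalize their weights. The expert networks themselves are omitted; this only illustrates the routing math:

```python
# Toy MoE router: pick the top-k experts per query from a softmax gate.
# Only the selected experts run, which is where the inference savings are.
import math
from typing import List, Tuple

def route(gate_logits: List[float], top_k: int = 2) -> List[Tuple[int, float]]:
    """Returns [(expert_index, weight)] for the top_k experts, weights summing to 1."""
    m = max(gate_logits)
    exp = [math.exp(g - m) for g in gate_logits]          # numerically stable softmax
    probs = [e / sum(exp) for e in exp]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]
```

The final layer output is the weighted sum of the selected experts’ outputs using these renormalized weights.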
28. Final Summary Checklist for the Principal Engineer
| Category | Priority | Item |
|---|---|---|
| Latency | P0 | Time To First Token (TTFT) < 300ms. |
| Memory | P0 | PagedAttention implemented for KV cache. |
| History | P1 | Cassandra/ScyllaDB with sharded session_id. |
| RAG | P1 | Reranking layer (Cross-Encoders) included. |
| Security | P2 | PII scrubbing and Moderator model active. |
| Scale | P2 | Global traffic management with cross-region replication. |
29. Distributed Training: The Forge of Intelligence
To build a model that can chat like a human, you first need to train it. Training a 70B+ model is a feat of distributed engineering that requires a perfectly synchronized cluster.
29.1 The Three Pillars of Parallelism
- Data Parallelism (DP): Each GPU has a copy of the model, but sees different data. Synchronizing gradients via `All-Reduce` becomes the bottleneck at scale.
- Tensor Parallelism (TP): A single matrix multiplication is split across multiple GPUs. This requires ultra-low latency (InfiniBand/NVLink) between GPUs within a single node.
- Pipeline Parallelism (PP): Layers are split across different nodes. To avoid idle GPUs (“Bubbles”), we use micro-batching and 1F1B (One Forward, One Backward) scheduling.
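The Data Parallelism step above can be sketched in a few lines: each worker computes gradients on its own shard, then an All-Reduce averages them so every replica applies the identical update. Toy lists stand in for real gradient tensors:

```python
def all_reduce_mean(per_gpu_grads):
    """Average gradients across workers -- the collective at the heart
    of data-parallel training. After this, every replica holds the
    same averaged gradient and steps in lockstep."""
    n = len(per_gpu_grads)
    dim = len(per_gpu_grads[0])
    return [sum(g[i] for g in per_gpu_grads) / n for i in range(dim)]

# 4 GPUs, each with a 2-element gradient from its own data shard.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
avg = all_reduce_mean(grads)  # every GPU now holds [4.0, 5.0]
```

The communication volume of this step scales with model size, not batch size, which is why DP alone hits a wall for 70B+ models.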
30. Graph RAG: Beyond Semantic Similarity
Vector databases (semantic search) are good at finding “similar” things, but they are terrible at “global reasoning.”
- Query: “How has the user’s opinion on climate change evolved over the last 10 sessions?”
- Vector Search Failure: It will find individual sentences about climate change, but it won’t understand the chronological relationship between them.
The Solution: Knowledge Graphs
By using a Graph Database (Neo4j), we can store entities as nodes (User, Topic: Climate Change) and interactions as edges (Opinion: Skeptical, Date: 2024-01-01).
- LLM as a Graph Constructor: The model extracts entities and relationships from every chat.
- Cypher Query Generation: When a complex question is asked, the orchestrator generates a graph query to traverse the user’s history conceptually.
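As a minimal sketch of the idea (Python lists stand in for Neo4j, and the user, topic, and stance values are invented), the chronological traversal that a flat vector index cannot do looks like this:

```python
from datetime import date

# Toy knowledge graph: (user) -[opinion]-> (topic), edges carry a date.
edges = [
    ("alice", "climate_change", {"stance": "skeptical", "date": date(2023, 1, 5)}),
    ("alice", "climate_change", {"stance": "neutral",   "date": date(2023, 8, 2)}),
    ("alice", "climate_change", {"stance": "concerned", "date": date(2024, 3, 1)}),
]

def opinion_trajectory(user, topic):
    """Traverse the graph in date order -- the 'global reasoning'
    a similarity search over isolated sentences cannot recover."""
    hits = [e[2] for e in edges if e[0] == user and e[1] == topic]
    return [h["stance"] for h in sorted(hits, key=lambda h: h["date"])]

traj = opinion_trajectory("alice", "climate_change")
```

In the real system, the orchestrator would emit the equivalent Cypher query against Neo4j rather than scanning an edge list.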
31. Human-in-the-Loop (HITL): The Quality Lab
The most important part of AI development is not the model code; it is the Data Quality.
31.1 Designing the Labeling Dashboard
At Google and Meta, we build custom internal tools for RLHF.
- Blind Comparison: Labelers are shown two responses without knowing which model produced them.
- Granular Feedback: Labelers highlight specific parts of the text that are “Hallucinations” or “Toxic.”
- Active Learning: The system automatically identifies “High Uncertainty” queries (where the model is confused) and prioritizes them for human labeling.
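The Active Learning step can be sketched with simple entropy-based uncertainty sampling. The queries and probability vectors below are invented, and production systems use richer uncertainty signals, but the prioritization logic is the same:

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution: higher means
    the model is less sure which answer is right."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def prioritize_for_labeling(queries, k=2):
    """Send the highest-entropy (most confusing) queries
    to human labelers first."""
    return sorted(queries, key=lambda q: entropy(q["probs"]), reverse=True)[:k]

queue = [
    {"text": "2+2?",             "probs": [0.98, 0.01, 0.01]},  # confident
    {"text": "Is this sarcasm?", "probs": [0.51, 0.49]},        # confused
    {"text": "Capital of FR?",   "probs": [0.95, 0.05]},
]
batch = prioritize_for_labeling(queue, k=1)
```

This is how the labeling budget gets spent where the model is weakest instead of uniformly at random.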
32. Model Distillation: AI in Your Pocket
Serving a 70B model costs thousands of dollars a month per user. How do we make it cheap?
Knowledge Distillation
- The Teacher: A 405B parameter model (e.g., Llama 3.1 405B).
- The Student: A 1B parameter model.
- The Process: The student is trained not just on the text, but on the probability distribution (logits) of the teacher. It’s like a student learning not just the answer, but the teacher’s thought process.
- Edge Result: A 1B model distilled from a 405B model can often outperform a natively trained 7B model while being small enough to run on an iPhone 15 Pro.
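A minimal sketch of the distillation objective, assuming the standard temperature-softened KL divergence between teacher and student distributions (the logits here are toy values, not from any real model):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-softened softmax; T > 1 exposes the teacher's
    relative confidence across *all* tokens, not just the argmax."""
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distill_loss(teacher_logits, student_logits, T=2.0):
    """KL(teacher || student): the student is penalized for failing
    to match the teacher's full probability distribution."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

teacher = [4.0, 1.0, 0.5]
aligned = distill_loss(teacher, [4.1, 0.9, 0.6])  # student mimics the teacher
wrong   = distill_loss(teacher, [0.5, 4.0, 1.0])  # student disagrees
```

Note that the student is graded on the whole distribution: two students with the same top-1 answer get different losses if only one also matches the teacher’s uncertainty.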
33. The Philosophy of Infinite Context
We are entering the era of “Infinite Context” (e.g., Gemini 1.5 Pro). From a system design perspective, this changes everything.
- Ring Attention: A technique to compute attention across different GPUs in a circular fashion, allowing the context window to scale to millions of tokens.
- The New RAG: If you can fit a 1-million-token book into the prompt, do you still need a Vector DB?
- Answer: Yes, because processing 1M tokens every turn is too slow and expensive ($20 per query). RAG remains the “Fast Cache” for information.
34. Final Reflection: The Architect’s Responsibility
As we build these systems, we aren’t just shifting bits; we are building entities that will manage our schedules, write our code, and perhaps even comfort us in our loneliness.
The “System Design” of a chatbot is ultimately the design of a Human-Machine Relationship. Reliability is not just about uptime; it’s about Trust. Accuracy is not just about metrics; it’s about Truth.
Architecture is destiny. If you build a system that is brittle, the intelligence it contains will be brittle. If you build a system that is robust, transparent, and respectful of the user, you are contributing to a future where AI is not a mystery, but a partner.
FAQ
How does PagedAttention solve the KV cache memory problem for LLM serving?
PagedAttention splits the KV cache into small physical blocks of 16 tokens, allocated dynamically like OS virtual memory pages. This eliminates the waste from pre-allocating maximum sequence lengths (a 17-token request no longer wastes 2,031 tokens of VRAM in a 2,048 context). Shared system prompts across requests can point to the same physical pages, dramatically increasing the number of concurrent users per GPU.
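The arithmetic behind that claim can be checked directly, assuming vLLM’s default block size of 16 tokens:

```python
BLOCK = 16  # tokens per physical KV-cache block (vLLM's default)

def blocks_needed(tokens):
    return -(-tokens // BLOCK)  # ceiling division

def wasted_tokens_paged(tokens):
    """Waste under PagedAttention: at most BLOCK - 1 slots
    in the final, partially filled block."""
    return blocks_needed(tokens) * BLOCK - tokens

def wasted_tokens_naive(tokens, max_ctx=2048):
    """Waste under naive pre-allocation of the full context window."""
    return max_ctx - tokens

req = 17                            # a 17-token request
naive = wasted_tokens_naive(req)    # 2,031 dead slots
paged = wasted_tokens_paged(req)    # 2 blocks of 16 -> only 15 dead slots
```

Waste drops from roughly 99% of the allocation to under 50% of a single block, which is where the extra concurrent users per GPU come from.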
What is speculative decoding and how does it speed up LLM inference?
A small 1B parameter drafter model predicts 5 tokens quickly, then the large 70B model verifies all 5 in a single parallel forward pass. Since verification costs roughly the same as generating one token, each batch of accepted drafts saves multiple full generation round-trips. This typically achieves 1.8x to 2.5x throughput improvement without any quality loss.
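A toy draft-and-verify loop illustrates the control flow. The lambda “models” over integer token ids are stand-ins; a real system runs the target’s verification as one batched forward pass rather than a Python loop:

```python
def speculative_step(draft_next, target_next, context, k=5):
    """One speculative-decoding step: the cheap drafter proposes k
    tokens; the big model verifies them left-to-right and keeps the
    agreeing prefix, substituting its own token at the first mismatch."""
    draft, ctx = [], list(context)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)
    # Conceptually a single parallel forward pass over all k positions.
    accepted, ctx = [], list(context)
    for t in draft:
        if target_next(ctx) == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(target_next(ctx))  # target's correction
            break
    return accepted

# Toy models: the drafter matches the target except when the
# context length is a multiple of 4.
target = lambda ctx: len(ctx)
drafter = lambda ctx: len(ctx) if len(ctx) % 4 else len(ctx) + 1

out = speculative_step(drafter, target, context=[0], k=5)
```

Even on a mismatch the step is never wasted: the target’s correction token is emitted, so output quality is identical to running the big model alone.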
How do you manage chat history at scale for 200M daily active users?
Use a wide-column store like Cassandra or ScyllaDB with session_id as partition key and time-sorted message_id as clustering key. Redis caches recent session history (last 25 turns) for low-latency access. Multi-region replication ensures consistency when users travel across continents. At 10 billion messages per day, storage grows by petabytes and requires careful sharding strategies.
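A toy in-memory sketch of that layout, where a dict stands in for a Cassandra partition and message ids are assumed to be time-ordered (as with TimeUUIDs):

```python
import bisect
from collections import defaultdict

# Wide-column layout: partition key = session_id,
# clustering key = time-ordered message_id (ascending).
table = defaultdict(list)  # session_id -> sorted [(message_id, text)]

def write_message(session_id, message_id, text):
    """Insert keeps the clustering order, like Cassandra's
    on-disk sorted rows within a partition."""
    bisect.insort(table[session_id], (message_id, text))

def last_n_turns(session_id, n=25):
    """The hot-path read Redis caches: one partition, tail slice."""
    return table[session_id][-n:]

write_message("s1", 3, "hi")
write_message("s1", 1, "hello")
write_message("s1", 2, "hey")
recent = last_n_turns("s1", n=2)
```

Because a session’s entire history lives in one partition, the “last N turns” read touches a single node regardless of total cluster size.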
What is agentic RAG and how does it improve chatbot accuracy?
Agentic RAG goes beyond simple retrieve-and-dump by decomposing complex queries into sub-queries, combining vector search with BM25 keyword matching for hybrid retrieval, using HyDE (Hypothetical Document Embeddings) to generate fake answers for better semantic search, and applying cross-encoder reranking to select the top 5 most relevant documents from an initial pool of 100.
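One common way to merge the vector and BM25 result lists before cross-encoder reranking is Reciprocal Rank Fusion (RRF); a minimal sketch with invented document ids:

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge multiple rankings: each document scores 1/(k + rank)
    per list it appears in, so items ranked well by *both* retrievers
    float to the top. k=60 is the conventional damping constant."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic search results
bm25_hits   = ["doc_b", "doc_d", "doc_a"]  # keyword search results
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Here doc_a and doc_b appear in both lists and outrank documents found by only one retriever; the fused list is then the candidate pool handed to the cross-encoder.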
Originally published at: arunbaby.com/ml-system-design/0064-chatbot-system-design