34 minute read

“Building a chatbot that responds is easy. Building a conversational system that remembers, reasons, and scales to millions of concurrent users without melting your GPU cluster is an engineering feat of the highest order.”

TL;DR

A production chatbot at 200M DAU requires three architectural layers: ingress (API gateway with WebSocket/SSE support), orchestration (context management, RAG retrieval, prompt construction), and intelligence (GPU inference with vLLM and PagedAttention). The math is staggering: 10 billion daily requests, 300K peak RPS, and 5TB of new text per day. PagedAttention enables efficient KV cache management, speculative decoding with a small drafter model multiplies throughput by 2x, and a hierarchical memory system from VRAM to vector databases provides both session context and long-term recall. For the optimization techniques that power the inference layer, see the inference optimization guide and the advanced caching strategies that reduce GPU load.

A telephone switchboard with dozens of patch cords connecting different jacks

1. Introduction: Beyond the Text Box

When we talk about a “Chatbot” today, we aren’t talking about the rule-based decision trees of 2015. We are talking about Large Language Model (LLM) powered entities that function as interfaces to human knowledge. From an engineering perspective, a system like ChatGPT or Claude is not just a model; it is a complex, stateful, multi-modal distributed system.

At the intersection of Google’s search infrastructure and Meta’s social graph lies the challenge of the modern Chatbot: it must be as fast as a search engine, as personal as a private message, and as reliable as a banking transaction.

In this deep dive, we will architect a chatbot system capable of handling 200 million daily active users (DAU), supporting real-time streaming, long-term memory, and Retrieval-Augmented Generation (RAG). This is the “Principal Engineer’s” blueprint for the most important software paradigm shift of the decade.


2. Problem Statement & Requirements

To design a system, we must first define the bounds of the sandbox.

2.1 Functional Requirements

  1. Low-Latency Conversational Interface: Users expect “streaming” responses (Time To First Token < 500ms).
  2. Stateful Conversations: The system must maintain context over multiple turns (Chat History).
  3. Multi-Modal Support: Ability to process text, images, and documents (PDFs) within the conversation.
  4. Retrieval-Augmented Generation (RAG): The chatbot must be able to “search” for facts outside its training data to provide accurate, up-to-date information.
  5. Multi-Platform Access: Web, mobile, and API access with consistent state.
  6. Human-like Personality & Safety: Guardrails against toxic content, PII leakage, and hallucinations.

2.2 Non-Functional Requirements

  1. High Availability: 99.9% uptime (global availability).
  2. Scalability: Handling 100k+ requests per second (RPS) during peak times.
  3. Consistency: Chat history must be consistent across devices.
  4. Cost Efficiency: Managing the high cost of GPU inference through optimizations.
  5. Data Security: Strict adherence to GDPR/CCPA, ensuring user data privacy.

2.3 Back-of-the-Envelope Calculations (The Reality Check)

  • Users: 200M DAU.
  • Average Sessions per User: 5 sessions/day.
  • Average Turns per Session: 10 turns.
  • Total Daily Inference Requests: 200M * 5 * 10 = 10 Billion requests/day.
  • Requests Per Second (Average): 10B / 86400 ≈ 115,000 RPS.
  • Peak RPS: ~2.5x average ≈ 300,000 RPS.
  • Storage: 10B turns/day * 500 characters/turn ≈ 5 Terabytes/day of text history.
  • Compute: A single H100 with vLLM might handle ~50-100 concurrent streams depending on model size (e.g., Llama-3 70B). We need thousands of GPUs.
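These figures are simple arithmetic, so they are worth sanity-checking; a quick script using exactly the assumptions above:

```python
# Back-of-envelope sanity check for the numbers above.
DAU = 200_000_000
SESSIONS_PER_USER = 5
TURNS_PER_SESSION = 10

daily_requests = DAU * SESSIONS_PER_USER * TURNS_PER_SESSION
avg_rps = daily_requests / 86_400
peak_rps = avg_rps * 2.5
daily_storage_tb = daily_requests * 500 / 1e12  # 500 chars/turn, ~1 byte/char

print(f"Daily requests: {daily_requests:,}")         # 10,000,000,000
print(f"Average RPS:    {avg_rps:,.0f}")             # 115,741
print(f"Peak RPS:       {peak_rps:,.0f}")            # 289,352
print(f"New text/day:   {daily_storage_tb:.1f} TB")  # 5.0 TB
```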

3. High-Level Architecture: The Nervous System of AI

The architecture is split into three primary layers: the Ingress Layer, the Orchestration Layer, and the Intelligence Layer. Viewed another way, the UI, API Gateway, and Orchestrator form the Control Plane, while LLM serving, the Vector DB, and the Knowledge Base form the Data Plane. This separation ensures that scaling the user interface doesn't require scaling the expensive GPU resources proportionally.

3.1 The Global Blueprint

                                  [ USER CLIENTS ]
                            (Web, iOS, Android, Desktop)
                                         |
                                         v
 [ EDGE NETWORK ] <----------------> [ GLOBAL LOAD BALANCER ]
 (Cloudflare/Akamai)                 (Anycast IP, TLS Term)
                                         |
                                         v
 [ API GATEWAY / BORDER ] <--------> [ AUTH & RATE LIMITER ]
 (Kong / Envoy / App Mesh)           (JWT, OAuth2, Redis Quotas)
                                         |
                                         +-----------------------+
                                         |                       |
                                         v                       v
 [ ORCHESTRATION LAYER ] <-------> [ CHAT ORCHESTRATOR ]   [ AGENT WORKERS ]
 (Go / Rust Microservices)         (Context Mgmt, RAG)     (Tool Use, Python)
                                         |                       |
                                         +-----------+-----------+
                                                     |
                                                     v
 [ STORAGE & STATE ] <-----------+-----------+-------+-------+-----------+
                                 |           |               |           |
                           [ CACHE ]   [ HISTORY DB ] [ VECTOR DB ] [ BLOB STORE ]
                           (Redis)     (Cassandra)    (Pinecone)    (S3/GCS)
                                 |           |               |           |
                                 +-----------+-------+-------+-----------+
                                                     |
                                                     v
 [ INTELLIGENCE LAYER ] <------------------> [ INFERENCE ENGINE ]
 (GPU Clusters / K8s)                        (vLLM / TensorRT-LLM)
                                                     |
                                         [ LLM / VFM MODELS ]
                                         (Llama 3, GPT-4o, etc.)

3.2 Component Roles in Detail

  1. API Gateway: Handles request routing, SSL termination, and provides a unified interface. For a chatbot, this gateway must support long-lived connections (WebSockets) and streaming responses (SSE).
  2. Chat Orchestrator: This is the stateful (or semi-stateful) layer that coordinates between history databases, vector databases, and the inference engine. It is responsible for Prompt Engineering: transforming the raw user input into a rich, context-aware prompt.
  3. Inference Engine: The actual GPU-accelerated service. We use specialized engines like vLLM because they implement advanced memory management (PagedAttention) which is the difference between serving 10 users per GPU or 1,000.
  4. Vector Database: Stores high-dimensional representations (embeddings) of structured and unstructured knowledge. When a user asks a specific question, we query this database to find relevant documents to “ground” the model’s response.
  5. History Database: A high-write-throughput database like Cassandra or ScyllaDB. Every turn in every conversation is a record. At scale, this database grows by petabytes.

4. Requirement Deep-Dive: Scaling the “Unscalable”

4.1 Traffic Analysis (The Principal Engineer’s Spreadsheet)

Let’s look at the numbers again. With 200 million daily active users (DAU) averaging 50 messages a day (5 sessions × 10 turns, per the estimates above), we are looking at 10 billion messages.

  • Inbound Bandwidth: 10B * 1KB/message ≈ 10 TB/day.
  • Outbound Bandwidth: LLM responses are longer. 10B * 4KB/response ≈ 40 TB/day.
  • Write Operations: 20B writes/day (1 for user, 1 for bot) ≈ 230k writes per second.
  • Storage (7-year retention): 50 TB/day * 2500 days ≈ 125 Petabytes.

This is not a “simple” database problem. This requires a sharded, multi-regional database architecture where we partition by session_id or user_id.


5. Component Deep-Dives: Engineering the Invisible

5.1 The Orchestrator: Context Management & The Sliding Window

An LLM has a finite “context window” (e.g., 128k tokens for GPT-4). However, as a conversation grows, you can’t just keep appending messages. Eventually, you run out of space, and the cost per request explodes (since you pay per token processed).

Strategies for Context Truncation:

  1. First-In-First-Out (FIFO): Drop the oldest messages. Simple but loses the “start” of the conversation which often contains instructions.
  2. Summarization Cache: When the context reaches 75% of the limit, trigger an async job to summarize the oldest 50% into a concise paragraph. Replace those 50% tokens with the summary.
  3. Selective Memory (Entity-Based): Extract key entities (User’s name, preferences, discussed topics) and store them in a structural “User Profile” in the History DB. Inject this profile into every prompt.
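Strategy 2 can be sketched in a few lines; `count_tokens` and `summarize` below are hypothetical stand-ins for a real tokenizer and an async LLM summarization call:

```python
# Sketch of the summarization-cache strategy: once usage crosses the
# threshold, replace the oldest 50% of messages with a single summary.
MAX_CONTEXT_TOKENS = 4096
SUMMARIZE_THRESHOLD = 0.75  # trigger at 75% of the limit

def count_tokens(messages):
    # Crude stand-in: ~4 characters per token on average English text.
    return sum(len(m["content"]) for m in messages) // 4

def summarize(messages):
    # In production this is an async LLM call; here we just mark the stub.
    return {"role": "system", "content": f"[Summary of {len(messages)} earlier messages]"}

def maybe_compact(messages):
    """Replace the oldest 50% of messages with a summary once past the threshold."""
    if count_tokens(messages) < MAX_CONTEXT_TOKENS * SUMMARIZE_THRESHOLD:
        return messages
    half = len(messages) // 2
    return [summarize(messages[:half])] + messages[half:]

history = [{"role": "user", "content": "x" * 2000}] * 8  # ~4000 tokens total
compacted = maybe_compact(history)
print(len(compacted))  # 5: one summary + the 4 most recent messages
```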

5.2 The LLM Serving Engine: The Math of KV Caching

Why do we need specialized engines? To understand this, we must understand the KV Cache. During LLM generation, the model predicts one token at a time. To predict the $N$-th token, it needs the activations of all previous $N-1$ tokens. Instead of recomputing these every time (which would be $O(N^2)$), we store the results in a Key-Value (KV) cache.

The Memory Problem: A Llama-3 70B model with a 128k context window:

  • KV cache size per token: ~1 MB (depending on precision and layer count).
  • Max context for one user: 128,000 tokens * 1MB ≈ 128 GB.
  • A single A100 GPU has only 80GB of VRAM.
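The exact per-token figure varies widely with layer count, KV-head count (grouped-query attention shrinks it sharply), and precision. A small calculator, assuming Llama-3 70B's published dimensions (80 layers, head dimension 128, 64 query heads, 8 KV heads, FP16):

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_value=2):
    """KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * precision."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value

# Llama-3 70B dimensions: 80 layers, head_dim 128, FP16 (2 bytes).
full_mha = kv_bytes_per_token(layers=80, kv_heads=64, head_dim=128)  # no GQA
gqa = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)        # 8 KV heads (GQA)

print(f"Full attention: {full_mha / 2**20:.2f} MiB/token")   # 2.50 MiB
print(f"With GQA:       {gqa / 2**20:.2f} MiB/token")        # 0.31 MiB
print(f"128k ctx (GQA): {gqa * 128_000 / 2**30:.1f} GiB")    # 39.1 GiB, before weights
```

Even the GQA-reduced figure leaves little room once the ~140 GB of FP16 weights are accounted for, which is why memory management dominates serving design.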

The Solution: PagedAttention. Just like operating systems use paging to handle fragmented memory, vLLM uses PagedAttention: it breaks the KV cache into small fixed-size blocks.

  • Zero Fragmentation: Memory is only allocated when needed.
  • Shared Memory: If multiple users are asking questions about the same document (e.g., a shared corporate policy), the KV cache for that document can be shared across multiple requests, saving GIGABYTES of VRAM.
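A toy sketch of the idea (not vLLM's actual implementation): KV memory is carved into fixed-size blocks handed out on demand, and a block holding a shared prefix (e.g., a common system prompt) is reference-counted rather than copied:

```python
BLOCK_SIZE = 16  # tokens per block (illustrative)

class BlockAllocator:
    """Toy PagedAttention-style allocator with copy-on-write sharing."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def allocate(self):
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block):
        """Map an existing block into another request's table (no copy)."""
        self.refcount[block] += 1
        return block

    def release(self, block):
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)
            del self.refcount[block]

alloc = BlockAllocator(num_blocks=1024)
prefix_block = alloc.allocate()  # system-prompt KV, computed once
req_a = [alloc.share(prefix_block), alloc.allocate()]  # two requests reuse it
req_b = [alloc.share(prefix_block), alloc.allocate()]
print(alloc.refcount[prefix_block])  # 3 references, one physical block
```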

5.3 Advanced Retrieval-Augmented Generation (RAG)

Standard RAG (retrieve and dump) often fails because the retrieved context is too noisy or poorly formatted.

The “Agentic RAG” Pipeline:

  1. Query Decomposition: If a user asks “Compare the Q3 earnings of Apple and Tesla,” the system splits this into two sub-queries: “Apple Q3 earnings” and “Tesla Q3 earnings.”
  2. Hybrid Search: Combining Vector Search (semantic similarity) with BM25 (keyword matching). This ensures that if a user searches for a specific part number, they find it exactly.
  3. HyDE (Hypothetical Document Embeddings): The system first asks an LLM to “write a fake answer.” It then uses that fake answer to search the vector database. This works because the fake answer is closer in “semantic space” to the real answer than the user’s question is.
  4. Reranking: After getting the top 100 documents from the Vector DB, we use a slower but more accurate Cross-Encoder model to re-score them and take the top 5.
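Steps 2 and 4 can be illustrated with Reciprocal Rank Fusion (RRF), one common way to merge the vector and BM25 rankings before the cross-encoder pass; the document IDs below are made up:

```python
def rrf_fuse(rankings, k=60):
    """rankings: list of ordered doc-id lists. Returns doc ids by fused RRF score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc_7", "doc_2", "doc_9"]  # semantic-similarity order
bm25_hits   = ["doc_2", "doc_4", "doc_7"]  # keyword-match order
fused = rrf_fuse([vector_hits, bm25_hits])
print(fused[:2])  # ['doc_2', 'doc_7'] -- the docs that appear in both lists win
```

In production the fused top-N list would then be passed to the cross-encoder reranker from step 4.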

6. Implementation: The Scalable Streaming Orchestrator

Below is a production-grade Python implementation of an asynchronous orchestrator. It handles RAG, Context Management, and Streaming.

import asyncio
import json
import logging
import time
from typing import AsyncGenerator, List

import aiohttp
from pydantic import BaseModel
import redis.asyncio as redis
from motor.motor_asyncio import AsyncIOMotorClient

# Set up logging for observability
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class Message(BaseModel):
    role: str
    content: str

class ChatConfig:
    MAX_CONTEXT_TOKENS = 4096
    RECOVERY_THRESHOLD = 0.8  # Summarize after 80% usage
    EMBEDDING_MODEL = "text-embedding-3-small"
    LLM_ENDPOINT = "http://inference-internal.ai.corp/v1/chat/completions"

class AdvancedOrchestrator:
    def __init__(self, session_id: str):
        self.session_id = session_id
        self.redis = redis.from_url("redis://localhost:6379/0", decode_responses=True)
        self.db = AsyncIOMotorClient("mongodb://localhost:27017").ai_platform
        self.history_key = f"messages:{session_id}"

    async def _get_recent_history(self) -> List[Message]:
        """Fetches and reconstructs message history with caching."""
        cached = await self.redis.lrange(self.history_key, 0, -1)
        if cached:
            return [Message(**json.loads(m)) for m in cached]

        # Fallback to persistent storage
        cursor = self.db.history.find({"session_id": self.session_id}).sort("ts", 1)
        history = [Message(role=d["role"], content=d["content"]) for d in await cursor.to_list(None)]

        # Rehydrate cache (RPUSH preserves chronological order)
        if history:
            await self.redis.rpush(self.history_key, *[json.dumps(m.dict()) for m in history])
        return history

    async def _retrieve_context(self, query: str) -> str:
        """Semantic RAG step: Vector DB lookup."""
        # In a real system, this would call Pinecone/Milvus:
        # 1. Generate Embedding
        # 2. Vector Search
        # 3. Filter by metadata (namespace/user_id)
        return "Reference: The system uses PagedAttention for KV cache efficiency."

    async def stream_chat(self, user_input: str) -> AsyncGenerator[str, None]:
        """The main loop: Retrieve -> Augment -> Generate -> Persist."""

        # 1. Parallel History & Context Loading
        history, context_docs = await asyncio.gather(
            self._get_recent_history(),
            self._retrieve_context(user_input),
        )

        # 2. Construct the system prompt with RAG grounding
        system_prompt = f"You are a helpful assistant. Use this context: {context_docs}"
        messages = [{"role": "system", "content": system_prompt}]
        messages.extend([m.dict() for m in history])
        messages.append({"role": "user", "content": user_input})

        # 3. Streaming Request to Inference Cluster
        full_response = ""
        async with aiohttp.ClientSession() as session:
            payload = {
                "model": "llama-3-70b-pro",
                "messages": messages,
                "stream": True,
                "temperature": 0.2,
            }

            async with session.post(ChatConfig.LLM_ENDPOINT, json=payload) as resp:
                if resp.status != 200:
                    yield "Error: Service Unavailable"
                    return

                async for line in resp.content:
                    if not line:
                        continue
                    decoded_line = line.decode("utf-8").strip()
                    if not decoded_line.startswith("data: "):
                        continue
                    if decoded_line == "data: [DONE]":
                        break
                    try:
                        chunk = json.loads(decoded_line[6:])
                        token = chunk["choices"][0]["delta"].get("content", "")
                        full_response += token
                        yield token
                    except Exception as e:
                        logger.error(f"Parse error: {e}")

        # 4. Asynchronous State Persistence (off the hot path)
        asyncio.create_task(self._persist_turn(user_input, full_response))

    async def _persist_turn(self, user_text: str, bot_text: str):
        """Heavy lifting for DB writes moved out of the hot path."""
        ts = time.time()
        turns = [
            {"session_id": self.session_id, "role": "user", "content": user_text, "ts": ts},
            {"session_id": self.session_id, "role": "assistant", "content": bot_text, "ts": ts + 0.1},
        ]

        # Write to Mongo
        await self.db.history.insert_many(turns)

        # Update Redis Cache
        await self.redis.rpush(self.history_key, json.dumps(turns[0]), json.dumps(turns[1]))
        await self.redis.ltrim(self.history_key, -50, -1)  # Keep only last 25 turns (50 items) in cache

7. Scaling & Optimization: The Billion Dollar Infrastructure

7.1 Model Serving Strategies: Throughput vs Latency

In a massive system, you have different traffic types. We use Route Optimization to handle this:

| Traffic Type | Example | Optimization Strategy |
| --- | --- | --- |
| B2C Chat | Daily user banter | High Throughput: Quantized (FP8/INT4) models, high batch sizes (256+). |
| Enterprise RAG | Legal document analysis | High Precision: Full FP16 precision, Speculative Decoding for speed. |
| API/Automation | Systematic data extraction | Batch Processing: Lower priority queues, utilized on “Spot Instances.” |

7.2 The Speculative Decoding Architecture

To achieve sub-50ms token latency on high-quality models, we use a “Draft-Verify” loop.

  1. Draft: A tiny 1B parameter model (the Drafter) guesses the next 5 tokens.
  2. Verify: The 70B model (the Oracle) checks all 5 tokens in a single parallel step.
  3. Result: If 4 tokens were right, we just saved 3 round-trips through the big model. Hardware throughput increases by 1.8x to 2.5x.
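The throughput claim can be derived from a back-of-envelope model: if each drafted token is accepted independently with probability p (a simplification; real acceptance is correlated with context difficulty), the expected tokens produced per big-model pass is:

```python
def expected_tokens_per_pass(p, draft_len=5):
    """Expected accepted prefix length (sum of p^i) plus the verifier's own token."""
    return sum(p**i for i in range(1, draft_len + 1)) + 1

for p in (0.6, 0.8, 0.9):
    print(f"p={p}: {expected_tokens_per_pass(p):.2f} tokens/pass")
# p=0.6: 2.38, p=0.8: 3.69, p=0.9: 4.69 -- consistent with a ~2x speedup
# once verification overhead is subtracted.
```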

7.3 Multi-Modal Data Flow: Handling Images & PDFs

When a user uploads an image (“What’s in this picture?”), the system doesn’t just send pixels to the LLM.

  1. Vision Encoder: A model like CLIP or a ViT (Vision Transformer) converts the image into a 1024-dimension vector.
  2. Projection Layer: A small bridge network maps the image vector into the “Word Embedding” space of the LLM.
  3. Interleaved Inference: The LLM treats the image as a sequence of “Visual Tokens” similar to text tokens.

8. Database Architecture: Trillion-Row Scalability

For chat history, RDBMS (PostgreSQL/MySQL) will fail due to the sheer volume of writes and the need for global replication.

8.1 The ScyllaDB/Cassandra Schema

We use a Wide Column Store approach.

  • Partition Key: session_id (Ensures all messages for one chat are on the same physical node).
  • Clustering Key: message_id (Sorted by time for efficient range queries).
CREATE TABLE chat_history (
    session_id UUID,
    message_id TIMEUUID,
    role TEXT,      -- 'user', 'assistant', 'system'
    content TEXT,
    metadata TEXT,  -- JSON-encoded token counts, model versions (CQL has no JSONB type)
    PRIMARY KEY (session_id, message_id)
) WITH CLUSTERING ORDER BY (message_id ASC);

Global Distribution: We use Multi-Region Replication. If a user travels from NY to London, the Local Load Balancer directs them to the eu-west-1 stack. The data is replicated asynchronously between regions, achieving < 500ms cross-continent sync.


9. Monitoring & Guardrails: Safety at Scale

9.1 The “Jailbreak” Detection Pipeline

We implement a Three-Layer Guardrail:

  1. Pre-Inference: A regex + FastText model checks for “forbidden words” and dangerous patterns (e.g., prompt injection).
  2. In-Inference: Token-level stopping. If the model starts generating the string curl http://..., the inference engine kills the stream immediately.
  3. Post-Inference: A “Moderation API” (like OpenAI’s moderation-latest) checks the full response before it’s finalized in the DB.

9.2 Observability Dashboard

Principal engineers look at Distributions, not averages.

  • P99 TTFT: The experience of the unluckiest 1% of users. If this spikes, your inference queue is saturated.
  • Token Convergence: Ratio of generated tokens to prompt tokens. Helps in calculating the ROI of the system.
  • Prompt-Cache Hit Rate: % of queries that reused a previously computed KV cache.
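Why distributions matter can be shown in a few lines; a mean hides exactly the queue saturation that P99 exposes:

```python
def percentile(samples, pct):
    """Naive nearest-rank percentile over raw samples (stdlib only)."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[idx]

# 990 fast responses and 10 stragglers stuck behind a saturated inference queue.
ttft_ms = [120] * 990 + [4000] * 10
print(f"mean: {sum(ttft_ms) / len(ttft_ms):.0f} ms")  # 159 ms -- looks healthy
print(f"p99:  {percentile(ttft_ms, 99)} ms")          # 4000 ms -- clearly not
```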

10. Cost Analysis: The Unit Economics of AI

| Item | Cost (H100 instance) | Monthly (scaled) |
| --- | --- | --- |
| Model Weights | $0 (Open Source / Llama) | $0 |
| GPU Rental (8x H100) | $24 / hour | $17,280 / node |
| Typical Cluster (100 Nodes) | $2,400 / hour | $1.7M / month |
| Token Cost (Internal) | ~$0.005 / 1k tokens | Depends on usage |

How to reduce this?

  • Distillation: Train a smaller model to act as a specialist for common tasks (e.g., “Summarization”).
  • Quantization: Moving from 16-bit to 4-bit weights reduces RAM needs by 75% without significant quality loss.
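The quantization claim is straightforward arithmetic on the weight footprint (ignoring activations and KV cache):

```python
# Memory footprint of the weights alone for a 70B-parameter model.
PARAMS = 70e9
footprint_gb = {name: PARAMS * bits / 8 / 1e9
                for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]}
for name, gb in footprint_gb.items():
    print(f"{name}: {gb:.0f} GB")
# FP16: 140 GB -- needs multiple 80GB GPUs for the weights alone
# INT4:  35 GB -- a 75% reduction, fitting on one 80GB card with room for KV cache
```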

11. Real-World Failure Modes & Mitigation

  1. The Thundering Herd: 1M users all message at once (e.g., during a major event).
    • Mitigation: Adaptive Rate Limiting. Tier users based on subscription status or historical value.
  2. Context Poisoning: User provides a 50-page PDF that “brainwashes” the model into giving bad advice.
    • Mitigation: Isolated RAG environments. The model doesn’t “know” the document is true; it is instructed to treat it as “External provided info.”
  3. Silent Drift: The model’s answers become shorter and less helpful over time.
    • Mitigation: Weekly Golden Dataset benchmarking. Compare production outputs against a set of 1,000 “Perfect Answers.”

12. Future Outlook: The Agentic Web

We are moving away from “Chat” and toward “Task Orchestration.” The next version of this architecture replaces the Chat Orchestrator with an Agent Loop.

  • Tools: Browsers, Python interpreters, and API clients.
  • Reasoning: ReAct (Reason + Act) patterns where the model thinks, acts, observes, and repeats.
  • Long-term Personalization: Storing user “Archetypes” in Knowledge Graphs.


14. Component Deep-Dive: The API Gateway & Connection Management

In a standard web app, HTTP is stateless. In a chatbot, we need Streaming (SSE) or Persistent (WebSockets) connections. This introduces the “Sticky Session” problem at the Load Balancer level.

14.1 Handling 1 Million Concurrent WebSockets

When you have a million users connected via WebSockets, you cannot just reboot a load balancer. Every disconnection triggers a “Thundering Herd” where a million clients try to reconnect simultaneously, potentially DOS-ing your Auth service.

Principal Engineer’s Approach:

  1. Consistent Hashing: Use a consistent hashing ring at the Load Balancer (e.g., Nginx with ip_hash or Envoy with maglev). This ensures that if a specific gateway node stays up, the user session stays on that node.
  2. Connection Bloat: Each open WebSocket takes ~10-50KB of memory on the server. A single 64GB RAM node can theoretically handle ~1M connections, but context-switching and TCP overhead usually cap this at 100k-200k.
  3. Backpressure: If the LLM serving engine is slow, the Orchestrator must apply backpressure to the Frontend. We use a Token Bucket Algorithm per user session to ensure no one user starves the system.
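A minimal token bucket for the backpressure step, with time injected as a parameter so the refill logic is deterministic and testable:

```python
class TokenBucket:
    """Per-session rate limiter: `rate` tokens/sec, `capacity` = burst size."""
    def __init__(self, rate, capacity, now=0.0):
        self.rate, self.capacity = rate, capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now, cost=1.0):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)  # 5 req/s sustained, burst of 10
burst = [bucket.allow(now=0.0) for _ in range(12)]
print(burst.count(True))      # 10: the burst is absorbed, then requests are throttled
print(bucket.allow(now=1.0))  # True: one second of elapsed time refills 5 tokens
```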

14.2 The “Graceful Shutdown” Pattern

To avoid the Thundering Herd during deployments:

  1. The Gateways stop accepting new connections.
  2. Existing connections are slowly closed over a 15-minute window (e.g., closing 1% of connections every 10 seconds).
  3. Clients implement Exponential Backoff with Jitter on reconnections to spread the load.
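The client side of step 3 ("full jitter" backoff) can be sketched as follows; the base and cap values are illustrative:

```python
import random

def reconnect_delay(attempt, base=1.0, cap=60.0, rng=random.random):
    """Full-jitter backoff: sleep a uniform random amount up to an
    exponentially growing ceiling, so a million clients don't reconnect
    in lockstep after a gateway restart."""
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling

random.seed(42)
delays = [reconnect_delay(a) for a in range(6)]
print([f"{d:.1f}s" for d in delays])  # random spread under a growing ceiling
```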

15. The “Memory” System: From Silicon to Intelligence

Static RAG is often called “Short-term Memory.” But what if a user told you their name in 2023 and expects you to remember it in 2025?

15.1 Hierarchical Memory Architecture

| Memory Tier | Storage Medium | Latency | Purpose |
| --- | --- | --- | --- |
| L1: Context | GPU VRAM | < 1ms | Immediate tokens in the current sliding window. |
| L2: Session Cache | Redis | 2-5ms | Full history of the current chat session. |
| L3: Profile Store | DynamoDB / Cassandra | 10-20ms | Long-term user preferences (extracted via LLM). |
| L4: Knowledge | Vector DB (Pinecone) | 50-100ms | Global facts and uploaded documents. |

15.2 MemGPT Concept: The OS for LLMs

Inspired by the MemGPT research, we design our orchestrator to act like an Operating System:

  • Virtual Memory: The model can “Swap” context to the database and “Page” it back in when needed.
  • Interrupts: The system can trigger a “Recall” function if it detects the user is asking about a past event that is not in the current context.

16. The Evaluation Pipeline: The Secret Sauce of Quality

A model that passes benchmarks (MMLU) might still be a terrible chatbot. We need a Systemic Evaluation pipeline.

16.1 LLM-as-a-Judge

We use a “Gold Standard” model (e.g., GPT-4o) to grade our “Candidate” model (e.g., a Llama-3 fine-tune) with rubric-based grading:

  • Faithfulness: Does the answer stay within the provided context? (Score 1-5)
  • Relevance: Does it actually answer the user’s question? (Score 1-5)
  • Tone: Is the persona consistent? (Score 1-5)

16.2 A/B Testing in AI

Traditional A/B testing measures clicks. AI A/B testing measures User Retention and Session Depth.

  • Experiment A: Context window of 8k tokens.
  • Experiment B: Context window of 16k tokens.
  • Metric: Is the user more likely to ask a follow-up question in Group B? If so, the higher GPU cost of B might be justified by higher engagement.

17. Deployment & Infrastructure: Kubernetes for AI

Deploying AI on K8s requires specialized resource management. Standard HPA (Horizontal Pod Autoscaler) on CPU is insufficient.

17.1 The “Custom Metrics” Autoscaler

We use KEDA (Kubernetes Event-driven Autoscaling) to scale GPU pods based on:

  1. Request Queue Length: How many users are waiting for a token?
  2. Inference Throughput: Are our GPUs hitting their FLOP limits?

17.2 Tainting and Tolerations

GPU nodes are expensive. We “Taint” them so that standard web services (like the dashboard) don’t accidentally land on an H100 node.

  • Taint: dedicated=gpu:NoSchedule
  • Toleration: Only the inference engine pods have the toleration to be scheduled on these nodes.

17.3 Multi-Cluster Fleet Management

OpenAI or Meta don’t run in a single region. They use a Global Traffic Manager (GTM).

  • If us-east-1 is at 90% GPU capacity, incoming requests are routed to eu-west-1 at the cost of slight network latency (100ms vs 10ms).
  • Cross-Region KV Caching: We replicate the first part of common prompts (System Prompts) globally so that the KV cache starts “hot” even in a new region.

18. Training & Fine-Tuning: The RLHF Cycle

To get that “ChatGPT feel,” the base model must undergo Alignment.

18.1 Supervised Fine-Tuning (SFT)

We use a high-quality dataset of ~100k “Perfect Conversations.” We use LoRA (Low-Rank Adaptation) to train only 1% of the model’s weights, which is 10x faster and requires 4x less VRAM.

18.2 Reinforcement Learning from Human Feedback (RLHF)

  1. Ranking: Humans are shown two model outputs and asked to pick the better one.
  2. Reward Model: A model is trained to predict the human preference.
  3. PPO/DPO: The chatbot model is trained to maximize the “Reward” from the Reward Model. DPO (Direct Preference Optimization) is often preferred in production now as it is more stable and doesn’t require a separate reward model during the main training loop.

19. Security, Privacy & Ethics: The Fortress Revisited

19.1 Differential Privacy in Training

To prevent the model from memorizing a user’s specific password from the history DB, we add Gaussian Noise to the gradients during fine-tuning. This mathematically guarantees that no single user’s data can be perfectly reconstructed.

19.2 The “Red Teaming” Infrastructure

We hire security researchers to try and “break” the model. We then automate their successful attacks into a Regression Test Suite. Every new model version must pass these “Jailbreak Tests” before deployment.


20. Conclusion: The AI-First OS

Designing a chatbot at scale is the ultimate test of a system designer. It requires the precision of a high-frequency trading system, the scale of a global social network, and the nuance of a philologist.

As the “Principal Engineer” of this system, your goal is not just to build a bot, it is to build the Interface of the Future. A system that doesn’t just respond to text, but understands intent, anticipates needs, and operates with the reliability of a utility.


21. The Tokenizer System: The Alphabet of Machines

Before an LLM can understand text, it must be “tokenized.” A common mistake is thinking tokens = words. In reality, tokens are sub-word units.

21.1 Byte Pair Encoding (BPE)

Most modern chatbots use BPE.

  • The Problem: If you use a vocabulary of full words, the vocabulary size becomes infinite. If you use characters, the sequence length becomes too long, exceeding the model’s context window.
  • The Solution: Start with characters and iteratively merge the most frequent pairs.
    • Example: “low”, “lower”, “newest”, “widest”. BPE might merge “e” and “s” into “es”, then “es” and “t” into “est”.
  • Impact on Multi-lingual Performance: If your tokenizer is trained primarily on English, a single Chinese character might take 3-4 tokens, making the model “more expensive” and “dumber” for Chinese users. Principal engineers ensure the tokenizer is trained on a diverse, representative corpus.
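The merge loop can be demonstrated on the example corpus above, using word frequencies similar to the classic BPE example (low×5, lower×2, newest×6, widest×3) so that "es" and "est" emerge as the first merges. Real tokenizers work on bytes and huge corpora, but the mechanism is the same:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count all adjacent symbol pairs across the corpus; return the most common."""
    pairs = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Fuse every occurrence of `pair` into a single symbol."""
    merged = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged.append(out)
    return merged

corpus = []
for word, freq in [("low", 5), ("lower", 2), ("newest", 6), ("widest", 3)]:
    corpus += [list(word)] * freq

merges = []
for _ in range(2):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(corpus, pair)
    merges.append(pair)
print(merges)  # [('e', 's'), ('es', 't')] -- the "es" then "est" merges
```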

22. Streaming Architecture: SSE vs WebSockets

Why does ChatGPT use Server-Sent Events (SSE) instead of WebSockets?

| Feature | Server-Sent Events (SSE) | WebSockets |
| --- | --- | --- |
| Direction | Unidirectional (Server -> Client) | Bidirectional |
| Protocol | Standard HTTP | HTTP Upgrade to the WebSocket protocol |
| Ease of Load Balancing | High (it’s just HTTP) | Medium (requires sticky sessions) |
| Reconnection | Automatic by the browser | Must be handled in JS |

The Decision: Since a chatbot response is mostly a long stream from the server, SSE is simpler and works better with existing HTTP infrastructure (CDNs, WAFs). We only use WebSockets if we need complex multi-player features (e.g., collaborative editing).


23. Implementing the Guardrail: The “Moderator” Pattern

Below is a Python snippet showing how to implement a safety layer that uses a smaller model to audit the main model’s output in real-time.

import aiohttp
from typing import AsyncGenerator

# Internal endpoint of the safety classifier service (placeholder URL)
SAFETY_SERVICE_URL = "http://safety-internal.ai.corp/v1/classify"

class SafetyGuardrail:
    def __init__(self, admin_model="llama-guard-3"):
        self.admin = admin_model

    async def is_safe(self, text: str) -> bool:
        """
        Calls a specialized safety model to classify text.
        Returns False if the text violates safety policies (e.g., violence, PII).
        """
        # Optimized for < 50ms latency
        payload = {"prompt": f"[INST] Is this text safe? {text} [/INST]"}
        async with aiohttp.ClientSession() as session:
            async with session.post(SAFETY_SERVICE_URL, json=payload) as resp:
                result = await resp.json()
                return result['label'] == "safe"

    async def secure_stream(self, generator: AsyncGenerator[str, None]):
        """
        Wraps a token generator. Buffers tokens until a complete
        sentence/phrase is formed, audits it, then yields it.
        """
        buffer = ""
        async for token in generator:
            buffer += token
            # Audit at every sentence boundary (period/newline/question/exclamation)
            if any(char in token for char in [".", "\n", "?", "!"]):
                if await self.is_safe(buffer):
                    yield buffer
                    buffer = ""
                else:
                    yield "[CONTENT REJECTED FOR SAFETY]"
                    return

        # Yield any remaining safe buffer
        if buffer and await self.is_safe(buffer):
            yield buffer

24. Advanced Vector Search: Product Quantization (PQ)

As your Knowledge Base grows to millions of documents, storing raw high-dimensional float32 vectors (1,024-1,536 dimensions each) in RAM becomes prohibitively expensive.

The Compression Hack: Product Quantization

  1. Split: Divide a 1024-dimension vector into 8 sub-vectors of 128 dimensions each.
  2. Cluster: For each sub-vector space, find 256 “centroids” using K-Means.
  3. Assign: Replace each 128-dimensional sub-vector with its nearest centroid ID (1 byte).
  4. Result: Your 1024-dimension vector (originally 4,096 bytes as float32) is now just 8 bytes.
  5. Trade-off: You lose ~1-2% in retrieval precision but save 99% in RAM and gain 10x in search speed.
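The four steps above can be sketched with NumPy. This is a toy trainer/encoder (naive K-Means, few iterations, small codebooks) rather than a production quantizer — libraries such as FAISS implement the optimized version:

```python
import numpy as np

def train_pq(vectors, n_subvectors=8, n_centroids=256, iters=10, seed=0):
    """Train a Product Quantizer: one K-Means codebook per sub-vector space."""
    rng = np.random.default_rng(seed)
    n, dim = vectors.shape
    sub_dim = dim // n_subvectors
    codebooks = []
    for m in range(n_subvectors):
        sub = vectors[:, m * sub_dim:(m + 1) * sub_dim]
        # Initialize centroids from random training points
        centroids = sub[rng.choice(n, size=n_centroids, replace=n < n_centroids)]
        for _ in range(iters):
            # Assign each sub-vector to its nearest centroid
            dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
            assign = dists.argmin(1)
            # Recompute centroids (keep the old one if a cluster is empty)
            for k in range(n_centroids):
                mask = assign == k
                if mask.any():
                    centroids[k] = sub[mask].mean(0)
        codebooks.append(centroids)
    return codebooks

def encode_pq(vectors, codebooks):
    """Compress each vector to len(codebooks) uint8 centroid IDs."""
    sub_dim = vectors.shape[1] // len(codebooks)
    codes = np.empty((vectors.shape[0], len(codebooks)), dtype=np.uint8)
    for m, centroids in enumerate(codebooks):
        sub = vectors[:, m * sub_dim:(m + 1) * sub_dim]
        dists = ((sub[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        codes[:, m] = dists.argmin(1)
    return codes
```

With 8 sub-vectors and 256 centroids each (1 byte per ID), every encoded vector occupies exactly 8 bytes, as in the walkthrough above.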

25. Data Governance: The Right to be Forgotten in AI

GDPR Article 17 (Right to Erasure) is a nightmare for AI.

  1. The Easy Part: Deleting the user’s chat history from the Cassandra database.
  2. The Hard Part: LLMs can “memorize” data through fine-tuning. If a user’s data was used in SFT (Supervised Fine-Tuning), you cannot “un-train” it easily.
  3. The Solution:
    • Data TTL: We don’t use raw chat data for training until it’s at least 30 days old.
    • In-Memory Erasure: If a user deletes their account, we add their user_id to a “Negative Bloom Filter” in the Orchestrator. The model is then instructed (via system prompt) to never reference anything from that user’s history, even if it’s stored in embeddings.
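A sketch of such a “negative” filter, assuming a plain Bloom filter over erased user_ids (class name and sizes are illustrative). The failure mode is deliberately asymmetric: a rare false positive over-blocks a live user’s memories, but an erased user is never missed:

```python
import hashlib

class ErasureBloomFilter:
    """Probabilistic set of erased user_ids: may return rare false
    positives (over-blocking), never false negatives (never leaks)."""

    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, user_id: str):
        # Derive n_hashes independent bit positions from salted SHA-256
        for i in range(self.n_hashes):
            h = hashlib.sha256(f"{i}:{user_id}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, user_id: str):
        for p in self._positions(user_id):
            self.bits[p // 8] |= 1 << (p % 8)

    def maybe_erased(self, user_id: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(user_id))
```

The orchestrator checks `maybe_erased(user_id)` before any memory/embedding retrieval and skips that user’s history entirely on a hit.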

26. Case Study: The “Thundering Herd” at a Global AI Launch

When a major AI company launched its mobile app, traffic spiked 100x within 10 minutes.

What Failed?

  • The Auth Service: The sudden burst of JWT validations crashed the internal identity provider.
  • The GPU Scheduler: Kubernetes tried to spin up 1,000 GPU nodes. The cloud provider’s API hit its “Rate Limit,” causing nodes to hang in PENDING state.

The Fix:

  • Circuit Breakers: The API Gateway started serving a “Waitlist” page to non-premium users, protecting the infrastructure for paying customers.
  • Warm Pools: They now maintain a “Warm Pool” of 10% extra GPU capacity at all times, avoiding the 5-minute cold-start time of H100 nodes.

27. The Future: Multi-Agent Systems & MoE

The next evolution of ChatGPT is Mixture of Experts (MoE).

  • Instead of one dense giant, each MoE layer contains 8 or 16 “expert” sub-networks, and a learned Router activates only the top few per token.
  • For a math question, the Router sends the relevant tokens to experts that have specialized in math-like patterns.
  • Result: You get the capacity of a ~1.7-trillion-parameter model (as rumored for GPT-4) at roughly the inference cost of a ~100B-parameter dense model.
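The routing step can be illustrated with a small top-k gating function (NumPy, single token, illustrative only — real routers are trained linear layers with load-balancing losses):

```python
import numpy as np

def topk_route(router_logits: np.ndarray, k: int = 2):
    """Pick the k highest-scoring experts for one token and softmax-
    renormalize their gate weights. Returns (expert_ids, weights)."""
    idx = np.argsort(router_logits)[-k:][::-1]  # top-k expert ids, best first
    gates = np.exp(router_logits[idx] - router_logits[idx].max())
    return idx, gates / gates.sum()
```

Only the selected experts run a forward pass for this token; the others stay idle, which is where the inference savings come from.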

28. Final Summary Checklist for the Principal Engineer

| Category | Priority | Item |
| --- | --- | --- |
| Latency | P0 | Time To First Token (TTFT) < 300ms |
| Memory | P0 | PagedAttention implemented for KV cache |
| History | P1 | Cassandra/ScyllaDB with sharded session_id |
| RAG | P1 | Reranking layer (Cross-Encoders) included |
| Security | P2 | PII scrubbing and Moderator model active |
| Scale | P2 | Global traffic management with cross-region replication |

29. Distributed Training: The Forge of Intelligence

To build a model that can chat like a human, you first need to train it. Training a 70B+ model is a feat of distributed engineering that requires a perfectly synchronized cluster.

29.1 The Three Pillars of Parallelism

  1. Data Parallelism (DP): Each GPU has a copy of the model, but sees different data. Synchronizing gradients via All-Reduce becomes the bottleneck at scale.
  2. Tensor Parallelism (TP): A single matrix multiplication is split across multiple GPUs. This requires ultra-low latency (InfiniBand) between GPUs within a single node.
  3. Pipeline Parallelism (PP): Layers are split across different nodes. To avoid idle GPUs (“Bubbles”), we use micro-batching and 1F1B (One Forward, One Backward) scheduling.
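The pipeline “bubble” has a standard closed form under GPipe/1F1B-style scheduling: with p stages and m micro-batches, the idle fraction is (p - 1) / (m + p - 1), which is exactly why more micro-batches shrink the bubble:

```python
def pipeline_bubble_fraction(num_stages: int, num_microbatches: int) -> float:
    """Fraction of time pipeline stages sit idle ("bubble") under
    GPipe/1F1B-style scheduling: (p - 1) / (m + p - 1)."""
    p, m = num_stages, num_microbatches
    return (p - 1) / (m + p - 1)
```

For example, 4 stages with 8 micro-batches wastes 3/11 of the schedule, while 32 micro-batches cut that to under 9%.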

30. Graph RAG: Beyond Semantic Similarity

Vector databases (semantic search) are good at finding “similar” things, but they are terrible at “global reasoning.”

  • Query: “How has the user’s opinion on climate change evolved over the last 10 sessions?”
  • Vector Search Failure: It will find individual sentences about climate change, but it won’t understand the chronological relationship between them.

The Solution: Knowledge Graphs

By using a Graph Database (Neo4j), we can store entities as nodes (User, Topic: Climate Change) and interactions as edges (Opinion: Skeptical, Date: 2024-01-01).

  1. LLM as a Graph Constructor: The model extracts entities and relationships from every chat.
  2. Cypher Query Generation: When a complex question is asked, the orchestrator generates a graph query to traverse the user’s history conceptually.
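Step 2 can be as simple as templating a parameterized Cypher query. The schema here (User-[:EXPRESSED]->Opinion-[:ABOUT]->Topic) is an illustrative assumption, not a fixed convention:

```python
def opinion_history_query(user_id: str, topic: str):
    """Build a parameterized Cypher query plus its parameter dict,
    suitable for passing to a Neo4j driver session."""
    query = (
        "MATCH (u:User {id: $user_id})"
        "-[:EXPRESSED]->(o:Opinion)"
        "-[:ABOUT]->(t:Topic {name: $topic}) "
        "RETURN o.stance AS stance, o.date AS date "
        "ORDER BY o.date ASC"
    )
    return query, {"user_id": user_id, "topic": topic}
```

Traversing edges in date order is what gives the orchestrator the chronological view that pure vector search cannot provide.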

31. Human-in-the-Loop (HITL): The Quality Lab

The most important part of AI development is not the model code; it is the Data Quality.

31.1 Designing the Labeling Dashboard

At Google and Meta, we build custom internal tools for RLHF.

  • Blind Comparison: Labelers are shown two responses without knowing which model produced them.
  • Granular Feedback: Labelers highlight specific parts of the text that are “Hallucinations” or “Toxic.”
  • Active Learning: The system automatically identifies “High Uncertainty” queries (where the model is confused) and prioritizes them for human labeling.

32. Model Distillation: AI in Your Pocket

Serving a 70B model can burn thousands of dollars a month in GPU time. How do we make it cheap?

Knowledge Distillation

  1. The Teacher: A 405B parameter model (e.g., Llama-3 405B).
  2. The Student: A 1B parameter model.
  3. The Process: The student is trained not just on the text, but on the probability distribution (logits) of the teacher. It’s like a student learning not just the answer, but the teacher’s thought process.
  4. Edge Result: A 1B model distilled from a 405B model can often outperform a natively trained 7B model while being small enough to run on an iPhone 15 Pro.
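The “learn the thought process, not just the answer” idea corresponds to a KL-divergence loss on temperature-softened logits — the classic Hinton-style formulation. A NumPy sketch (the temperature T is a tunable assumption):

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = logits / T
    z = z - z.max(-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradient magnitudes stay comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return float((p * (np.log(p) - np.log(q))).sum(-1).mean() * T * T)
```

In practice this term is mixed with the ordinary cross-entropy on ground-truth tokens; the soft targets carry the teacher’s “dark knowledge” about which wrong answers are almost right.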

33. The Philosophy of Infinite Context

We are entering the era of “Infinite Context” (e.g., Gemini 1.5 Pro). From a system design perspective, this changes everything.

  • Ring Attention: A technique to compute attention across different GPUs in a circular fashion, allowing the context window to scale to millions of tokens.
  • The New RAG: If you can fit a 1-million-token book into the prompt, do you still need a Vector DB?
    • Answer: Yes, because processing 1M tokens every turn is too slow and expensive ($20 per query). RAG remains the “Fast Cache” for information.

34. Final Reflection: The Architect’s Responsibility

As we build these systems, we aren’t just shifting bits; we are building entities that will manage our schedules, write our code, and perhaps even comfort us in our loneliness.

The “System Design” of a chatbot is ultimately the design of a Human-Machine Relationship. Reliability is not just about uptime; it’s about Trust. Accuracy is not just about metrics; it’s about Truth.

Architecture is destiny. If you build a system that is brittle, the intelligence it contains will be brittle. If you build a system that is robust, transparent, and respectful of the user, you are contributing to a future where AI is not a mystery, but a partner.


FAQ

How does PagedAttention solve the KV cache memory problem for LLM serving?

PagedAttention splits the KV cache into small physical blocks of 16 tokens, allocated dynamically like OS virtual memory pages. This eliminates the waste from pre-allocating maximum sequence lengths (a 17-token request no longer wastes 2,031 tokens of VRAM in a 2,048 context). Shared system prompts across requests can point to the same physical pages, dramatically increasing the number of concurrent users per GPU.

What is speculative decoding and how does it speed up LLM inference?

A small 1B parameter drafter model predicts 5 tokens quickly, then the large 70B model verifies all 5 in a single parallel forward pass. Since verification costs roughly the same as generating one token, each batch of accepted drafts saves multiple full generation round-trips. This typically achieves 1.8x to 2.5x throughput improvement without any quality loss.

How do you manage chat history at scale for 200M daily active users?

Use a wide-column store like Cassandra or ScyllaDB with session_id as partition key and time-sorted message_id as clustering key. Redis caches recent session history (last 25 turns) for low-latency access. Multi-region replication keeps a user’s history available (and eventually consistent) when they travel across continents. At 10 billion messages per day, storage grows by petabytes and requires careful sharding strategies.

What is agentic RAG and how does it improve chatbot accuracy?

Agentic RAG goes beyond simple retrieve-and-dump by decomposing complex queries into sub-queries, combining vector search with BM25 keyword matching for hybrid retrieval, using HyDE (Hypothetical Document Embeddings) to generate fake answers for better semantic search, and applying cross-encoder reranking to select the top 5 most relevant documents from an initial pool of 100.


Originally published at: arunbaby.com/ml-system-design/0064-chatbot-system-design

Want to work together?

I take on projects, advisory roles, and fractional CTO engagements in AI/ML. I also help businesses go AI-native with agentic workflows and agent orchestration.

Get in touch