
"In the world of high-scale inference, 100 milliseconds isn’t just a delay; it’s a cost center. When serving millions of users, every nanosecond shaved off a kernel launch translates directly into throughput, compute efficiency, and bottom-line savings."

TL;DR

Real ML inference speedups come from mastering 14 techniques at the hardware-software boundary. Fused attention kernels keep computation in fast SRAM instead of slow HBM, with FlashAttention-3 achieving 1.2 PFLOPs/s in FP8 on Hopper GPUs. CUDA graphs eliminate CPU overhead by recording the entire computation as a static graph. PagedAttention manages KV cache like virtual memory with dynamic block allocation. Speculative decoding uses a small drafter model to predict multiple tokens verified in a single pass, multiplying throughput 1.8-2.5x. Tensor parallelism handles intra-node sharding over NVLink, while pipeline parallelism splits layers across nodes. For the systems that deploy these techniques at scale, see the chatbot system design and the open models retrospective covering the architectures being served.

A Formula 1 pit stop with the car elevated and multiple pneumatic tools working simultaneously on different parts

1. Introduction: The Billion-Dollar Latency Problem

The field of Machine Learning System Design has shifted. We are no longer in an era where "just adding more GPUs" is a viable scaling strategy. The modern challenge is Inference Efficiency. While many talk about quantization or pruning, those are often just the first steps. Real speedups, the kind that allow a startup to compete with giants or a giant to reduce its burn by millions, come from a deep understanding of the hardware-software boundary.

This guide is a deep dive into the 14 techniques that actually move the needle. We are moving past the high-level abstractions and into the world of fused kernels, CUDA graphs, and memory-aligned tensors. If you want to build systems that handle the demands of 2026, these are the concepts you must master.


2. Kernel-Level Mastery: The Front Lines of Speed

2.1 Fused Attention Kernels

Standard attention implementations calculate $Q, K, V$, then perform a series of operations: Softmax, Dropout, and Weighted Sum. Each of these typically involves a separate kernel launch and a trip back to Global Memory (HBM).

Architecture Diagram: Fused vs. Unfused Execution

[ UNFUSED ATTENTION ]                     [ FUSED ATTENTION (Flash) ]
       |                                           |
 +-----+-----+  (Launch 1)                   +-----+-----+
 | Q*K.T     | ----> [HBM]                   |  Load Q,K |
 +-----+-----+                               |  Compute  |
       |                                     |  Softmax  | (On-SRAM)
 +-----+-----+  (Launch 2)                   |  Update V |
 | Softmax   | ----> [HBM]                   +-----+-----+
 +-----+-----+                                     |
       |                                     [ Final Out ]
 +-----+-----+  (Launch 3)
 | Output    |
 +-----+-----+
  • Technique: Combining multiple mathematical operations into a single GPU kernel.
  • Why it works: It minimizes the "Memory Wall" bottleneck. By keeping intermediate results in fast SRAM (L1/Shared Memory) instead of writing them back to slow HBM, you reduce data movement significantly.
  • Deep Dive: In a fused kernel, the entire attention mechanism for a small "tile" of the matrices is performed by a single thread block. The threads use Warp-Level Primitives (like __shfl_sync) to communicate and perform reductions without ever hitting the L2 cache. This allows the model to compute attention in a single pass over the data.

2.2 FlashAttention Variants (v1, v2, v3)

FlashAttention isn’t just a kernel; it’s a re-imagining of how attention is calculated.

  • FlashAttention-1: Introduced Tiling and Recomputation. By computing the attention scores and the weighted sum in a single tiled pass over K and V, it avoided materializing the $O(N^2)$ attention matrix entirely.
  • FlashAttention-2: Optimized for parallelism across the sequence length and query heads. It redistributed the work across SMs and optimized the "Online Softmax" update, reducing the non-compute instruction overhead.
  • FlashAttention-3: Designed for NVIDIA Hopper (H100). It utilizes the TMA (Tensor Memory Accelerator), a hardware block that moves tensors between HBM and Shared Memory asynchronously. By overlapping the Producer (TMA) and Consumer (Tensor Cores) warps, it achieves up to 740 TFLOPs/s (roughly 75% of peak BF16/FP16 throughput). Using FP8, it pushes past 1.2 PFLOPs/s.
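The core trick shared by all three versions is the online softmax from FlashAttention-1. Below is a minimal NumPy sketch of that idea, not a GPU kernel: it makes one pass over K/V tiles while maintaining a running max, normalizer, and accumulator. Shapes, the tile size, and function names are illustrative.

```python
import numpy as np

def attention_reference(q, k, v):
    """Naive attention: materializes the full N x N score matrix."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v

def attention_tiled(q, k, v, tile=4):
    """One pass over K/V tiles with an online-softmax update;
    never holds more than one (N x tile) score block at a time."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    m = np.full((n, 1), -np.inf)   # running row max
    l = np.zeros((n, 1))           # running softmax normalizer
    acc = np.zeros((n, d))         # unnormalized output accumulator
    for j in range(0, k.shape[0], tile):
        s = q @ k[j:j+tile].T * scale                 # scores for this tile only
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        p = np.exp(s - m_new)                         # tile probabilities
        correction = np.exp(m - m_new)                # rescale previous state
        l = l * correction + p.sum(axis=-1, keepdims=True)
        acc = acc * correction + p @ v[j:j+tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 8, 16))
assert np.allclose(attention_tiled(q, k, v), attention_reference(q, k, v))
```

The rescaling by `correction` is exactly what lets the kernel stream tiles through SRAM without ever needing the full score matrix.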

2.3 RoPE Kernel Optimizations (Absorption Logic)

Rotary Positional Embeddings (RoPE) dominate the pre-attention phase.

  • Technique: Sin/Cos Pre-computation. Instead of calculating sines and cosines on the fly, pre-calculate the rotation values for all possible positions and store them in a small, fast lookup table in Constant Memory.
  • Absorption: In some advanced implementations, the RoPE rotation is mathematically absorbed into the Expansion Layer of the Attention kernel. This means the Query and Key rotation happens as the data is loaded into registers for the GEMM, effectively costing zero additional cycles.
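A minimal NumPy sketch of the pre-computation idea (table layout and function names are illustrative, not any particular library's API):

```python
import numpy as np

def build_rope_table(max_pos, head_dim, base=10000.0):
    """Precompute cos/sin for every (position, frequency) pair once,
    instead of evaluating trig functions inside the attention kernel."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    angles = np.outer(np.arange(max_pos), inv_freq)   # (max_pos, head_dim/2)
    return np.cos(angles), np.sin(angles)

def apply_rope(x, pos, cos, sin):
    """Rotate the even/odd channel pairs of x by the table entry for pos."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    c, s = cos[pos], sin[pos]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * c - x2 * s
    out[..., 1::2] = x1 * s + x2 * c
    return out

cos, sin = build_rope_table(max_pos=2048, head_dim=64)
q = np.random.default_rng(1).standard_normal(64)
q_rot = apply_rope(q, pos=100, cos=cos, sin=sin)
# A rotation preserves the norm of each channel pair, hence the vector norm.
assert np.isclose(np.linalg.norm(q_rot), np.linalg.norm(q))
```

In a fused kernel the `apply_rope` step is exactly what gets absorbed into the register-load phase of the GEMM.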

3. High-Performance Execution & Graph Optimization

3.1 CUDA Graph Capture: Masterclass

The standard PyTorch execution model is "Eager Mode." Every operation is dispatched to the GPU as a separate launch command over the PCIe bus. For a model with 80 layers and 5-6 operations per layer, this results in hundreds of launch commands per token.

Architecture Diagram: CUDA Graph Capture

[ EAGER MODE ]                           [ CUDA GRAPH MODE ]
 (CPU)      (GPU)                         (CPU)         (GPU)
   |--Cmd-->  |                             |             |
   |<--Ack--  |                             |--Launch-->  | (Entire Graph)
   |--Cmd-->  |                             |             | [Step 1]
   |<--Ack--  |                             |             | [Step 2]
   |          |                             |             | [Step 3]
  • Technique: Record a "Static Graph."
  • Why it works: The CPU is completely removed from the inner loop. The GPU driver knows the exact memory addresses and kernel dependencies for the entire model.
  • Restriction: Static graphs require static tensor shapes. This is why many high-speed inference engines use "Padding" or "Bucketization" to ensure the input sequence length fits into a pre-defined graph size (e.g., 512, 1024, 2048).
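The bucketization logic itself is simple. A pure-Python sketch, using the bucket sizes mentioned above (the helper name is hypothetical):

```python
import bisect

BUCKETS = [128, 512, 1024, 2048]  # one captured CUDA graph per bucket

def pick_bucket(seq_len, buckets=BUCKETS):
    """Return the smallest captured-graph bucket that fits seq_len.
    The input is then right-padded to this length so its shape matches
    the static shapes recorded when the graph was captured."""
    i = bisect.bisect_left(buckets, seq_len)
    if i == len(buckets):
        raise ValueError(f"sequence of {seq_len} tokens exceeds largest bucket")
    return buckets[i]

assert pick_bucket(100) == 128    # padded up to the 128 graph
assert pick_bucket(513) == 1024   # 512 doesn't fit, jump to 1024
```

The cost of the padding is wasted compute on the pad tokens; the gain is that a single graph launch replaces hundreds of individual kernel launches.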

3.2 Operator Fusion (Vertical & Horizontal)

  • Horizontal Fusion: If your model has 8 parallel attention heads, don’t launch 8 kernels. Combine them into one large kernel that processes all 8 heads in parallel.
  • Vertical Fusion: When you have a Bias Add followed by a GELU activation, merge them into the output phase of the preceding Linear layer kernel.
  • Impact: Reductions in kernel launch overhead and intermediate memory reads.
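NumPy cannot demonstrate launch counts, but it can show the algebraic identity that makes vertical fusion safe: the fused epilogue computes exactly what the three separate steps compute. A sketch using the tanh approximation of GELU (function names are illustrative):

```python
import numpy as np

def gelu(x):
    """tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def linear_bias_gelu_unfused(x, w, b):
    """Unfused: three logical kernels, two intermediate tensors in HBM."""
    y = x @ w        # kernel 1: GEMM
    y = y + b        # kernel 2: bias add
    return gelu(y)   # kernel 3: activation

def linear_fused_epilogue(x, w, b):
    """Fused: bias add and activation happen in the GEMM 'epilogue',
    while the output tile is still in registers (here: one expression)."""
    return gelu(x @ w + b)

rng = np.random.default_rng(2)
x = rng.standard_normal((4, 8))
w = rng.standard_normal((8, 8))
b = rng.standard_normal(8)
assert np.allclose(linear_bias_gelu_unfused(x, w, b),
                   linear_fused_epilogue(x, w, b))
```

Because the results are bit-for-bit comparable, a compiler (or a Triton kernel) can apply this fusion automatically without changing model outputs.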

4. The Decode-Stage Architecture: Memory-Bound Reality

In LLM inference, the "Prefill" stage (processing input) is compute-bound, but the "Decode" stage (generating tokens one by one) is memory-bound.

4.1 PagedAttention: Internals of Memory Management

The KV-cache is the largest memory consumer in serving.

  • Fragmented Memory Problem: In standard systems, you must pre-allocate memory for the maximum possible sequence length. A request requiring 17 tokens would waste 2031 tokens worth of VRAM in a 2048-token context.
  • The PagedAttention Solution: Split the KV-cache into "Physical Blocks." The default Block Size is 16 tokens. This allows dynamic allocation and non-contiguous storage.
  • Optimal Tuning: While 16 is the standard, high-throughput systems often move to 128 tokens for H100/H200 clusters. Smaller blocks (16) reduce internal fragmentation, but larger blocks (128) improve GPU parallelism during the attention read pass.
  • Copy-on-Write: This also allows for efficient Prefix Sharing. Multiple requests sharing a system prompt can point to the same physical memory blocks, enabling massive fan-out at minimal memory cost.
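A toy pure-Python block allocator illustrating block tables, prefix sharing via refcounts, and copy-on-write. All class and method names are hypothetical; a real system like vLLM also manages the underlying KV tensors, which are elided here:

```python
BLOCK = 16  # tokens per physical KV block

class PagedKVCache:
    """Toy allocator: each sequence holds a block table (list of physical
    block ids). A shared prefix points at the same physical blocks with a
    refcount; copy-on-write makes a block private on the first divergent write."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
        self.refcount = {}

    def alloc(self):
        b = self.free.pop()
        self.refcount[b] = 1
        return b

    def fork(self, table):
        """Prefix sharing: a new request reuses every block of an existing one."""
        for b in table:
            self.refcount[b] += 1
        return list(table)

    def write(self, table, i):
        """Copy-on-write: before writing into table[i], make it private."""
        b = table[i]
        if self.refcount[b] > 1:
            self.refcount[b] -= 1
            table[i] = self.alloc()   # the actual data copy would happen here
        return table[i]

cache = PagedKVCache(num_blocks=8)
prompt = [cache.alloc(), cache.alloc()]   # shared system prompt: 2 blocks
req_a = cache.fork(prompt)
req_b = cache.fork(prompt)
cache.write(req_b, 1)                     # request B diverges in block 1
assert req_a[1] == prompt[1]              # A still shares the prompt block
assert req_b[1] != prompt[1]              # B got a private copy
```

The fan-out described above falls out naturally: a thousand requests forked from one prompt cost one physical copy of its KV blocks.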

4.2 Persistent Kernels for Decode

A 70B-class model has roughly 80 transformer layers, so even one kernel launch per layer means 80 launches per generated token.

  • Technique: A Persistent Kernel is launched once. It stays on the GPU and "polls" a memory flag to know when the next token is ready.
  • Advantage: It bypasses the driver’s kernel launch logic entirely, reducing the gap between token generations to the speed of the hardware itself.

4.3 Speculative Decoding: The Algorithmic Speedup

Instead of predicting one token with a 70B model, we predict 5 tokens with a 1B "Drafter" model and verify them in a single pass with the 70B "Verifier."

Architecture Diagram: Speculative Decoding Timeline

[ TIME T1 ] Drafter: "The", "cat", "sat", "on" (High Speed)
[ TIME T2 ] Verifier: Check all 4.
            Result: [OK, OK, OK, FAIL]
            Token 4 rejected; the Verifier's own prediction "by" is used instead.
[ OUTPUT ]  "The cat sat by" (4 tokens for the price of one Verifier pass)
  • The Math of Speculation: The cost of verifying 5 tokens is only marginally higher than generating 1 token, because the model can process all 5 "Drafted" tokens in a single parallel batch.
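A pure-Python simulation of one speculation round under the simple greedy-matching variant. The verifier here is a stub that returns a fixed "ground truth" sequence; real systems accept or reject probabilistically against the verifier's distribution:

```python
def speculative_step(draft_tokens, verifier_fn):
    """One speculation round: the verifier scores all drafted positions in a
    single batched call, then we keep the longest accepted prefix plus the
    verifier's own correction token (greedy-matching variant)."""
    verified = verifier_fn(draft_tokens)   # one parallel verifier pass
    accepted = []
    for d, v in zip(draft_tokens, verified):
        if d == v:
            accepted.append(d)             # drafter was right: accept
        else:
            accepted.append(v)             # drafter was wrong: take the fix
            break                          # everything after is discarded
    return accepted

# Hypothetical verifier: pretend the large model would emit this sequence.
ground_truth = ["The", "cat", "sat", "by", "the"]
verifier = lambda drafts: ground_truth[:len(drafts)]

out = speculative_step(["The", "cat", "sat", "on"], verifier)
assert out == ["The", "cat", "sat", "by"]   # 4 tokens from one verifier pass
```

The expected speedup is the mean number of accepted tokens per round, which is why drafter quality (acceptance rate) matters more than drafter speed past a point.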

5. Memory & Hardware Alignment

5.4 Memory-Aligned Tensors

GPUs are most efficient when accessing memory in 128-byte or 256-byte aligned chunks.

  • Technique: Pad tensor dimensions so that each row spans a multiple of the alignment boundary. In FP16/BF16 (2 bytes per element), a 128-byte boundary means multiples of 64 elements: a head size of 120 (240 bytes per row) should be padded to 128 (256 bytes).
  • Impact: Enables Coalesced Memory Access, where multiple threads in a warp are served by a single memory transaction. For small matrices this can substantially improve effective memory bandwidth.
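The padding arithmetic in a small helper (assumes 2-byte FP16/BF16 elements and a 128-byte boundary; the function name is illustrative):

```python
def pad_to_alignment(dim, elem_bytes=2, boundary=128):
    """Smallest padded dimension whose row stride is a multiple of the
    alignment boundary (in bytes). elem_bytes=2 matches FP16/BF16."""
    elems_per_boundary = boundary // elem_bytes               # 64 fp16 elements
    return -(-dim // elems_per_boundary) * elems_per_boundary  # ceil to multiple

assert pad_to_alignment(120) == 128   # 240 B/row -> padded to 256 B/row
assert pad_to_alignment(128) == 128   # already aligned
assert pad_to_alignment(64) == 64
```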

5.2 Custom Sampling Kernels (Logits to Tokens)

The last step of any LLM is selecting the next token. Most frameworks either copy the logits back to the CPU or fall back to an unoptimized, generic "Top-K" kernel.

  • Technique: Implement a one-pass Top-P (Nucleus) Sampling kernel. This kernel performs the sort, prefix sum (scan), and random sampling in a single sweep over the logits without global synchronization.
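The sampling logic in NumPy (this version still performs a host-side sort and scan as separate steps; a real one-pass kernel fuses those stages on-device, which NumPy cannot express):

```python
import numpy as np

def top_p_sample(logits, p=0.9, rng=None):
    """Nucleus sampling: softmax, descending sort, cumulative sum, then draw
    from the smallest prefix of tokens whose probability mass reaches p."""
    rng = rng or np.random.default_rng()
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)             # sort (descending)
    csum = np.cumsum(probs[order])         # prefix sum / scan
    cutoff = np.searchsorted(csum, p) + 1  # nucleus size
    nucleus = order[:cutoff]
    nucleus_p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_p))

logits = np.array([5.0, 4.0, 0.1, -3.0])
token = top_p_sample(logits, p=0.9, rng=np.random.default_rng(0))
assert token in (0, 1)   # at p=0.9 the nucleus holds only the top two tokens
```

Keeping this whole pipeline on-GPU avoids a logits round-trip over PCIe on every single decoded token.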

6. Distributed Parallelism: Scaling to the Cluster

6.1 Tensor Parallelism (TP)

Splitting a single matrix multiplication across multiple GPUs (Megatron-LM style).

  • Mechanism: The same input is broadcast to every GPU. Each GPU multiplies it against its shard of the weight matrix. The partial results are combined via AllReduce.
  • Interconnect Factor: This is strictly designed for intra-node sharding. It requires the high bandwidth of NVLink (300GB/s - 1.8TB/s) or NVSwitch. Attempting TP over standard Ethernet or even InfiniBand (200-400Gbps) will cause communication latency to dwarf the compute gains.
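The arithmetic behind TP is just a blocked matmul plus a sum. A NumPy sketch that treats array slices as "GPUs" (a real implementation does the per-shard GEMMs on separate devices and the sum as an NCCL AllReduce over NVLink):

```python
import numpy as np

def tp_matmul(x, w, tp=4):
    """Megatron-style row-parallel matmul: each 'GPU' holds a slice of W's
    rows and the matching slice of x's columns; the AllReduce is the sum
    over the partial products."""
    x_shards = np.split(x, tp, axis=-1)   # activations split by column
    w_shards = np.split(w, tp, axis=0)    # weights split by row
    partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]  # per-GPU GEMM
    return sum(partials)                  # stands in for AllReduce(sum)

rng = np.random.default_rng(3)
x = rng.standard_normal((2, 16))
w = rng.standard_normal((16, 8))
assert np.allclose(tp_matmul(x, w), x @ w)
```

The AllReduce runs once per sharded layer, which is why interconnect bandwidth, not FLOPs, sets the TP scaling limit.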

6.2 Pipeline Parallelism (PP)

Splitting layers across GPUs (Layers 0-10 on GPU 1, 11-21 on GPU 2).

  • Optimization: 1F1B (One-Forward, One-Backward) schedule.
  • Bubble Formula: In a 1F1B schedule, the pipeline efficiency $\eta$ is roughly: \(\eta = \frac{M}{M + (S - 1)}\) where $M$ is the number of micro-batches and $S$ is the number of pipeline stages. To keep the “Bubble” (idle time) small, ensure that $M \gg S$.
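The formula in code, with two illustrative operating points:

```python
def pipeline_efficiency(micro_batches, stages):
    """1F1B efficiency: M useful slots out of M + (S - 1) total slots,
    the extra S - 1 slots being the pipeline fill/drain bubble."""
    return micro_batches / (micro_batches + stages - 1)

# With M >> S the bubble nearly vanishes; with M == S it is already large.
assert round(pipeline_efficiency(64, 8), 3) == 0.901   # ~10% idle
assert round(pipeline_efficiency(8, 8), 3) == 0.533    # ~47% idle
```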

7. Compiler & Logic Optimization

7.1 Triton: Beyond the DSL

Triton allows you to write C-like kernels in Python.

  • L2 Cache Swizzling: Triton automatically handles the order in which blocks are processed so that data is re-used from the L2 cache as much as possible, reducing trips to HBM.

7.2 Memory-Efficient Tensors (FP8 & BF16)

The transition from BF16 to FP8 (on Hopper/Blackwell) doubles the peak throughput.

  • Block-wise Scaling: Using different scale factors for every $32 \times 32$ block of weights to preserve precision during 8-bit quantization.
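A NumPy sketch of per-block absmax scaling, quantizing to int8 as a stand-in for FP8 (which NumPy lacks). The block size follows the text; function names are illustrative:

```python
import numpy as np

def blockwise_quantize(w, block=32, bits=8):
    """Per-block absmax scaling: each (block x block) tile of W gets its own
    scale, so one outlier only degrades its own tile, not the whole tensor."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8
    n, m = w.shape
    q = np.empty_like(w, dtype=np.int8)
    scales = np.empty((n // block, m // block))
    for i in range(0, n, block):
        for j in range(0, m, block):
            tile = w[i:i+block, j:j+block]
            s = np.abs(tile).max() / qmax
            scales[i // block, j // block] = s
            q[i:i+block, j:j+block] = np.round(tile / s)
    return q, scales

def blockwise_dequantize(q, scales, block=32):
    """Expand each per-block scale back over its tile."""
    full = np.repeat(np.repeat(scales, block, axis=0), block, axis=1)
    return q.astype(np.float32) * full

rng = np.random.default_rng(4)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, s = blockwise_quantize(w)
err = np.abs(blockwise_dequantize(q, s) - w).max()
assert err < 0.05   # 8-bit storage, small per-element error
```

The scales add only (1/32)² extra storage per dimension while keeping the rounding error bounded by half a quantization step per tile.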

8. Case Study: Scaling to 1M Concurrent Users

Let’s look at how these 14 techniques combine in a real-world system designed for 1 million concurrent users.

  1. Orchestration: A global load balancer directs traffic to regional clusters of 8-GPU nodes.
  2. Model Layout: Within a node, we use Tensor Parallelism (TP=8) over NVLink to minimize latency for a Llama-3 70B model.
  3. Memory Management: PagedAttention is configured with a block size of 16. The “System Prompt” is cached across all requests, saving 40GB of VRAM per node.
  4. Inference Pass:
    • Prefill: Uses FlashAttention-3 to process the system prompt and history at 140 TeraFLOPs/s.
    • Decode: Swaps to a Persistent Decode Kernel.
    • Algorithmic Speedup: Speculative Decoding with a 1B Llama-distilled drafter achieves a token generation rate of 120 tokens/sec.
  5. Synchronization: CUDA Graphs are captured for 3 sequence-length buckets (128, 512, 1024), reducing the P99 latency by 15ms.

9. Summary Checklist: The Mastery Roadmap

If you are auditing an inference system for speed, check these 14 points:

  • Are attention kernels fused (FlashAttention-3)?
  • Is CUDA Graph Capture enabled for fixed-length buckets?
  • Are small successive layers vertically fused?
  • Is the kernel launch overhead masked by graphs or streams?
  • Is Tensor Parallelism restricted to NVLink domains?
  • Is Pipeline Parallelism utilizing optimal micro-batches?
  • Are the decode-step kernels persistent?
  • Is Speculative Decoding enabled for large models?
  • Are Sin/Cos RoPE factors pre-computed and absorbed?
  • Is the KV-cache managed via PagedAttention?
  • Are sampling kernels running on-GPU in a single pass?
  • Are all tensors memory-aligned to 128-byte boundaries?
  • Is kernel launch priority set to High for user requests?
  • Are you utilizing Triton for custom activation layers?

10. Conclusion: The Engineering of the Invisible

Speed in ML systems is rarely about a single "magic bullet." It is about a relentless focus on data movement and hardware utilization. When you master these 14 techniques, you aren’t just making a model faster; you are changing the economics of AI. You are moving from a world where your infrastructure dictates your capabilities to a world where your engineering dictates your competitive advantage.

The fastest model is the one that never waits for memory.


FAQ

What are fused attention kernels and why do they improve ML inference speed?

Fused kernels combine multiple operations like the QK matrix multiply, softmax normalization, and weighted sum into a single GPU kernel. This keeps intermediate results in fast SRAM (L1/Shared Memory) instead of writing them back to slow HBM global memory. Thread blocks use warp-level primitives for communication without hitting the L2 cache, dramatically reducing the memory wall bottleneck that dominates inference latency.

How do CUDA graphs speed up LLM inference?

CUDA graphs record the entire model computation as a static graph that the GPU executes without CPU involvement in the inner loop. In eager mode, each of the hundreds of operations per token requires a CPU command and GPU acknowledgment. With graphs, a single launch triggers the entire pre-recorded sequence. The tradeoff is that input tensor shapes must be static, requiring sequence length bucketization into pre-defined sizes.

What is the difference between FlashAttention versions 1, 2, and 3?

FlashAttention-1 introduced tiling and recomputation to avoid the $O(N^2)$ memory cost of attention. FlashAttention-2 optimized parallelism across sequence length and query heads with improved online softmax updates. FlashAttention-3 exploits NVIDIA Hopper hardware features like the Tensor Memory Accelerator for asynchronous data movement, overlapping producer and consumer warps to achieve up to 740 TFLOPs/s in BF16 and over 1.2 PFLOPs/s in FP8.

How does speculative decoding achieve faster LLM token generation?

A small 1B drafter model predicts several tokens quickly, then the large model verifies all of them in a single parallel forward pass. Since the cost of verifying 5 tokens is only marginally higher than generating 1 token (the model processes all drafted tokens as a batch), each accepted draft saves a full generation round-trip. This typically achieves 1.8x to 2.5x throughput improvement with no quality degradation.


Originally published at: arunbaby.com/ml-system-design/0066-ml-inference-optimization-techniques
