Flash-MoE on MacBook: running 397B parameters on consumer hardware
The first thing you notice when Flash-MoE loads Qwen3.5-397B is that it works. No caveats about reduced functionality. No warning to expect terrible throughput. You type a prompt and the model responds.
TL;DR
Flash-MoE streams Qwen3.5-397B-A17B from SSD, hitting 4.36 tok/s at 4-bit quantization on a MacBook Pro M3 Max 48GB — no Python, no framework overhead. The trick is MoE sparse activation: only 17B of 397B parameters fire per token, so you never need the full model in RAM. Use it for batch jobs and privacy-sensitive work; stick with API for real-time interaction. See model serving fundamentals for the production deployment context.

The conventional wisdom on running frontier-scale models locally goes like this: you need a server rack, serious VRAM, and probably a cloud bill anyway. Flash-MoE, built by Dan Woods (VP of AI Platforms at CVS Health) in roughly 24 hours using Claude Code, does something that violates that assumption. It runs a 397 billion parameter model on a consumer laptop in roughly 7,000 lines of C and 1,200 lines of Metal GPU shaders.
The project scored 393 points and 121 comments on Hacker News in March 2026 — a genuine community moment, not marketing. Understanding why it works tells you more about MoE architecture than a hundred architecture diagrams. And understanding where it breaks tells you whether it belongs in your workflow.
Why MoE models are uniquely suited to consumer hardware
Qwen3.5-397B-A17B activates exactly 17 billion parameters per forward pass, not 397 billion. The 397B figure describes total capacity, not per-token compute. This distinction is everything. A standard dense 70B model activates all 70B parameters for every token; Qwen3.5's 397B model activates only the 17B its router selects.
The architecture achieves this through expert routing. Each transformer layer holds 512 expert modules. A learned router selects a small top-K subset of them for each token (the model defaults to K=11; Flash-MoE runs at K=4, covered below). At K=4, the remaining 508 experts sit idle — present in the model weights, contributing nothing to that token's computation. Run this across 60 transformer layers and you get a model that is simultaneously enormous (in stored capacity) and lean (in per-token compute).
Flash-MoE exploits this asymmetry directly. Rather than loading the full 209GB model into RAM, it streams only the 4 active experts per layer from NVMe SSD. Apple's M3 Max SSD reads at 17.5 GB/s; each expert module weighs roughly 6.75MB, so loading 4 experts (~27MB) takes about 1.5 milliseconds. The remaining 508 experts per layer never touch memory. The engine itself occupies roughly 5.5GB of RAM during inference, leaving the rest for the operating system page cache — which naturally achieves about a 71% expert hit rate without any custom caching logic.
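That arithmetic is the whole feasibility argument, so it is worth making explicit. A back-of-envelope sketch in C, using only the figures quoted above (the assumption that cache hits cost nothing is a simplification, not a measurement):

#include <stdio.h>

/* Back-of-envelope SSD streaming budget per token. */
int main(void) {
    const double ssd_gbps        = 17.5;  /* M3 Max sequential read, GB/s */
    const double expert_mb       = 6.75;  /* one 4-bit expert module, MB  */
    const int    experts_per_tok = 4;     /* K=4 routing                  */
    const int    layers          = 60;    /* transformer layers           */
    const double hit_rate        = 0.71;  /* page-cache expert hit rate   */

    /* MB divided by GB/s conveniently yields milliseconds. */
    double ms_per_layer = experts_per_tok * expert_mb / ssd_gbps;
    /* Only cache misses actually touch the SSD. */
    double ms_per_token = ms_per_layer * layers * (1.0 - hit_rate);

    printf("per-layer read: %.2f ms\n", ms_per_layer); /* ~1.54 ms */
    printf("SSD per token:  %.1f ms\n", ms_per_token); /* ~27 ms   */
    return 0;
}

At 4.36 tok/s the per-token budget is about 229ms, so with the page cache doing its job, SSD reads consume roughly an eighth of it.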
For comparison, running a 70B dense model at comparable quality requires fitting the entire weight matrix in memory or tolerating heavy CPU-GPU swap. MoE gives you the quality ceiling of a 397B model with the active-compute footprint of a 17B model. That asymmetry is what makes Flash-MoE structurally feasible, not a clever hack.
MoE routing per token (K=4 from 512 experts):
Token input
│
▼
Router network
│
├─── Expert 7 ◄── activated
├─── Expert 83 ◄── activated
├─── Expert 204 ◄── activated
├─── Expert 391 ◄── activated
│
└─── Expert 1..512 (remaining 508: skipped)
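In code, the router is nothing exotic: a score per expert, a top-K selection, a softmax over the survivors. A minimal C sketch — illustrative, not Flash-MoE's actual routing path:

#include <math.h>

#define N_EXPERTS 512
#define TOP_K     4

/* Pick the TOP_K highest-scoring experts for one token and turn
 * their logits into mixture weights. Greedy selection is fine for
 * K=4 out of 512. */
static void route_token(const float logits[N_EXPERTS],
                        int idx[TOP_K], float weight[TOP_K]) {
    char taken[N_EXPERTS] = {0};
    for (int k = 0; k < TOP_K; k++) {
        int best = -1;
        for (int e = 0; e < N_EXPERTS; e++)
            if (!taken[e] && (best < 0 || logits[e] > logits[best]))
                best = e;
        taken[best] = 1;
        idx[k] = best;
    }
    /* Softmax over the selected logits only; weights sum to 1. */
    float sum = 0.0f;
    for (int k = 0; k < TOP_K; k++) {
        weight[k] = expf(logits[idx[k]] - logits[idx[0]]);
        sum += weight[k];
    }
    for (int k = 0; k < TOP_K; k++)
        weight[k] /= sum;
}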
What Flash-MoE actually does differently
Flash-MoE bypasses every assumption baked into Python inference stacks — no PyTorch, no transformers library, no CUDA abstractions. It is roughly 7,000 lines of C and Objective-C plus 1,200 lines of hand-tuned Metal GPU shaders. The entire serving path runs on Apple Silicon without a Python interpreter in sight.
Three design decisions define the engine. First: SSD expert streaming. Rather than building a custom expert cache, the engine trusts macOS’s unified memory page cache. With 42GB of the laptop’s 48GB available after the engine loads, the OS page cache holds frequently accessed experts naturally. The 71% cache hit rate emerges without explicit management.
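One plausible way to get that behavior is to mmap the weight file and let the kernel manage residency. Whether Flash-MoE uses mmap or explicit reads isn't something I'm claiming here — treat this as an illustration of the "no custom cache" idea:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the weight file without reading it. Pages fault in on first
 * touch and stay resident in the OS page cache, so hot experts get
 * cached with zero eviction logic of our own. */
static const unsigned char *map_weights(const char *path, size_t *len) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return NULL; }
    *len = (size_t)st.st_size;
    void *base = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping outlives the descriptor */
    return base == MAP_FAILED ? NULL : (const unsigned char *)base;
}

/* An "expert load" is then just pointer arithmetic: reading through
 * the mapping either hits cached pages (~71% of the time, per the
 * figures above) or faults them in from SSD at NVMe speed. */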
Second: FMA-optimized dequantization kernels. The 4-bit quantized weights must be dequantized before Metal can compute on them. Woods wrote custom FMA (fused multiply-add) kernels for this step that run 12% faster than naive dequantization — a measurable gain at 4.36 tok/s where every millisecond shows up in throughput.
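Conceptually, the kernel unpacks two 4-bit values per byte and applies w = q·scale + bias, where the multiply-add fuses into a single FMA instruction. A scalar C analogue of the idea (the real kernels are Metal shaders; the 32-weight block layout and nibble order here are assumptions):

#include <math.h>
#include <stdint.h>

#define QBLOCK 32  /* assumed quantization block size */

/* Dequantize one block of 4-bit weights: w = q * scale + bias.
 * fmaf compiles to a single fused multiply-add -- the same
 * instruction-level trick the Metal kernels lean on. Two weights
 * are packed per byte, low nibble first (also an assumption). */
static void dequant_block(const uint8_t packed[QBLOCK / 2],
                          float scale, float bias, float out[QBLOCK]) {
    for (int i = 0; i < QBLOCK / 2; i++) {
        uint8_t b = packed[i];
        out[2 * i]     = fmaf((float)(b & 0x0F), scale, bias);
        out[2 * i + 1] = fmaf((float)(b >> 4),  scale, bias);
    }
}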
Third: serial GPU→SSD→GPU pipeline. Apple Silicon has a unified memory controller. Naive concurrent reads and writes contend on it, creating a bottleneck that Woods documented across 58 ablation experiments. The serial pipeline — compute on GPU, stream from SSD, compute again — respects the hardware’s constraints instead of fighting them.
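The per-token control flow, schematically. Function names here are placeholders standing in for the engine's real routines, whose signatures I'm not claiming to know:

/* Placeholder declarations for the engine's real routines. */
void route_token_for_layer(int layer, int idx[4], float w[4]);
void load_experts(int layer, const int idx[4]);       /* SSD -> RAM */
void gpu_expert_ffn(int layer, const int idx[4],
                    const float w[4], float *hidden); /* Metal call */

/* One decode step: strictly serial per layer, so SSD streaming and
 * GPU compute never contend on the unified memory controller. */
void decode_token(int n_layers, float *hidden) {
    int   idx[4];
    float w[4];
    for (int layer = 0; layer < n_layers; layer++) {
        route_token_for_layer(layer, idx, w);  /* pick K=4 experts      */
        load_experts(layer, idx);              /* stream, or cache hit  */
        gpu_expert_ffn(layer, idx, w, hidden); /* compute, then repeat  */
    }
}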
The model structure itself is unusual: 60 transformer layers split between 45 GatedDeltaNet layers and 15 full attention layers. GatedDeltaNet is a state-space variant that processes faster than full attention while maintaining quality at longer contexts. Qwen3.5-397B-A17B supports a 262,144 token context natively.
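The 45:15 split reduces to a 3:1 ratio, so a natural layer schedule — and this interleaving is my assumption, not something the source documents — places one full attention layer after every three GatedDeltaNet layers:

/* Assumed 3:1 interleaving: 45 GatedDeltaNet + 15 attention = 60. */
enum layer_kind { GATED_DELTA_NET, FULL_ATTENTION };

static enum layer_kind layer_kind_at(int layer) { /* layer in [0, 60) */
    return (layer % 4 == 3) ? FULL_ATTENTION : GATED_DELTA_NET;
}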
The real performance profile: tokens/sec, quality, memory pressure
At 4-bit quantization, Flash-MoE achieves 4.36 tokens/second on an M3 Max MacBook Pro — with full tool calling support and production-quality output. That is the number to hold in your head. Everything else is context.
┌─────────────────────────────────────────────────────────┐
│ Flash-MoE performance profile (M3 Max 48GB) │
├────────────────┬──────────┬──────────┬──────────────────┤
│ Mode │ Tok/sec │ Disk │ Tool calling │
├────────────────┼──────────┼──────────┼──────────────────┤
│ 4-bit (prod) │ 4.36 │ 209 GB │ Yes │
│ 4-bit (warm) │ 4.80 │ 209 GB │ Yes │
│ 2-bit (exp) │ 5.74 │ 120 GB │ No (malformed) │
│ Peak single │ 7.05 │ 209 GB │ — │
├────────────────┼──────────┼──────────┼──────────────────┤
│ Cloud API │ 20-30 │ — │ Yes │
└────────────────┴──────────┴──────────┴──────────────────┘
The 2-bit mode is where it gets interesting. A 30% speed gain sounds compelling — 5.74 tok/s versus 4.36 tok/s — until you discover it breaks tool calling. The 2-bit quantization reduces expert weights enough that JSON generation becomes unreliable, producing malformed output that fails tool parsers. For anything touching agentic workflows, 2-bit is not usable. For pure text generation tasks — summarization, analysis, drafting — it may be worth evaluating.
K=4 expert routing (reduced from the model’s default K=11) is the other tuning knob. Woods documented that K=3 causes immediate quality collapse. At K=4 there is no measurable quality degradation versus K=11. This matters: it means you get the speed benefit of loading fewer experts without trading accuracy.
Memory pressure under load is lighter than you’d expect. The engine occupies about 5.5GB of RAM. The remaining 42GB is page cache for expert modules, with the OS naturally warming the cache as requests come in. Cold-start latency for the first token is measurably higher than subsequent tokens — the first few experts load from a cold SSD. After a few prompts, common experts stay cached and throughput stabilizes at the 4.36 tok/s figure.
Quality versus Qwen3-72B: measurably better. This is expected — Qwen3.5-397B-A17B has a substantially larger expert capacity. The trade-off is throughput, not capability.
When local makes sense vs. API
The decision is not “is 4.36 tok/s fast enough?” It is “what does your workload actually need?” These are different questions.
Local inference wins in three scenarios. The first is privacy. Some data genuinely cannot leave the machine — legal documents, medical notes, internal codebases. At 4.36 tok/s the engine produces about 15,700 generated tokens an hour — slow for a chat window, but for an overnight batch job on sensitive data, completely viable.
The second is batch throughput cost. Cloud API pricing for Qwen3.5-397B runs on the order of dollars per million tokens (OpenRouter lists providers at varying rates). If you are running millions of tokens per day for analysis, summarization, or evaluation tasks, the math on local inference changes. The hardware cost is fixed; the per-token cost approaches zero.
The third is latency profile mismatch. This sounds counterintuitive — 4.36 tok/s is slow. But API round-trips for short prompts can add 200-800ms of network latency on top of generation time. For prompts that generate only a few tokens (a classification label, a yes/no judgment), local inference can return before a cloud API finishes its network handshake.
This sounds great until you hit the interactive use case. A human reading at normal pace can absorb about 4-5 words per second — which maps roughly to 5-6 tokens per second. At 4.36 tok/s, Flash-MoE runs just below comfortable reading speed. For a conversation where you are watching tokens stream, the experience feels slightly slow. Cloud APIs at 20-30 tok/s feel instantaneous by comparison.
flowchart TD
A[New inference workload] --> B{Data leave device?}
B -->|Cannot| C[Local: Flash-MoE]
B -->|Can| D{Latency requirement?}
D -->|Real-time interactive| E[API]
D -->|Batch / background| F{Volume?}
F -->|High volume daily| G[Local: cost wins]
F -->|Low / unpredictable| H[API: simpler ops]
The operational cost of Flash-MoE is also real. It requires Apple Silicon with 48GB unified memory, a 209GB model download, Metal development tooling to build from source, and comfort with C compilation. This is not pip install territory. For teams already running Python inference stacks, the switching cost is non-trivial. For individual researchers or privacy-first workflows, it is a one-time setup.
Agent workflows sit in a middle ground. Flash-MoE supports tool calling at 4-bit — the agent deployment patterns you’d use are the same. But agents that call tools repeatedly and wait on responses are sensitive to per-step latency. A multi-step agent loop at 4.36 tok/s per step adds up. For agents running batch research overnight, this is irrelevant. For agents embedded in interactive products, it is a constraint.
The cost management framework for agents applies directly here: model routing based on task type. Use Flash-MoE locally for heavy analytical steps; route to API for interactive, latency-sensitive steps. A hybrid approach is the honest answer for most workloads.
What to watch next
Flash-MoE is one person’s 24-hour project. The ablation experiments (58 of them, documented in the repo) show how much headroom remains. The current serial GPU→SSD→GPU pipeline leaves expert prefetching unexplored. An overlapped pipeline — streaming the next layer’s experts while the GPU computes the current layer — could eliminate SSD latency from the critical path entirely.
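Schematically, the overlap would look like the sketch below — with the caveat that layer N+1's routing depends on layer N's output, so a real implementation has to speculate about which experts to prefetch. Placeholder names follow the serial sketch above; guess_experts is the hypothetical speculative part:

#include <pthread.h>

/* Placeholders, as in the serial sketch earlier. */
void route_token_for_layer(int layer, int idx[4], float w[4]);
void load_experts(int layer, const int idx[4]);
void gpu_expert_ffn(int layer, const int idx[4],
                    const float w[4], float *hidden);
/* Hypothetical: predict layer+1's routing from the pre-layer state. */
void guess_experts(int layer, const float *hidden, int idx[4]);

struct job { int layer; int idx[4]; };

static void *prefetch(void *arg) {
    struct job *j = arg;
    load_experts(j->layer, j->idx);  /* SSD read off the critical path */
    return NULL;
}

void decode_token_overlapped(int n_layers, float *hidden) {
    int idx[4]; float w[4];
    for (int layer = 0; layer < n_layers; layer++) {
        route_token_for_layer(layer, idx, w);
        load_experts(layer, idx);     /* cheap if the speculation hit  */
        pthread_t t; struct job j; int started = 0;
        if (layer + 1 < n_layers) {
            j.layer = layer + 1;
            guess_experts(layer + 1, hidden, j.idx);  /* speculative   */
            pthread_create(&t, NULL, prefetch, &j);
            started = 1;
        }
        gpu_expert_ffn(layer, idx, w, hidden);        /* overlaps read */
        if (started) pthread_join(t, NULL);
    }
}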
Apple’s M4 and M5 chips bring higher memory bandwidth and faster SSD read speeds. Flash-MoE’s architecture benefits directly from both without code changes.
The deeper implication is architectural. Qwen3.5-397B-A17B is not a special case — it is Alibaba’s bet that MoE is the right scaling architecture for open weights. If Alibaba, Mistral, and others continue releasing frontier-class MoE models, the class of models runnable on consumer hardware via SSD streaming grows. Dense models are stuck at the memory wall. MoE models have a structural escape route.
FAQ
What is Flash-MoE and how does it work?
Flash-MoE is a pure C and Metal inference engine built by Dan Woods (VP of AI Platforms at CVS Health) that runs Qwen3.5-397B-A17B on a MacBook Pro with 48GB unified memory. It streams only the 4 active expert modules per layer directly from NVMe SSD at 17.5 GB/s rather than loading the 209GB model into RAM. The engine uses roughly 5.5GB of memory during inference, with the rest serving as page cache.
How fast is Flash-MoE on a MacBook Pro M3 Max?
At 4-bit quantization — the production-quality setting — Flash-MoE achieves 4.36 tokens/second with full tool calling support. A 2-bit experimental mode hits 5.74 tok/s (about 30% faster) but breaks tool calling by generating malformed JSON. Cloud APIs for the same model typically run 20-30 tok/s. Peak single-token throughput reaches 7.05 tok/s in ideal conditions.
Why can a 397B parameter model fit in 48GB of RAM?
Because Qwen3.5-397B-A17B is a Mixture-of-Experts model that activates only 17B parameters per token — not all 397B. Flash-MoE takes this further by streaming just the 4 needed expert modules from SSD on-the-fly, keeping the memory footprint at roughly 5.5GB. The 209GB model lives on disk, not in RAM.
When does running Flash-MoE locally make more sense than using an API?
Local inference wins when you process sensitive data that can’t leave your machine, when batch workloads run overnight and throughput matters more than per-request speed, or when API costs accumulate on high-volume tasks. API wins for real-time interactive work where 4.36 tok/s feels sluggish.
Does Flash-MoE work on any Mac, or only specific models?
Flash-MoE currently targets Apple Silicon exclusively — the kernels are Metal shaders, and the performance numbers assume the M3 Max's 17.5 GB/s SSD reads. The 48GB unified memory configuration is the minimum tested. The technique is also MoE-specific: it does not apply to dense models like Llama, which have no expert routing to exploit.