vLLM — Project History
A chapter-based excavation of how vLLM grew from Woosuk Kwon’s solo PagedAttention prototype (codename CacheFlow) into the dominant open-source LLM inference engine.
Generated with the excavate skill on 2026-04-27, against the repo at commit
32e45636e (just after v0.20.0 was tagged).
Repo at a glance
| | |
|---|---|
| First commit | e7d9d9c08 — Initial commit, 2023-02-09 by Woosuk Kwon |
| Latest commit (tip) | 32e45636e — [torch.compile]: Disable Sequence Parallelism (#38373), 2026-04, on the v0.20.x line |
| Total non-merge commits | ~16,100 in ~3 years 2 months |
| Tagged releases | v0.1.0 (Jun 2023) → v0.20.0 (Apr 2026), plus a pre-paper submission tag (Apr 2023) |
| Issue tracker | GitHub issues + the [RFC]: label as the canonical design-doc surface |
Top contributors (commits): Cyrus Leung 892, Woosuk Kwon 775, Michael Goin 509, youkaichao 472, Harry Mellor 470, Isotr0py 396, Nick Hill 351, Wentao Ye 276, Jee Jee Li 264, Roger Wang 217, Lucas Wilkinson 197, Simon Mo 188, Russell Bryant 177, Robert Shaw 176.
Activity curve:
- <100 commits/month through 2023.
- Ramps through 2024 (~200–400/mo).
- Steady 800–1000+ commits/month from mid-2025 onward.
Authoritative design docs: docs/design/ — arch_overview.md, paged_attention.md, torch_compile.md, attention_backends.md, prefix_caching.md, hybrid_kv_cache_manager.md, model_runner_v2.md, multiprocessing.md, metrics.md, dbo.md, fused_moe_modular_kernel.md, p2p_nccl_connector.md, mm_processing.md, optimization_levels.md, cuda_graphs.md, cuda_graphs_multimodal.md, fusions.md, plugin_system.md, …
How to read this book
Each chapter is a self-contained markdown file under chapters/. Read them
in order, or jump to the era you care about. Chapters share a common shape:
- Why: a one-sentence motivation for the chapter.
- Timeline: anchor commits/PRs/issues with dates.
- Architecture before & after: the shape of the system at the start of the chapter and its shape at the end.
- Key decisions: the trade-offs that defined the chapter, with the commit/PR/issue link that committed the team to them.
- Q&A: seed questions worth challenging (the interactive part of an excavation).
For deeply-distilled “what I wish someone had told me” notes, see Architecture Insights.
Chapter map
| # | Period | Releases | Title |
|---|---|---|---|
| 010 | 2023-02 → 2023-06 | submission → v0.1.0 | The PagedAttention prototype |
| 020 | 2023-07 → 2024-03 | v0.1.x → v0.3.x | Open-source launch & early ecosystem |
| 030 | 2024-04 → 2024-08 | v0.4.x → v0.5.x | Production hardening: prefix caching, VLMs, FP8 |
| 040 | 2024-09 → 2024-12 | v0.6.x | The performance push (2× throughput) |
| 050 | 2024-12 → 2025-03 | v0.7.0 → v0.8.0 | V1 engine — the rewrite |
| 060 | 2025-04 → 2025-10 | v0.9.0 → v0.11.0 | V0 sunset & distributed serving era |
| 070 | 2025-11 → 2026-02 | v0.12.0 → v0.16.0 | Multimodal maturity & MoE refactor |
| 080 | 2026-03 → 2026-04 | v0.17.0 → v0.20.0 | vLLM IR & the modern era |
Cross-cutting threads worth tracking
A few themes run through every chapter; keep them in mind as you read:
- The KV cache problem. Block manager → BlockManagerV2 → V1 KVCacheManager → hybrid KV cache manager → KV connectors → KV offloading. Every era of vLLM is, at some level, another iteration on “how do we manage the KV cache better.”
- CPU is the bottleneck on fast GPUs. This realisation is explicitly called out in the V1 RFC (#8779) and shapes nearly every architectural decision from 2025 onward (multi-process API server, stateful workers, async scheduling, removal of Python-side per-step bookkeeping).
- Hardware fan-out. NVIDIA-only → AMD ROCm (mid-2024) → Intel CPU/XPU → TPU → Apple Silicon → AWS Neuron → ARM CPU → IBM Z. The current_platform/attention_backends/ pluggable layer abstractions are all consequences of this.
- torch.compile from “experimental” to “load-bearing”. First serious integration is RFC #6378 (mid-2024); by V1 it is on by default; by 2026 it is the substrate for fusion passes, kernel selection, and the new vLLM IR (#33825).
- Single process → multi-process. The shift from in-process Python loop (V0) to a ZMQ-coordinated mesh of API server / engine core / workers (V1) is the single biggest structural break in the project and is the lens through which to read everything from v0.7.0 onward; see docs/design/multiprocessing.md and docs/design/arch_overview.md.
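For readers new to the KV cache thread, the block-manager idea it starts from can be sketched in a few lines. This is a toy illustration under stated assumptions, not vLLM's code: `ToyBlockManager`, `append_tokens`, and `free` are hypothetical names, and the real managers named in that thread also handle concerns such as prefix sharing and offloading that are omitted here. It shows only the core bookkeeping: a sequence's cache is a list of fixed-size physical blocks, allocated on demand and returned to a free pool when the sequence finishes.

```python
class ToyBlockManager:
    """Hypothetical toy allocator; names and behavior are illustrative only."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size                 # tokens stored per block
        self.free_blocks = list(range(num_blocks))   # physical block ids not in use
        self.block_tables = {}                       # seq_id -> list of physical block ids
        self.seq_lens = {}                           # seq_id -> tokens cached so far

    def append_tokens(self, seq_id: str, num_tokens: int) -> list:
        """Grow a sequence, allocating blocks only when existing ones cannot hold the new tokens."""
        table = self.block_tables.get(seq_id, [])
        new_len = self.seq_lens.get(seq_id, 0) + num_tokens
        total_blocks = -(-new_len // self.block_size)  # ceiling division
        needed = total_blocks - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("no free KV cache blocks")
        table = table + [self.free_blocks.pop() for _ in range(needed)]
        self.block_tables[seq_id] = table
        self.seq_lens[seq_id] = new_len
        return table

    def free(self, seq_id: str) -> None:
        """Return all of a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


mgr = ToyBlockManager(num_blocks=4, block_size=16)
mgr.append_tokens("a", 20)   # 20 tokens need 2 blocks
mgr.append_tokens("a", 10)   # 30 tokens still fit in those 2 blocks
mgr.free("a")                # all 4 blocks are free again
```

Because blocks are fixed-size and need not be contiguous, memory is wasted only in the last partially filled block of each sequence, which is the property the later cache-manager iterations build on.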
What to read first
- New to vLLM? Start at Chapter 010 and read forward.
- Joining the team in 2026? Chapters 050, 060, and 080 explain ~80% of the code you will touch on day one.
- Trying to land a kernel / backend? Chapter 040 (FlashInfer / async output) and Chapter 080 (vLLM IR, attention backend abstraction).
- Debugging a serving deployment? Chapter 060 (KV connectors, NIXL, disaggregation) is the load-bearing one.