Code Archaeology Read the codebase like a book. Argue with every chapter.

vLLM — Project History

A chapter-based excavation of how vLLM grew from Woosuk Kwon’s solo PagedAttention prototype (codename CacheFlow) into the dominant open-source LLM inference engine.

Generated with the excavate skill on 2026-04-27, against the repo at commit 32e45636e (just after v0.20.0 was tagged).

First commit: e7d9d9c08, “Initial commit”, 2023-02-09 by Woosuk Kwon
Latest commit (tip): 32e45636e, “[torch.compile]: Disable Sequence Parallelism (#38373)”, 2026-04, on the v0.20.x line
Total non-merge commits: ~16,100 in ~3 years 2 months
Tagged releases: v0.1.0 (Jun 2023) → v0.20.0 (Apr 2026), plus a pre-paper submission tag (Apr 2023)
Issue tracker: GitHub issues + the [RFC]: label as the canonical design-doc surface

Top contributors (commits): Cyrus Leung 892, Woosuk Kwon 775, Michael Goin 509, youkaichao 472, Harry Mellor 470, Isotr0py 396, Nick Hill 351, Wentao Ye 276, Jee Jee Li 264, Roger Wang 217, Lucas Wilkinson 197, Simon Mo 188, Russell Bryant 177, Robert Shaw 176.

Activity curve:

  • Fewer than 100 commits/month through 2023.
  • Ramps up through 2024 (~200–400 commits/month).
  • Steady at 800–1000+ commits/month from mid-2025 onward.

Authoritative design docs: docs/design/arch_overview.md, paged_attention.md, torch_compile.md, attention_backends.md, prefix_caching.md, hybrid_kv_cache_manager.md, model_runner_v2.md, multiprocessing.md, metrics.md, dbo.md, fused_moe_modular_kernel.md, p2p_nccl_connector.md, mm_processing.md, optimization_levels.md, cuda_graphs.md, cuda_graphs_multimodal.md, fusions.md, plugin_system.md, …

Each chapter is a self-contained markdown file under chapters/. Read them in order, or jump to the era you care about. Chapters share a common shape:

  • Why: a one-sentence motivation for the chapter.
  • Timeline: anchor commits/PRs/issues with dates.
  • Architecture before & after: the shape of the system at the start of the chapter and at the end.
  • Key decisions: the trade-offs that defined the chapter, each with the commit/PR/issue link that committed the team to it.
  • Q&A: seed questions worth challenging (the interactive part of an excavation).

For deeply distilled “what I wish someone had told me” notes, see Architecture Insights.

#    Period             Releases             Title
010  2023-02 → 2023-06  submission → v0.1.0  The PagedAttention prototype
020  2023-07 → 2024-03  v0.1.x → v0.3.x      Open-source launch & early ecosystem
030  2024-04 → 2024-08  v0.4.x → v0.5.x      Production hardening: prefix caching, VLMs, FP8
040  2024-09 → 2024-12  v0.6.x               The performance push (2× throughput)
050  2024-12 → 2025-03  v0.7.0 → v0.8.0      V1 engine — the rewrite
060  2025-04 → 2025-10  v0.9.0 → v0.11.0     V0 sunset & distributed serving era
070  2025-11 → 2026-02  v0.12.0 → v0.16.0    Multimodal maturity & MoE refactor
080  2026-03 → 2026-04  v0.17.0 → v0.20.0    vLLM IR & the modern era

A few themes run through every chapter — keep them in mind as you read:

  1. The KV cache problem. Block manager → BlockManagerV2 → V1 KVCacheManager → hybrid KV cache manager → KV connectors → KV offloading. Every era of vLLM is, at some level, another iteration on “how do we manage the KV cache better.”
  2. CPU is the bottleneck on fast GPUs. This realisation is explicitly called out in the V1 RFC (#8779) and shapes nearly every architectural decision from 2025 onward (multi-process API server, stateful workers, async scheduling, removal of Python-side per-step bookkeeping).
  3. Hardware fan-out. NVIDIA-only → AMD ROCm (mid-2024) → Intel CPU/XPU → TPU → Apple Silicon → AWS Neuron → ARM CPU → IBM Z. The current_platform / attention_backends / pluggable layer abstractions are all consequences of this.
  4. torch.compile from “experimental” to “load-bearing”. First serious integration is RFC #6378 (mid-2024); by V1 it is on by default; by 2026 it is the substrate for fusion passes, kernel selection, and the new vLLM IR (#33825).
  5. Single process → multi-process. The shift from in-process Python loop (V0) to a ZMQ-coordinated mesh of API server / engine core / workers (V1) is the single biggest structural break in the project and is the lens through which to read everything from v0.7.0 onward — see docs/design/multiprocessing.md and docs/design/arch_overview.md.
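
To make theme 1 concrete, here is a toy sketch of the block-table idea behind paged KV-cache management: each sequence owns a list of fixed-size physical blocks, allocated on demand and returned to a shared pool when the sequence finishes. Everything in it (the `ToyBlockManager` class, its method names, the block size of 16 tokens) is hypothetical and chosen for illustration; it is not vLLM's block manager or KVCacheManager API.

```python
class ToyBlockManager:
    """Toy paged KV-cache allocator (hypothetical; not vLLM's real class)."""

    def __init__(self, num_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size            # tokens per block (assumed value)
        self.free = list(range(num_blocks))     # pool of free physical block ids
        self.tables = {}                        # seq_id -> list of block ids
        self.lengths = {}                       # seq_id -> tokens stored so far

    def append_tokens(self, seq_id: str, n: int) -> list:
        """Reserve room for n more tokens; return the sequence's block table."""
        table = self.tables.setdefault(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + n
        # Ceiling division: blocks needed to hold new_len tokens,
        # minus the blocks this sequence already owns.
        needed = -(-new_len // self.block_size) - len(table)
        if needed > len(self.free):
            raise MemoryError("out of KV-cache blocks")
        for _ in range(needed):
            table.append(self.free.pop())
        self.lengths[seq_id] = new_len
        return table

    def release(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the free pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    mgr = ToyBlockManager(num_blocks=4)
    print(len(mgr.append_tokens("a", 20)))  # 20 tokens span two 16-token blocks
    print(len(mgr.append_tokens("a", 10)))  # 30 tokens still fit in two blocks
    mgr.release("a")
    print(len(mgr.free))                    # all four blocks are free again
```

The point of the sketch is that memory is wasted only inside a sequence's last, partially filled block. The later designs named in theme 1 add more on top of this kind of bookkeeping, which the individual chapters cover.
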

  • New to vLLM? Start at Chapter 010 and read forward.
  • Joining the team in 2026? Chapters 050, 060, and 080 explain ~80% of the code you will touch on day one.
  • Trying to land a kernel / backend? See Chapter 040 (FlashInfer / async output) and Chapter 080 (vLLM IR, attention backend abstraction).
  • Debugging a serving deployment? Chapter 060 (KV connectors, NIXL, disaggregation) is the load-bearing one.