Code Archaeology Read the codebase like a book. Argue with every chapter.

Chapter 050: V1 Engine — the rewrite (Dec 2024 – Mar 2025)

Releases: v0.7.0 (Jan 27 2025) → v0.8.0 (Mar 19 2025)

Anchor RFC: #8779, vLLM’s V1 Engine Architecture (the canonical V1 design doc on GitHub)

Why: vLLM’s CPU-side overhead (Python scheduling, input broadcasting, sampler pythonization, detokenization) had become the dominant cost on H100-class hardware. The V1 rewrite is the team’s deliberate, RFC-driven reset of the engine architecture: stateful workers, single-step async scheduling, multi-process API server, prefix caching as a first-class feature, and torch.compile on by default.

| Date | Anchor | What happened |
| --- | --- | --- |
| 2024-09 (issue created) | #8779 | RFC posted by the maintainers (@WoosukKwon, @zhuohan123, @youkaichao, @simon-mo, @LiuXiaoxuanPKU, @comaniac, @alexm-neuralmagic, @njhill, @robertgshaw2-neuralmagic, @rkooo567). |
| 2024-10-22 | #9289 | First V1 commit: vllm/v1/ directory created. |
| 2024-11–2024-12 | many | Incremental V1 PRs (#9826, #10135, #10288, #10211, #10225, #10228, #10268, #9954, #10272, …). |
| 2025-01-27 | v0.7.0 | V1 alpha (VLLM_USE_V1=1). 44 V1 commits in this release. torch.compile is now fully integrated and enabled by default in V1 (#11614). LLM.sleep, LLM.wake_up, LLM.collective_rpc for RLHF (#12361, #12084, #12284). |
| 2025-02 | v0.7.x | DeepSeek-V3 / R1 era. MLA on V1 (#13789, #14253, #14384, #14540, #14921). FlashMLA (#13747). |
| 2025-03-19 | v0.8.0 | V1 enabled by default (#13726). EP for DeepSeek (#12583), DP attention (#13931), MTP (#13626). Pluggable scheduler (#14466). |

V0: everything in one Python process

┌──────────────────────────────────────────────────┐
│  Python process                                  │
│  ├─ AsyncLLMEngine (busy loop)                   │
│  ├─ LLMEngine                                    │
│  │   ├─ Scheduler (Python, per-step, on hot path)│
│  │   ├─ BlockManager                             │
│  │   └─ Executor                                 │
│  │       └─ Worker(s) (also in this process,     │
│  │          or remote Ray actors)                │
│  └─ OpenAI server (FastAPI in same loop)         │
└──────────────────────────────────────────────────┘

V1: separate processes connected by ZMQ

┌────────────────────┐  ZMQ   ┌────────────────────┐  ZMQ   ┌──────────────────┐
│ API server proc    │◄──────►│ Engine core proc   │◄──────►│ Worker proc(s)   │
│ (FastAPI, OpenAI)  │        │ - Scheduler        │        │ - ModelRunner    │
│ - request decode   │        │ - KVCacheManager   │        │ - stateful       │
│ - tokenization     │        │ - busy loop        │        │   request state  │
│ - mm preprocess    │        │ - async output     │        │ - executes fwd   │
│ - SSE/HTTP streams │        │   processing       │        │ - returns        │
│                    │        │ - 1 per DP rank    │        │   sampled tok    │
└────────────────────┘        └────────────────────┘        └──────────────────┘
  1 (or N for DP)               1 per data-parallel           TP × PP per
                                rank                          engine core

The full picture is in docs/design/arch_overview.md and docs/design/multiprocessing.md.

1. Stateful workers, “send only the diffs”

The single biggest break with V0. In V0 the driver re-sent every request’s full state (block table, sampling params, etc.) to every worker every step. In V1, the worker holds the request state across steps. The driver sends:

  • New requests: full state.
  • In-flight requests: just {request_id, new_block_ids, num_new_tokens}.

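The per-step broadcast shrinks from every request’s full state to a few fields per in-flight request, and the saving grows with TP size because the driver sends that payload to every worker. This is also the entire reason async scheduling becomes possible: input preparation no longer sits on the critical path.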
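The new-request versus in-flight-request split can be sketched in a few lines of Python. This is a toy model, not vLLM's actual data structures: `NewRequest`, `RequestDiff`, `CachedRequest`, and `ToyWorker` are hypothetical names, and only the three diff fields (`request_id`, `new_block_ids`, `num_new_tokens`) come from the text above.

```python
from dataclasses import dataclass


@dataclass
class NewRequest:
    """Full state, sent once when a request is first scheduled (hypothetical)."""
    request_id: str
    prompt_token_ids: list[int]
    block_ids: list[int]
    sampling_params: dict


@dataclass
class RequestDiff:
    """Per-step delta for a request the worker already holds (hypothetical)."""
    request_id: str
    new_block_ids: list[int]
    num_new_tokens: int


@dataclass
class CachedRequest:
    """Worker-side state that persists across steps."""
    block_ids: list[int]
    sampling_params: dict
    num_tokens: int


class ToyWorker:
    def __init__(self) -> None:
        self.requests: dict[str, CachedRequest] = {}

    def apply(self, new: list[NewRequest], diffs: list[RequestDiff]) -> None:
        # New requests arrive with everything the worker needs.
        for req in new:
            self.requests[req.request_id] = CachedRequest(
                block_ids=list(req.block_ids),
                sampling_params=req.sampling_params,
                num_tokens=len(req.prompt_token_ids),
            )
        # In-flight requests only extend the state the worker already has.
        for diff in diffs:
            state = self.requests[diff.request_id]
            state.block_ids.extend(diff.new_block_ids)
            state.num_tokens += diff.num_new_tokens


worker = ToyWorker()
worker.apply([NewRequest("req-0", [1, 2, 3, 4], [0], {"temperature": 0.7})], [])
worker.apply([], [RequestDiff("req-0", new_block_ids=[], num_new_tokens=1)])
worker.apply([], [RequestDiff("req-0", new_block_ids=[7], num_new_tokens=1)])
```

After the first step, each message for `req-0` carries one small record regardless of how long the prompt or block table has grown, which is the property the section describes.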

2. Async single-step scheduling, not multi-step

V1 explicitly rejects v0.6’s multi-step scheduler. Instead, while the worker runs step N, the scheduler is already producing inputs for step N+1. This requires that (1) scheduling fits within the duration of one GPU step, which drives the move toward incrementally built inputs, and (2) detokenization also moves to the driver (driver-side async detok). RFC: #20727.
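The overlap of step N's forward pass with step N+1's scheduling can be illustrated with a single background thread standing in for the GPU. This is a minimal sketch under stated assumptions: `run_forward` and `schedule` are hypothetical stand-ins that only sleep, and the real engine coordinates separate processes rather than threads.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_forward(step: int) -> int:
    """Stand-in for the GPU forward pass of one step (hypothetical)."""
    time.sleep(0.02)
    return step  # pretend this is the sampled token


def schedule(step: int) -> dict:
    """Stand-in for CPU-side scheduling and input preparation (hypothetical)."""
    time.sleep(0.01)
    return {"step": step}


def serial(num_steps: int) -> list[int]:
    # Schedule, then run, then schedule again: CPU and "GPU" never overlap.
    out = []
    for step in range(num_steps):
        batch = schedule(step)
        out.append(run_forward(batch["step"]))
    return out


def overlapped(num_steps: int) -> list[int]:
    # While step N runs on the "GPU" thread, the main thread prepares step N+1.
    out = []
    with ThreadPoolExecutor(max_workers=1) as gpu:
        batch = schedule(0)
        for step in range(num_steps):
            future = gpu.submit(run_forward, batch["step"])
            if step + 1 < num_steps:
                batch = schedule(step + 1)
            out.append(future.result())
    return out
```

Both loops produce the same outputs. The overlapped loop hides scheduling time only while `schedule` is shorter than `run_forward`, which is requirement (1) above in miniature.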

3. No more SequenceGroup: one request, one sequence

SequenceGroup was a core V0 abstraction that supported beam search and parallel sampling. V1 makes each request a single sequence and emulates beam search via prefix caching (parallel sampling = N requests sharing a cached prefix). This is the bet that enables a much simpler scheduler hot path and is the reason “drop beam search” is in the V1 RFC top-line goals.
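The "parallel sampling = N requests" idea amounts to a fan-out step in front of the scheduler. The sketch below is illustrative only: `SingleSeqRequest`, `fan_out`, the child ID format, and the per-child seed are assumptions, not vLLM's implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SingleSeqRequest:
    """One request, one sequence (hypothetical shape)."""
    request_id: str
    prompt_token_ids: tuple[int, ...]
    seed: int


def fan_out(parent_id: str, prompt_token_ids: list[int], n: int,
            base_seed: int = 0) -> list[SingleSeqRequest]:
    """Expand one n-way sampling request into n single-sequence requests."""
    prompt = tuple(prompt_token_ids)
    # Every child carries the same prompt, so a prefix cache can serve the
    # shared tokens once; a distinct seed lets the samples diverge.
    return [
        SingleSeqRequest(f"{parent_id}-{i}", prompt, base_seed + i)
        for i in range(n)
    ]


children = fan_out("req-7", [1, 2, 3], n=3)
```

The scheduler then only ever sees independent single-sequence requests, and the sharing that a sequence group used to track explicitly falls out of the cache.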

4. Prefix caching on by default

In V0 prefix caching was opt-in (--enable-prefix-caching). In V1 it is a core scheduler assumption: every request is looked up in the prefix cache, and a cache miss is the slow path. This drives the design of the new KVCacheManager (see docs/design/prefix_caching.md).
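One common way to build such a cache is to hash each full block of tokens together with the hash of the block before it, so that a matching hash implies a matching prefix. The sketch below follows that idea under stated assumptions: `ToyPrefixCache`, `block_hashes`, the block size of 4, and the absence of eviction are all simplifications for illustration, not the KVCacheManager itself.

```python
import hashlib

BLOCK_SIZE = 4  # tokens per KV block (illustrative value)


def block_hashes(token_ids: list[int]) -> list[str]:
    """Hash each full block chained with the hash of the preceding block."""
    hashes = []
    parent = ""
    for i in range(len(token_ids) // BLOCK_SIZE):
        block = token_ids[i * BLOCK_SIZE:(i + 1) * BLOCK_SIZE]
        payload = parent + ":" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


class ToyPrefixCache:
    def __init__(self) -> None:
        self.hash_to_block: dict[str, int] = {}
        self.next_block_id = 0

    def allocate(self, token_ids: list[int]) -> tuple[list[int], int]:
        """Return (block_ids, num_cached_tokens) for the full blocks of a prompt."""
        block_ids = []
        num_hits = 0
        # With chained hashes and no eviction, hits always form a prefix.
        for h in block_hashes(token_ids):
            if h in self.hash_to_block:
                num_hits += 1
            else:
                self.hash_to_block[h] = self.next_block_id
                self.next_block_id += 1
            block_ids.append(self.hash_to_block[h])
        return block_ids, num_hits * BLOCK_SIZE


cache = ToyPrefixCache()
prompt = list(range(10))                # 2 full blocks, 2 tokens left over
first = cache.allocate(prompt)          # cold cache: nothing reused
repeats = [cache.allocate(prompt) for _ in range(3)]  # same prompt three times
branch = cache.allocate(list(range(4)) + [99] * 4)    # shares only block 0
```

The three repeated allocations reuse both blocks of the first one, which is how N requests with the same prompt end up sharing one cached prefix.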

5. torch.compile by default

The team commits to TorchInductor as the model graph compiler. Per docs/design/torch_compile.md, V1 compiles each model with torch.compile, drives CUDA graphs through Inductor partitions, and uses compile-time fusion passes (RMSNorm+quant, RoPE+QKnorm, etc.) instead of hand-fused kernels. This is the bet that pays off in Chapter 060 (more fusions), Chapter 070 (multimodal compile), and Chapter 080 (vLLM IR replacing custom ops).

6. LLM.sleep / LLM.wake_up / LLM.collective_rpc for RLHF

These APIs (#12361, #12084, #12284) let a training framework pause the engine, swap weights in via NCCL, and resume without tearing the process down. This is what makes vLLM the de-facto rollout engine for RLHF (TRL, OpenRLHF, etc.). See the AGENTS.md note: the project explicitly invests in “RLHF-friendly” features in this era.

7. The DeepSeek-V3 / R1 wave forces MLA + EP/DP through V1 (Feb–Mar 2025)

DeepSeek-V3 dropped at the end of 2024 and DeepSeek-R1 in early 2025. Both:

  • Use Multi-head Latent Attention (MLA), which has a very different KV-cache shape from MHA/GQA (compressed latents projected up per-head).
  • Are MoE with hundreds of experts → benefit massively from expert parallelism (EP) and DP-attention.
  • Want MTP (multi-token prediction) for spec decode.

The team races to support all of this on V1, not on V0. Releases v0.7.x and v0.8.0 read like a single sustained DeepSeek-V3 enablement push: FlashMLA, MLA + chunked prefill (#12639), MLA on V1 (#13789 etc.), EP support (#12583), enable_expert_parallel (#14305), EP/TP MoE + DP attention (#13931), MTP (#13626). These features land on V1 because V1’s data structures (per-layer KV manager, pluggable attention backend, stateless process group) made them tractable. See docs/design/hybrid_kv_cache_manager.md.
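The difference in KV-cache shape is easiest to see as per-token arithmetic. The sketch below counts cached elements per token per layer, assuming MHA/GQA stores a key and a value vector per KV head while MLA stores one compressed latent plus a small RoPE component. The concrete sizes (128 heads, head dimension 128, latent rank 512, RoPE dimension 64) are DeepSeek-V3-like values used for illustration; treat them as assumptions and check the model config before relying on them.

```python
def gqa_kv_elems_per_token(num_kv_heads: int, head_dim: int) -> int:
    """MHA/GQA caches one key and one value vector per KV head."""
    return 2 * num_kv_heads * head_dim


def mla_kv_elems_per_token(kv_lora_rank: int, rope_head_dim: int) -> int:
    """MLA caches a shared compressed latent plus a decoupled RoPE key part."""
    return kv_lora_rank + rope_head_dim


# Illustrative, DeepSeek-V3-like sizes (assumed, not read from a config file).
mha = gqa_kv_elems_per_token(num_kv_heads=128, head_dim=128)
gqa = gqa_kv_elems_per_token(num_kv_heads=8, head_dim=128)
mla = mla_kv_elems_per_token(kv_lora_rank=512, rope_head_dim=64)

BYTES_PER_ELEM = 2  # bf16
for name, elems in [("MHA", mha), ("GQA-8", gqa), ("MLA", mla)]:
    print(f"{name}: {elems} elements, {elems * BYTES_PER_ELEM} bytes per token per layer")
```

Because the MLA entry is a single vector per token rather than a pair per head, a block layout that assumes "heads × head_dim × {K, V}" has no natural slot for it, which is the mismatch a per-layer cache specification is meant to absorb.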

Why move to a multi-process architecture? Doesn’t ZMQ add latency?

ZMQ adds microseconds; Python’s GIL on the API server adds milliseconds at serving load. Splitting frees the API server’s event loop from the engine’s hot path, lets the engine run a tight C++/CUDA-bound busy loop, and makes data-parallelism a first-class concept (one engine core per DP rank). The latency cost is paid once per request boundary, not per step.

Why is the API server count auto-scaled to DP size?

If you have 4 DP engine cores and 1 API server, the API server becomes the bottleneck under high QPS, which is exactly the failure mode V0 had. Scaling it to match means routing throughput grows with engine throughput. The flag --api-server-count exists for users who want to override.

Why is torch.compile on by default (opt-out) in V1 instead of opt-in?

Because every other architectural decision assumes the model goes through compile (graph-level fusions, CUDA graph partitioning, custom-op replacement, piecewise compilation for spec-decode). Making it opt-in would mean every kernel-fusion code path needs an “and also do this manually” branch. RFC #16501 makes this explicit (“vLLM x torch.compile caching should be opt-out by default”).

What did V1 give up?

Beam search (now emulated externally), some V0-only model implementations (the SupportsV0Only protocol exists explicitly so the team could ship V1 without waiting for every model), and a long tail of V0-only features (some sampler options, structured output paths, etc.) that get re-added incrementally in v0.7.x and v0.8.x. The release notes for v0.7.0 list each one.

Why was DeepSeek-V3 enablement done on V1 and not V0?

Because MLA’s KV layout doesn’t fit V0’s BlockSpaceManager cleanly: V0’s block layout is fundamentally MHA/GQA-shaped. V1’s KVCacheManager was designed with per-layer cache types as a first-class concept (RFC mentions “regular KV cache, Mamba cache, encoder cache” in its goals), so MLA fits naturally. This is also why hybrid SSM/Attention models (Jamba, Mamba, Qwen3-Next) only really work in V1.