[Bugfix][KVConnector] Fix SimpleCPUOffloadConnector GPU->CPU store race by Saddss · Pull Request #46278 · vllm-project/vllm

Saddss · 2026-06-21T09:19:17Z

Purpose

SimpleCPUOffloadConnector defers both loads and stores to get_finished(), which runs after model execution. It assumes that blocks whose tokens are host-side "confirmed" by the scheduler are also safe to read on the device, and therefore launches the GPU->CPU store with no cross-stream synchronization.

That assumption does not hold under v1 overlapped execution. The host runs ahead of the GPU, so the KV-write kernels for the blocks being stored may still be in flight on the compute stream when the store is launched on a separate copy stream. The copy backend additionally uses srcAccessOrder=ANY for every transfer, which allows the DMA engine to read the source before prior writes to it have completed.

The combination lets the store read partially written or stale KV blocks and silently corrupt the CPU cache. There is no error or crash; the only symptom is garbled output on later requests that hit the corrupted CPU blocks.

This is the same class of hazard that #31341 fixed for OffloadingConnector; SimpleCPUOffloadConnector reintroduced it because it dropped the cross-stream barrier.

Fix

Record a compute-done event on the current (compute) stream in get_finished() and have the copy backend wait on it before issuing the GPU->CPU store, so the store is ordered after the kernels that wrote those KV blocks.
Use srcAccessOrder=STREAM for stores (the source is the live KV cache) and keep srcAccessOrder=ANY for loads (the source is stable pinned host memory).
Loads are unchanged: they still launch immediately with no barrier, since they read from stable pinned host memory.

The fix preserves the connector's existing design (precise cudaHostRegister pinning, batched DMA, background copy thread) and only adds the missing happens-before edge for stores.

Test Plan

New unit test tests/v1/simple_kv_offload/test_worker.py:
- test_store_orders_after_compute_write: drives DmaCopyBackend directly and uses torch.cuda._sleep to deterministically delay the KV write. Asserts both directions so it is self-validating: the no-barrier control must corrupt (proving the race window is exercised) and the fixed path must be clean.
- test_get_finished_passes_wait_event_for_store_only: asserts get_finished() passes a compute-done event for stores and None for loads.
- test_build_params_src_access_order: asserts build_params defaults to ANY and honors an explicit STREAM override.
Command: pytest tests/v1/simple_kv_offload/test_worker.py -v

Test Result

Unit tests: 3 passed on the patched build. The same test file fails to even import against an unpatched build (the new symbols do not exist), confirming it covers the change.
Deterministic race microbenchmark over 150 iterations: without the barrier the store corrupts 150/150 iterations; with the barrier 0/150.
End-to-end with the production workload that originally surfaced [Bug]: SimpleCPUOffloadConnector corrupts restored KV under load → intermittent garbled output (missing GPU→CPU store cross-stream sync) #45704: garbled output drops from a steady non-zero rate to 0 with the fix applied.
Performance A/B with CPU offload enabled (same workload): output throughput and mean TPOT are within run-to-run noise (output throughput -0.47%, mean TPOT -0.1%), i.e. no measurable regression.

SimpleCPUOffloadConnector defers both loads and stores to get_finished(), which runs after model execution, and assumes that blocks whose tokens are host-side "confirmed" are also safe to read on the device. Under v1 overlapped execution the host runs ahead of the GPU, so the KV-write kernels for those blocks may still be in flight on the compute stream when the GPU->CPU store is launched on a separate copy stream. The store can then read partially written or stale KV blocks and silently corrupt the CPU cache, producing garbled output on later CPU cache hits. Order the store after the compute stream: record a compute-done event on the current stream in get_finished() and have the copy backend wait on it before issuing the GPU->CPU copy. Stores now use srcAccessOrder=STREAM (the source is the live KV cache); loads keep srcAccessOrder=ANY (the source is stable pinned host memory) and still launch immediately with no cross-stream barrier. Signed-off-by: Saddss <28726669061@qq.com>

aoshen02 · 2026-06-21T12:43:41Z

Hi, thanks for the contribution, I wonder do you guys ever try using mooncake store connector?

Saddss · 2026-06-21T13:57:17Z

Hi, thanks for the contribution, I wonder do you guys ever try using mooncake store connector?嗨，感谢您的贡献，我想知道你们有没有尝试过使用 mooncake store connector？

Thanks for the pointer! Yes, we're aware of MooncakeStoreConnector and it's a great piece of work — for setups with RDMA it clearly offers more (cross-instance prefix sharing, SSD tiering, GPUDirect transfers).

Our deployment is a bit of an awkward fit for it, though: the fleet is all RTX 5090 with no RDMA NICs. Mooncake's main advantage is the RDMA / GPUDirect path, and without an RDMA HCA the transfer engine falls back to TCP. We did test Mooncake as a shared DRAM pool, but over TCP the transfer bandwidth couldn't keep up with recompute — a KV cache hit ended up slower than just recomputing the prefix, so it was a net loss for us.

For a single-node, no-RDMA setup, SimpleCPUOffloadConnector fits well: it does a direct cuMemcpyBatchAsync between GPU HBM and local pinned host memory over PCIe, with no network stack in the path, so it saturates PCIe bandwidth.

Saddss · 2026-06-21T14:31:35Z

Contributor

Another reason we depend on SimpleCPUOffloadConnector specifically is precise host memory allocation. Our nodes have a hard whole-machine memory budget, so the CPU KV pool has to use exactly what we configure in order to maximize usable cache.

The legacy CPU offload path can't give us that, because of PyTorch's pinned caching allocator. It allocates the host pool as one torch.zeros(..., pin_memory=True) per KV cache tensor (vllm/v1/kv_offload/cpu/gpu_worker.py), and that allocator rounds every allocation up to the next power of two (roundSize = c10::llvm::PowerOf2Ceil(size) in aten/src/ATen/core/CachingHostAllocator.h; see also pytorch/pytorch#150517). Since the rounding is per tensor, the overshoot compounds. For our gemma-4 setup the pool is split into 5 CPU tensors: budgeting 60 GB gives ~12 GB per tensor, and PowerOf2Ceil(12 GB) = 16 GB, so the host actually allocates 5 × 16 = 80 GB. In the worst case (each tensor just above a power-of-two boundary) this approaches a 2x blow-up — under a fixed memory ceiling that either wastes a large fraction of the cache budget or pushes us into OOM.

SimpleCPUOffloadConnector avoids this by construction: it allocates plain (unpinned) memory and then pins it with cudaHostRegister, bypassing the caching host allocator entirely, so the pool is exactly the size we configure (55 GB stays 55 GB, not 64 GB). For a tight memory budget that exact sizing is a concrete operational reason we rely on this connector — and why the correctness fix in this PR matters for us.

ivanium

Thanks for the fix! Overall good; left a few comments

Reuse a single compute-done event across steps instead of allocating one per step, and move the load/store ordering rationale from inline comments into the get_finished docstring. Keep srcAccessOrder=STREAM for GPU->CPU stores (the source is the live KV cache; ANY may prefetch the source before the wait_event barrier fires), matching what OffloadingConnector landed in vllm-project#39306. Refresh the stale BatchMemcpyParams comment and reference the root issue vllm-project#45704. Signed-off-by: Saddss <28726669061@qq.com>

ivanium

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

Drop the duplicated srcAccessOrder rationale (kept only on the enum definition), remove the copy-loop gate comment, and condense the store event and STREAM/ANY notes. No functional change. Signed-off-by: Saddss <28726669061@qq.com>

The STREAM-for-stores/ANY-for-loads mapping was restated in the params struct and build_params on top of the call site. Keep the per-direction rationale only at the decision point (copy_backend) and the semantics on the enum; drop the two restatements. No functional change. Signed-off-by: Saddss <28726669061@qq.com>

Saddss · 2026-06-22T07:22:38Z

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review.

ivanium · 2026-06-22T07:56:44Z

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review.

No worries and thanks for the nice fix! I have enabled CI and it should be merged once CI passes which I don't anticipate any issues

Saddss · 2026-06-22T09:48:02Z

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review.

No worries and thanks for the nice fix! I have enabled CI and it should be merged once CI passes which I don't anticipate any issues

There has only two failures, and both look unrelated to this PR:
e2e-core-1-gpu: test_cascade_attention[FLASH_ATTN] — minor output string mismatch
metrics-tracing-2-gpus: test_traces — SIGSEGV after generation finished
Could you retry those two failed jobs? @ivanium

Saddss · 2026-06-22T11:16:56Z

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review.

No worries and thanks for the nice fix! I have enabled CI and it should be merged once CI passes which I don't anticipate any issues

There has only two failures, and both look unrelated to this PR: e2e-core-1-gpu: test_cascade_attention[FLASH_ATTN] — minor output string mismatch metrics-tracing-2-gpus: test_traces — SIGSEGV after generation finished Could you retry those two failed jobs? @ivanium

CI is all green on. Could you help get it merged？Thanks😊

Order GPU->CPU stores after a compute-done event and use STREAM srcAccessOrder for stores (ANY for loads). Fixes silent KV corruption under concurrent load reported in vllm-project#45704 (upstream vllm-project#46278).

…ce (vllm-project#46278)

…ce (vllm-project#46278) Signed-off-by: Qiang Li <qiang.li2@amd.com>

Wire SimpleCPUOffloadConnector into the offloading stats / labeled-metrics API so its CPU KV transfers report through the shared vllm:kv_offload_* family (store/load bytes/time/size), plus per-connector CPU-pool gauges (vllm:simple_cpu_offload_*: total/free/used blocks, usage, pending loads/stores). Transfer timing brackets only the DMA: host-side batch-argument preparation is split out of copy_blocks() into prepare_copy()/launch_prepared_copy() and runs before the start CUDA event, and the vllm-project#46278 compute-done wait stays ordered before it, so kv_offload_*_time excludes host enqueue and compute-wait latency. Timing events are recycled via a small pool. Supersedes vllm-project#41790 (built on the pre-vllm-project#45957 metrics API); preserves vllm-project#43877 (scheduler post-completion stats aggregation) and vllm-project#46278 (GPU->CPU store barrier + srcAccessOrder). Tests: tests/v1/simple_kv_offload/test_metrics.py covers stats helpers/reduce, prom observe() end-to-end, serialized round-trip, scheduler pool gauges, worker transfer recording + reset-on-read, and event-pool reuse (no GPU required). AI assistance was used; every line was human-reviewed. Co-authored-by: OCWC22 <OCWC22@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Change72 <cguo51@asu.edu>

Wire SimpleCPUOffloadConnector into the offloading stats / labeled-metrics API so its CPU KV transfers report through the shared vllm:kv_offload_* family (store/load bytes/time/size), plus per-connector CPU-pool gauges (vllm:simple_cpu_offload_*: total/free/used blocks, usage, pending loads/stores). Transfer timing brackets only the DMA: host-side batch-argument preparation is split out of copy_blocks() into prepare_copy()/launch_prepared_copy() and runs before the start CUDA event; the vllm-project#46278 compute-done wait is enqueued before the prepare so it captures the shared compute event's current state and stays outside the timing bracket. So kv_offload_*_time excludes host enqueue and compute-wait latency. Timing events are recycled via a pool. Supersedes vllm-project#41790 (built on the pre-vllm-project#45957 metrics API); preserves vllm-project#43877 (scheduler post-completion stats aggregation) and vllm-project#46278 (GPU->CPU store barrier + srcAccessOrder). Tests: tests/v1/simple_kv_offload/test_metrics.py covers stats build/reduce, prom observe() end-to-end (transfer + pool gauges), scheduler pool gauges, worker transfer recording + reset-on-read, and event-pool reuse (no GPU required); tests/v1/simple_kv_offload/test_worker.py store-ordering still passes on GPU. AI assistance was used. Co-authored-by: OCWC22 <OCWC22@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Change72 <cguo51@asu.edu>

mergify Bot added nvidia v1 bug Something isn't working labels Jun 21, 2026

github-project-automation Bot added this to NVIDIA Jun 21, 2026

ZJY0516 requested a review from ivanium June 21, 2026 09:29

ivanium requested changes Jun 22, 2026

View reviewed changes

Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py Outdated

Comment thread vllm/v1/simple_kv_offload/worker.py Outdated

Comment thread vllm/v1/simple_kv_offload/worker.py Outdated

Comment thread vllm/v1/simple_kv_offload/worker.py Outdated

github-project-automation Bot moved this to In review in NVIDIA Jun 22, 2026

ivanium approved these changes Jun 22, 2026

View reviewed changes

Comment thread vllm/v1/simple_kv_offload/copy_backend.py Outdated

Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py Outdated

Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py Outdated

[Bugfix][KVConnector] Trim verbose comments per review

16172e9

Drop the duplicated srcAccessOrder rationale (kept only on the enum definition), remove the copy-loop gate comment, and condense the store event and STREAM/ANY notes. No functional change. Signed-off-by: Saddss <28726669061@qq.com>

ivanium added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 22, 2026

majunze2001 mentioned this pull request Jun 22, 2026

[SimpleCPUOffloadConnector] Fix remaining global→block conversions under PCP/DCP #46394

Merged

WoosukKwon approved these changes Jun 22, 2026

View reviewed changes

github-project-automation Bot moved this from In review to Ready in NVIDIA Jun 22, 2026

WoosukKwon merged commit 82ede09 into vllm-project:main Jun 22, 2026
68 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 22, 2026

Saddss deleted the fix/simple-cpu-offload-store-cross-stream-sync branch June 23, 2026 03:12

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Bugfix][KVConnector] Fix SimpleCPUOffloadConnector GPU->CPU store ra…

f69176e

…ce (vllm-project#46278)

qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026

[Bugfix][KVConnector] Fix SimpleCPUOffloadConnector GPU->CPU store ra…

dc9426a

…ce (vllm-project#46278) Signed-off-by: Qiang Li <qiang.li2@amd.com>

Change72 mentioned this pull request Jul 1, 2026

[KV Offload] Expose SimpleCPU offload metrics #41790

Open

This was referenced Jul 1, 2026

[Bug] SimpleCPUOffload: residual output garbling under high concurrency — load path lacks cross-stream sync with compute (WAR), symmetric gap left by #46278 #47282

Open

[Bugfix][V1] SimpleCPUOffload: order CPU->GPU loads after compute (WAR fix) #47291

Open

ZAID646 mentioned this pull request Jul 1, 2026

Fix SimpleCPUOffload load-path race — add cross-stream sync (WAR) #47324

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bugfix][KVConnector] Fix SimpleCPUOffloadConnector GPU->CPU store race#46278

[Bugfix][KVConnector] Fix SimpleCPUOffloadConnector GPU->CPU store race#46278
WoosukKwon merged 4 commits into
vllm-project:mainfrom
Saddss:fix/simple-cpu-offload-store-cross-stream-sync

Saddss commented Jun 21, 2026

aoshen02 commented Jun 21, 2026

Saddss commented Jun 21, 2026

Saddss commented Jun 21, 2026

ivanium left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ivanium left a comment

Uh oh!

Uh oh!

Uh oh!

Saddss commented Jun 22, 2026

ivanium commented Jun 22, 2026

Saddss commented Jun 22, 2026

Saddss commented Jun 22, 2026

Uh oh!

Labels

4 participants

Uh oh!

Uh oh!

Conversation

Saddss commented Jun 21, 2026

Purpose

Fix

Test Plan

Test Result

aoshen02 commented Jun 21, 2026

Saddss commented Jun 21, 2026

Saddss commented Jun 21, 2026

ivanium left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ivanium left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Saddss commented Jun 22, 2026

ivanium commented Jun 22, 2026

Saddss commented Jun 22, 2026

Saddss commented Jun 22, 2026

Uh oh!

Labels

4 participants