Skip to content

[Bugfix][KVConnector] Fix SimpleCPUOffloadConnector GPU->CPU store race#46278

Merged
WoosukKwon merged 4 commits into
vllm-project:mainfrom
Saddss:fix/simple-cpu-offload-store-cross-stream-sync
Jun 22, 2026
Merged

[Bugfix][KVConnector] Fix SimpleCPUOffloadConnector GPU->CPU store race#46278
WoosukKwon merged 4 commits into
vllm-project:mainfrom
Saddss:fix/simple-cpu-offload-store-cross-stream-sync

Conversation

@Saddss

@Saddss Saddss commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Purpose

Fixes #45704.

SimpleCPUOffloadConnector defers both loads and stores to get_finished(), which runs after model execution. It assumes that blocks whose tokens are host-side "confirmed" by the scheduler are also safe to read on the device, and therefore launches the GPU->CPU store with no cross-stream synchronization.

That assumption does not hold under v1 overlapped execution. The host runs ahead of the GPU, so the KV-write kernels for the blocks being stored may still be in flight on the compute stream when the store is launched on a separate copy stream. The copy backend additionally uses srcAccessOrder=ANY for every transfer, which allows the DMA engine to read the source before prior writes to it have completed.

The combination lets the store read partially written or stale KV blocks and silently corrupt the CPU cache. There is no error or crash; the only symptom is garbled output on later requests that hit the corrupted CPU blocks.

This is the same class of hazard that #31341 fixed for OffloadingConnector; SimpleCPUOffloadConnector reintroduced it because it dropped the cross-stream barrier.

Fix

  • Record a compute-done event on the current (compute) stream in get_finished() and have the copy backend wait on it before issuing the GPU->CPU store, so the store is ordered after the kernels that wrote those KV blocks.
  • Use srcAccessOrder=STREAM for stores (the source is the live KV cache) and keep srcAccessOrder=ANY for loads (the source is stable pinned host memory).
  • Loads are unchanged: they still launch immediately with no barrier, since they read from stable pinned host memory.

The fix preserves the connector's existing design (precise cudaHostRegister pinning, batched DMA, background copy thread) and only adds the missing happens-before edge for stores.

Test Plan

  • New unit test tests/v1/simple_kv_offload/test_worker.py:
    • test_store_orders_after_compute_write: drives DmaCopyBackend directly and uses torch.cuda._sleep to deterministically delay the KV write. Asserts both directions so it is self-validating: the no-barrier control must corrupt (proving the race window is exercised) and the fixed path must be clean.
    • test_get_finished_passes_wait_event_for_store_only: asserts get_finished() passes a compute-done event for stores and None for loads.
    • test_build_params_src_access_order: asserts build_params defaults to ANY and honors an explicit STREAM override.
  • Command: pytest tests/v1/simple_kv_offload/test_worker.py -v

Test Result

  • Unit tests: 3 passed on the patched build. The same test file fails to even import against an unpatched build (the new symbols do not exist), confirming it covers the change.
  • Deterministic race microbenchmark over 150 iterations: without the barrier the store corrupts 150/150 iterations; with the barrier 0/150.
  • End-to-end with the production workload that originally surfaced [Bug]: SimpleCPUOffloadConnector corrupts restored KV under load → intermittent garbled output (missing GPU→CPU store cross-stream sync) #45704: garbled output drops from a steady non-zero rate to 0 with the fix applied.
  • Performance A/B with CPU offload enabled (same workload): output throughput and mean TPOT are within run-to-run noise (output throughput -0.47%, mean TPOT -0.1%), i.e. no measurable regression.
SimpleCPUOffloadConnector defers both loads and stores to get_finished(),
which runs after model execution, and assumes that blocks whose tokens are
host-side "confirmed" are also safe to read on the device. Under v1
overlapped execution the host runs ahead of the GPU, so the KV-write
kernels for those blocks may still be in flight on the compute stream when
the GPU->CPU store is launched on a separate copy stream. The store can
then read partially written or stale KV blocks and silently corrupt the
CPU cache, producing garbled output on later CPU cache hits.

Order the store after the compute stream: record a compute-done event on
the current stream in get_finished() and have the copy backend wait on it
before issuing the GPU->CPU copy. Stores now use srcAccessOrder=STREAM
(the source is the live KV cache); loads keep srcAccessOrder=ANY (the
source is stable pinned host memory) and still launch immediately with no
cross-stream barrier.

Signed-off-by: Saddss <28726669061@qq.com>
@mergify mergify Bot added nvidia v1 bug Something isn't working labels Jun 21, 2026
@ZJY0516 ZJY0516 requested a review from ivanium June 21, 2026 09:29
@aoshen02

Copy link
Copy Markdown
Collaborator

Hi, thanks for the contribution, I wonder do you guys ever try using mooncake store connector?

@Saddss

Saddss commented Jun 21, 2026

Copy link
Copy Markdown
Contributor Author

Hi, thanks for the contribution, I wonder do you guys ever try using mooncake store connector?嗨,感谢您的贡献,我想知道你们有没有尝试过使用 mooncake store connector?

Thanks for the pointer! Yes, we're aware of MooncakeStoreConnector and it's a great piece of work — for setups with RDMA it clearly offers more (cross-instance prefix sharing, SSD tiering, GPUDirect transfers).

Our deployment is a bit of an awkward fit for it, though: the fleet is all RTX 5090 with no RDMA NICs. Mooncake's main advantage is the RDMA / GPUDirect path, and without an RDMA HCA the transfer engine falls back to TCP. We did test Mooncake as a shared DRAM pool, but over TCP the transfer bandwidth couldn't keep up with recompute — a KV cache hit ended up slower than just recomputing the prefix, so it was a net loss for us.

For a single-node, no-RDMA setup, SimpleCPUOffloadConnector fits well: it does a direct cuMemcpyBatchAsync between GPU HBM and local pinned host memory over PCIe, with no network stack in the path, so it saturates PCIe bandwidth.

@Saddss

Saddss commented Jun 21, 2026

Copy link
Copy Markdown
Contributor Author

Contributor

Another reason we depend on SimpleCPUOffloadConnector specifically is precise host memory allocation. Our nodes have a hard whole-machine memory budget, so the CPU KV pool has to use exactly what we configure in order to maximize usable cache.

The legacy CPU offload path can't give us that, because of PyTorch's pinned caching allocator. It allocates the host pool as one torch.zeros(..., pin_memory=True) per KV cache tensor (vllm/v1/kv_offload/cpu/gpu_worker.py), and that allocator rounds every allocation up to the next power of two (roundSize = c10::llvm::PowerOf2Ceil(size) in aten/src/ATen/core/CachingHostAllocator.h; see also pytorch/pytorch#150517). Since the rounding is per tensor, the overshoot compounds. For our gemma-4 setup the pool is split into 5 CPU tensors: budgeting 60 GB gives ~12 GB per tensor, and PowerOf2Ceil(12 GB) = 16 GB, so the host actually allocates 5 × 16 = 80 GB. In the worst case (each tensor just above a power-of-two boundary) this approaches a 2x blow-up — under a fixed memory ceiling that either wastes a large fraction of the cache budget or pushes us into OOM.

SimpleCPUOffloadConnector avoids this by construction: it allocates plain (unpinned) memory and then pins it with cudaHostRegister, bypassing the caching host allocator entirely, so the pool is exactly the size we configure (55 GB stays 55 GB, not 64 GB). For a tight memory budget that exact sizing is a concrete operational reason we rely on this connector — and why the correctness fix in this PR matters for us.

@ivanium ivanium left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! Overall good; left a few comments

Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py Outdated
Comment thread vllm/v1/simple_kv_offload/worker.py Outdated
Comment thread vllm/v1/simple_kv_offload/worker.py Outdated
Comment thread vllm/v1/simple_kv_offload/worker.py Outdated
@github-project-automation github-project-automation Bot moved this to In review in NVIDIA Jun 22, 2026
Reuse a single compute-done event across steps instead of allocating one
per step, and move the load/store ordering rationale from inline comments
into the get_finished docstring.

Keep srcAccessOrder=STREAM for GPU->CPU stores (the source is the live KV
cache; ANY may prefetch the source before the wait_event barrier fires),
matching what OffloadingConnector landed in vllm-project#39306. Refresh the stale
BatchMemcpyParams comment and reference the root issue vllm-project#45704.

Signed-off-by: Saddss <28726669061@qq.com>

@ivanium ivanium left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

Comment thread vllm/v1/simple_kv_offload/copy_backend.py Outdated
Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py Outdated
Comment thread vllm/v1/simple_kv_offload/cuda_mem_ops.py Outdated
Drop the duplicated srcAccessOrder rationale (kept only on the enum
definition), remove the copy-loop gate comment, and condense the store
event and STREAM/ANY notes. No functional change.

Signed-off-by: Saddss <28726669061@qq.com>
@ivanium ivanium added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 22, 2026
The STREAM-for-stores/ANY-for-loads mapping was restated in the params
struct and build_params on top of the call site. Keep the per-direction
rationale only at the decision point (copy_backend) and the semantics on
the enum; drop the two restatements. No functional change.

Signed-off-by: Saddss <28726669061@qq.com>
@Saddss

Saddss commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review.

@ivanium

ivanium commented Jun 22, 2026

Copy link
Copy Markdown
Collaborator

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review.

No worries and thanks for the nice fix! I have enabled CI and it should be merged once CI passes which I don't anticipate any issues

@Saddss

Saddss commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review.

No worries and thanks for the nice fix! I have enabled CI and it should be merged once CI passes which I don't anticipate any issues

There has only two failures, and both look unrelated to this PR:
e2e-core-1-gpu: test_cascade_attention[FLASH_ATTN] — minor output string mismatch
metrics-tracing-2-gpus: test_traces — SIGSEGV after generation finished
Could you retry those two failed jobs? @ivanium

@Saddss

Saddss commented Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.

You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review.

No worries and thanks for the nice fix! I have enabled CI and it should be merged once CI passes which I don't anticipate any issues

There has only two failures, and both look unrelated to this PR: e2e-core-1-gpu: test_cascade_attention[FLASH_ATTN] — minor output string mismatch metrics-tracing-2-gpus: test_traces — SIGSEGV after generation finished Could you retry those two failed jobs? @ivanium

CI is all green on. Could you help get it merged?Thanks😊

@github-project-automation github-project-automation Bot moved this from In review to Ready in NVIDIA Jun 22, 2026
@WoosukKwon WoosukKwon merged commit 82ede09 into vllm-project:main Jun 22, 2026
68 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 22, 2026
@Saddss Saddss deleted the fix/simple-cpu-offload-store-cross-stream-sync branch June 23, 2026 03:12
Saddss pushed a commit to FlowGPT/vllm that referenced this pull request Jun 23, 2026
Order GPU->CPU stores after a compute-done event and use STREAM
srcAccessOrder for stores (ANY for loads). Fixes silent KV corruption
under concurrent load reported in vllm-project#45704 (upstream vllm-project#46278).
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
Change72 added a commit to Change72/vllm that referenced this pull request Jul 1, 2026
Wire SimpleCPUOffloadConnector into the offloading stats / labeled-metrics
API so its CPU KV transfers report through the shared vllm:kv_offload_*
family (store/load bytes/time/size), plus per-connector CPU-pool gauges
(vllm:simple_cpu_offload_*: total/free/used blocks, usage, pending
loads/stores).

Transfer timing brackets only the DMA: host-side batch-argument preparation
is split out of copy_blocks() into prepare_copy()/launch_prepared_copy() and
runs before the start CUDA event, and the vllm-project#46278 compute-done wait stays
ordered before it, so kv_offload_*_time excludes host enqueue and
compute-wait latency. Timing events are recycled via a small pool.

Supersedes vllm-project#41790 (built on the pre-vllm-project#45957 metrics API); preserves vllm-project#43877
(scheduler post-completion stats aggregation) and vllm-project#46278 (GPU->CPU store
barrier + srcAccessOrder).

Tests: tests/v1/simple_kv_offload/test_metrics.py covers stats
helpers/reduce, prom observe() end-to-end, serialized round-trip, scheduler
pool gauges, worker transfer recording + reset-on-read, and event-pool reuse
(no GPU required). AI assistance was used; every line was human-reviewed.

Co-authored-by: OCWC22 <OCWC22@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Change72 <cguo51@asu.edu>
Change72 added a commit to Change72/vllm that referenced this pull request Jul 1, 2026
Wire SimpleCPUOffloadConnector into the offloading stats / labeled-metrics
API so its CPU KV transfers report through the shared vllm:kv_offload_*
family (store/load bytes/time/size), plus per-connector CPU-pool gauges
(vllm:simple_cpu_offload_*: total/free/used blocks, usage, pending
loads/stores).

Transfer timing brackets only the DMA: host-side batch-argument preparation
is split out of copy_blocks() into prepare_copy()/launch_prepared_copy() and
runs before the start CUDA event; the vllm-project#46278 compute-done wait is enqueued
before the prepare so it captures the shared compute event's current state
and stays outside the timing bracket. So kv_offload_*_time excludes host
enqueue and compute-wait latency. Timing events are recycled via a pool.

Supersedes vllm-project#41790 (built on the pre-vllm-project#45957 metrics API); preserves vllm-project#43877
(scheduler post-completion stats aggregation) and vllm-project#46278 (GPU->CPU store
barrier + srcAccessOrder).

Tests: tests/v1/simple_kv_offload/test_metrics.py covers stats build/reduce,
prom observe() end-to-end (transfer + pool gauges), scheduler pool gauges,
worker transfer recording + reset-on-read, and event-pool reuse (no GPU
required); tests/v1/simple_kv_offload/test_worker.py store-ordering still
passes on GPU. AI assistance was used.

Co-authored-by: OCWC22 <OCWC22@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Change72 <cguo51@asu.edu>
Change72 added a commit to Change72/vllm that referenced this pull request Jul 1, 2026
Wire SimpleCPUOffloadConnector into the offloading stats / labeled-metrics
API so its CPU KV transfers report through the shared vllm:kv_offload_*
family (store/load bytes/time/size), plus per-connector CPU-pool gauges
(vllm:simple_cpu_offload_*: total/free/used blocks, usage, pending
loads/stores).

Transfer timing brackets only the DMA: host-side batch-argument preparation
is split out of copy_blocks() into prepare_copy()/launch_prepared_copy() and
runs before the start CUDA event; the vllm-project#46278 compute-done wait is enqueued
before the prepare so it captures the shared compute event's current state
and stays outside the timing bracket. So kv_offload_*_time excludes host
enqueue and compute-wait latency. Timing events are recycled via a pool.

Supersedes vllm-project#41790 (built on the pre-vllm-project#45957 metrics API); preserves vllm-project#43877
(scheduler post-completion stats aggregation) and vllm-project#46278 (GPU->CPU store
barrier + srcAccessOrder).

Tests: tests/v1/simple_kv_offload/test_metrics.py covers stats build/reduce,
prom observe() end-to-end (transfer + pool gauges), scheduler pool gauges,
worker transfer recording + reset-on-read, and event-pool reuse (no GPU
required); tests/v1/simple_kv_offload/test_worker.py store-ordering still
passes on GPU. AI assistance was used.

Co-authored-by: OCWC22 <OCWC22@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Change72 <cguo51@asu.edu>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working nvidia ready ONLY add when PR is ready to merge/full CI is needed v1

4 participants