[Bugfix][KVConnector] Fix SimpleCPUOffloadConnector GPU->CPU store race#46278
Conversation
SimpleCPUOffloadConnector defers both loads and stores to get_finished(), which runs after model execution, and assumes that blocks whose tokens are host-side "confirmed" are also safe to read on the device. Under v1 overlapped execution the host runs ahead of the GPU, so the KV-write kernels for those blocks may still be in flight on the compute stream when the GPU->CPU store is launched on a separate copy stream. The store can then read partially written or stale KV blocks and silently corrupt the CPU cache, producing garbled output on later CPU cache hits. Order the store after the compute stream: record a compute-done event on the current stream in get_finished() and have the copy backend wait on it before issuing the GPU->CPU copy. Stores now use srcAccessOrder=STREAM (the source is the live KV cache); loads keep srcAccessOrder=ANY (the source is stable pinned host memory) and still launch immediately with no cross-stream barrier. Signed-off-by: Saddss <28726669061@qq.com>
|
Hi, thanks for the contribution, I wonder do you guys ever try using mooncake store connector? |
Thanks for the pointer! Yes, we're aware of MooncakeStoreConnector and it's a great piece of work — for setups with RDMA it clearly offers more (cross-instance prefix sharing, SSD tiering, GPUDirect transfers). Our deployment is a bit of an awkward fit for it, though: the fleet is all RTX 5090 with no RDMA NICs. Mooncake's main advantage is the RDMA / GPUDirect path, and without an RDMA HCA the transfer engine falls back to TCP. We did test Mooncake as a shared DRAM pool, but over TCP the transfer bandwidth couldn't keep up with recompute — a KV cache hit ended up slower than just recomputing the prefix, so it was a net loss for us. For a single-node, no-RDMA setup, SimpleCPUOffloadConnector fits well: it does a direct cuMemcpyBatchAsync between GPU HBM and local pinned host memory over PCIe, with no network stack in the path, so it saturates PCIe bandwidth. |
Another reason we depend on SimpleCPUOffloadConnector specifically is precise host memory allocation. Our nodes have a hard whole-machine memory budget, so the CPU KV pool has to use exactly what we configure in order to maximize usable cache. The legacy CPU offload path can't give us that, because of PyTorch's pinned caching allocator. It allocates the host pool as one torch.zeros(..., pin_memory=True) per KV cache tensor (vllm/v1/kv_offload/cpu/gpu_worker.py), and that allocator rounds every allocation up to the next power of two (roundSize = c10::llvm::PowerOf2Ceil(size) in aten/src/ATen/core/CachingHostAllocator.h; see also pytorch/pytorch#150517). Since the rounding is per tensor, the overshoot compounds. For our gemma-4 setup the pool is split into 5 CPU tensors: budgeting 60 GB gives ~12 GB per tensor, and PowerOf2Ceil(12 GB) = 16 GB, so the host actually allocates 5 × 16 = 80 GB. In the worst case (each tensor just above a power-of-two boundary) this approaches a 2x blow-up — under a fixed memory ceiling that either wastes a large fraction of the cache budget or pushes us into OOM. SimpleCPUOffloadConnector avoids this by construction: it allocates plain (unpinned) memory and then pins it with cudaHostRegister, bypassing the caching host allocator entirely, so the pool is exactly the size we configure (55 GB stays 55 GB, not 64 GB). For a tight memory budget that exact sizing is a concrete operational reason we rely on this connector — and why the correctness fix in this PR matters for us. |
ivanium
left a comment
There was a problem hiding this comment.
Thanks for the fix! Overall good; left a few comments
Reuse a single compute-done event across steps instead of allocating one per step, and move the load/store ordering rationale from inline comments into the get_finished docstring. Keep srcAccessOrder=STREAM for GPU->CPU stores (the source is the live KV cache; ANY may prefetch the source before the wait_event barrier fires), matching what OffloadingConnector landed in vllm-project#39306. Refresh the stale BatchMemcpyParams comment and reference the root issue vllm-project#45704. Signed-off-by: Saddss <28726669061@qq.com>
ivanium
left a comment
There was a problem hiding this comment.
Thanks for the effort! LGTM in general and I'd approve. I do suggest making another pass of comments to be more concise because agents tend to comment too much. I commented on two major ones but some other comment blocks worth double check too. Otherwise all good.
Drop the duplicated srcAccessOrder rationale (kept only on the enum definition), remove the copy-loop gate comment, and condense the store event and STREAM/ANY notes. No functional change. Signed-off-by: Saddss <28726669061@qq.com>
The STREAM-for-stores/ANY-for-loads mapping was restated in the params struct and build_params on top of the call site. Keep the per-direction rationale only at the decision point (copy_backend) and the semantics on the enum; drop the two restatements. No functional change. Signed-off-by: Saddss <28726669061@qq.com>
You're right, and thanks for calling it out. To be transparent about our workflow: we scope the problem and decide the fix direction, then have an agent drive the actual implementation and testing. It moves fast, but over-commenting is one of the tells, as you spotted. Did another pass: removed the gate comment in the copy loop, de-duplicated the STREAM/ANY rationale so it lives in a single place (semantics on the enum, the per-direction choice at the call site), and tightened the rest. Going forward I'll keep comments lean from the start instead of leaning on a review round to trim them. Appreciate the patient review. |
No worries and thanks for the nice fix! I have enabled CI and it should be merged once CI passes which I don't anticipate any issues |
There has only two failures, and both look unrelated to this PR: |
CI is all green on. Could you help get it merged?Thanks😊 |
Order GPU->CPU stores after a compute-done event and use STREAM srcAccessOrder for stores (ANY for loads). Fixes silent KV corruption under concurrent load reported in vllm-project#45704 (upstream vllm-project#46278).
…ce (vllm-project#46278) Signed-off-by: Qiang Li <qiang.li2@amd.com>
Wire SimpleCPUOffloadConnector into the offloading stats / labeled-metrics API so its CPU KV transfers report through the shared vllm:kv_offload_* family (store/load bytes/time/size), plus per-connector CPU-pool gauges (vllm:simple_cpu_offload_*: total/free/used blocks, usage, pending loads/stores). Transfer timing brackets only the DMA: host-side batch-argument preparation is split out of copy_blocks() into prepare_copy()/launch_prepared_copy() and runs before the start CUDA event, and the vllm-project#46278 compute-done wait stays ordered before it, so kv_offload_*_time excludes host enqueue and compute-wait latency. Timing events are recycled via a small pool. Supersedes vllm-project#41790 (built on the pre-vllm-project#45957 metrics API); preserves vllm-project#43877 (scheduler post-completion stats aggregation) and vllm-project#46278 (GPU->CPU store barrier + srcAccessOrder). Tests: tests/v1/simple_kv_offload/test_metrics.py covers stats helpers/reduce, prom observe() end-to-end, serialized round-trip, scheduler pool gauges, worker transfer recording + reset-on-read, and event-pool reuse (no GPU required). AI assistance was used; every line was human-reviewed. Co-authored-by: OCWC22 <OCWC22@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Change72 <cguo51@asu.edu>
Wire SimpleCPUOffloadConnector into the offloading stats / labeled-metrics API so its CPU KV transfers report through the shared vllm:kv_offload_* family (store/load bytes/time/size), plus per-connector CPU-pool gauges (vllm:simple_cpu_offload_*: total/free/used blocks, usage, pending loads/stores). Transfer timing brackets only the DMA: host-side batch-argument preparation is split out of copy_blocks() into prepare_copy()/launch_prepared_copy() and runs before the start CUDA event; the vllm-project#46278 compute-done wait is enqueued before the prepare so it captures the shared compute event's current state and stays outside the timing bracket. So kv_offload_*_time excludes host enqueue and compute-wait latency. Timing events are recycled via a pool. Supersedes vllm-project#41790 (built on the pre-vllm-project#45957 metrics API); preserves vllm-project#43877 (scheduler post-completion stats aggregation) and vllm-project#46278 (GPU->CPU store barrier + srcAccessOrder). Tests: tests/v1/simple_kv_offload/test_metrics.py covers stats build/reduce, prom observe() end-to-end (transfer + pool gauges), scheduler pool gauges, worker transfer recording + reset-on-read, and event-pool reuse (no GPU required); tests/v1/simple_kv_offload/test_worker.py store-ordering still passes on GPU. AI assistance was used. Co-authored-by: OCWC22 <OCWC22@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Change72 <cguo51@asu.edu>
Wire SimpleCPUOffloadConnector into the offloading stats / labeled-metrics API so its CPU KV transfers report through the shared vllm:kv_offload_* family (store/load bytes/time/size), plus per-connector CPU-pool gauges (vllm:simple_cpu_offload_*: total/free/used blocks, usage, pending loads/stores). Transfer timing brackets only the DMA: host-side batch-argument preparation is split out of copy_blocks() into prepare_copy()/launch_prepared_copy() and runs before the start CUDA event; the vllm-project#46278 compute-done wait is enqueued before the prepare so it captures the shared compute event's current state and stays outside the timing bracket. So kv_offload_*_time excludes host enqueue and compute-wait latency. Timing events are recycled via a pool. Supersedes vllm-project#41790 (built on the pre-vllm-project#45957 metrics API); preserves vllm-project#43877 (scheduler post-completion stats aggregation) and vllm-project#46278 (GPU->CPU store barrier + srcAccessOrder). Tests: tests/v1/simple_kv_offload/test_metrics.py covers stats build/reduce, prom observe() end-to-end (transfer + pool gauges), scheduler pool gauges, worker transfer recording + reset-on-read, and event-pool reuse (no GPU required); tests/v1/simple_kv_offload/test_worker.py store-ordering still passes on GPU. AI assistance was used. Co-authored-by: OCWC22 <OCWC22@users.noreply.github.com> Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Change72 <cguo51@asu.edu>
Purpose
Fixes #45704.
SimpleCPUOffloadConnectordefers both loads and stores toget_finished(), which runs after model execution. It assumes that blocks whose tokens are host-side "confirmed" by the scheduler are also safe to read on the device, and therefore launches the GPU->CPU store with no cross-stream synchronization.That assumption does not hold under v1 overlapped execution. The host runs ahead of the GPU, so the KV-write kernels for the blocks being stored may still be in flight on the compute stream when the store is launched on a separate copy stream. The copy backend additionally uses
srcAccessOrder=ANYfor every transfer, which allows the DMA engine to read the source before prior writes to it have completed.The combination lets the store read partially written or stale KV blocks and silently corrupt the CPU cache. There is no error or crash; the only symptom is garbled output on later requests that hit the corrupted CPU blocks.
This is the same class of hazard that #31341 fixed for
OffloadingConnector;SimpleCPUOffloadConnectorreintroduced it because it dropped the cross-stream barrier.Fix
get_finished()and have the copy backend wait on it before issuing the GPU->CPU store, so the store is ordered after the kernels that wrote those KV blocks.srcAccessOrder=STREAMfor stores (the source is the live KV cache) and keepsrcAccessOrder=ANYfor loads (the source is stable pinned host memory).The fix preserves the connector's existing design (precise
cudaHostRegisterpinning, batched DMA, background copy thread) and only adds the missing happens-before edge for stores.Test Plan
tests/v1/simple_kv_offload/test_worker.py:test_store_orders_after_compute_write: drivesDmaCopyBackenddirectly and usestorch.cuda._sleepto deterministically delay the KV write. Asserts both directions so it is self-validating: the no-barrier control must corrupt (proving the race window is exercised) and the fixed path must be clean.test_get_finished_passes_wait_event_for_store_only: assertsget_finished()passes a compute-done event for stores andNonefor loads.test_build_params_src_access_order: assertsbuild_paramsdefaults toANYand honors an explicitSTREAMoverride.pytest tests/v1/simple_kv_offload/test_worker.py -vTest Result
3 passedon the patched build. The same test file fails to even import against an unpatched build (the new symbols do not exist), confirming it covers the change.