[DSv4] Pack KV caches into contiguous per-block allocations for DeepSeek V4 #44577
…eek V4

DeepSeek V4 uses mixed page sizes across layers (37440B, 8640B, 1728B), which previously resulted in separate GPU allocations per layer. This created ~92 distinct memory regions that NIXL had to register for P2P KV transfer.

Pack all DSv4 KV cache layers into a single contiguous backing tensor with per-layer views at computed offsets. This reduces the number of P2P memory regions from ~92 to ~31, improving KV transfer setup time and reducing RDMA registration overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Interleave all DSv4 KV cache layers per block so each block's data across all layers is contiguous in memory. Attention kernels use torch.as_strided views to see per-layer slices. NIXL registers the packed allocation as 1 region instead of ~31, reducing scatter/gather entries per block transfer from 31 to 1.

Changes:
- KVCacheTensor: backing_size → block_stride (intra-block offset + stride)
- _get_kv_cache_config_deepseek_v4: interleaved layout with block_stride
- _allocate_kv_cache / _reshape_kv_cache: strided views via as_strided
- register_cross_layers_kv_cache: add block_stride param
- NIXL worker: _register_packed_kv_cache for single-region registration
- MRv2 model runner + kv_connector: pass packed backing through

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
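To make the interleaved layout concrete, here is a minimal sketch of how per-layer offsets and a block stride could be derived. It is not the code from this PR: `LayerSlot`, `plan_interleaved_layout`, `page_address`, and the layer names are hypothetical, only the three page sizes come from the commit message, and alignment or padding is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSlot:
    """Where one layer's page sits inside a packed block (hypothetical type)."""
    name: str
    page_size: int  # bytes this layer stores per block
    offset: int     # byte offset of that page within the block


def plan_interleaved_layout(page_sizes: dict[str, int]) -> tuple[list[LayerSlot], int]:
    """Place each layer's page back to back; return the slots and the block stride."""
    slots = []
    offset = 0
    for name, size in page_sizes.items():
        slots.append(LayerSlot(name, size, offset))
        offset += size
    # The running total is the number of bytes one block occupies across all layers.
    return slots, offset


def page_address(slot: LayerSlot, block_stride: int, block_id: int) -> int:
    """Byte address of a layer's page for a given block in the packed allocation."""
    return block_id * block_stride + slot.offset


# Hypothetical three-layer model using the page sizes quoted above.
page_sizes = {"layer_a": 37440, "layer_b": 8640, "layer_c": 1728}
slots, block_stride = plan_interleaved_layout(page_sizes)

for slot in slots:
    print(slot.name, slot.offset, page_address(slot, block_stride, block_id=2))
```

With this layout, everything a block needs lives in the byte range `[block_id * block_stride, (block_id + 1) * block_stride)`, which is what lets a connector describe a block with a single entry.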
edit: We're looking into reverting back to dd9063b now
Hi @tlrmchlsmth, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch. For future commits, …
# buckets = {page_size: [[layer_names], [layer_names], ...]}
buckets = _bucket_layers_by_page_size(kv_cache_groups)
total_num_bytes_per_block = sum(ps * len(slots) for ps, slots in buckets.items())
full_mla_spec = kv_cache_groups[0].kv_cache_spec
This means page_sizes only has the sizes from group 0. Shouldn't we consider all groups?
raw_tensor = kv_cache_raw_tensors[layer_name].view(dtype)
if kv_cache_spec.page_size_padded is not None:
if packing is not None:
This logic is also used in https://github.com/vllm-project/vllm/pull/44577/changes#diff-9b864c13232e1f03b906ccc83311fa78d1c37988616ae88b4afdbb5c0d186a75R257, a helper would be good
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
layer_offset, blk_stride = packing
dtype_size = get_dtype_size(dtype)
page_stride = blk_stride // dtype_size
strides = list(torch.empty(kv_cache_shape).stride())
Maybe use a meta tensor to avoid a real allocation?
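For context, a meta tensor (for example `torch.empty(kv_cache_shape, device="meta").stride()`) carries shape and stride metadata without backing storage. Another way to avoid the allocation is to compute the contiguous strides arithmetically. The sketch below does that in plain Python; `contiguous_strides` is a hypothetical helper and the shape is made up for illustration, neither is taken from this PR.

```python
def contiguous_strides(shape):
    """Row-major (C-contiguous) strides, in elements, for the given shape."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        # Treat zero-sized dims as size 1 so the remaining strides stay nonzero.
        running *= max(dim, 1)
    return tuple(reversed(strides))


# Hypothetical cache shape: (num_blocks, block_size, head_dim).
kv_cache_shape = (1024, 64, 584)
strides = list(contiguous_strides(kv_cache_shape))
print(strides)
```

Either approach yields only the stride metadata, so no memory is touched just to learn the layout.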
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
LucasWilkinson left a comment
LGTM, thanks for doing this!
- Add cache_config.num_gpu_blocks_override to test mock (was only setting scheduler_config, causing MagicMock to corrupt num_blocks)
- Guard packed KV cache detection against non-tensor values in register_kv_caches (hybrid SSM models pass lists)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
self.device_kv_caches = kv_caches
return
first_val = next(iter(kv_caches.values()))
if isinstance(first_val, torch.Tensor):
why is this if needed? seems like a violation of the type signature kv_caches: dict[str, torch.Tensor]
looking - I'm a little suspicious as well
_reshape_kv_cache_tensors is lying about its return type here?
vllm/vllm/v1/worker/gpu_model_runner.py
Line 7215 in 9bebb92
Address review feedback: the isinstance(first_val, torch.Tensor) check was a type-safety violation. Use the existing _has_mamba flag instead, which correctly skips packed detection for hybrid SSM models whose kv_caches values are lists of tensors.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
…eek V4 (vllm-project#44577) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: OpenAI Codex <codex@openai.com>
…eek V4 (vllm-project#44577) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
For DeepSeek V4, pack all layer data contiguously per block so that KV connectors can send/receive one region per block.
Full-attention MLA + SWA/compressor caches share one contiguous allocation per block. Each layer gets an as_strided view with storage_offset into the packed backing tensor.

The packed backing tensor and block_stride are passed through the cross-layer KV cache registration API so NIXL registers one region with block_len=block_stride instead of many separate regions.
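The per-layer view idea can be illustrated without PyTorch. The sketch below emulates a 2-D strided view over a flat `bytearray`, in the spirit of `torch.as_strided(input, size, stride, storage_offset)`. `StridedView`, the layer names, and the tiny page lengths are hypothetical and chosen only to keep the example readable; this is not the PR's implementation.

```python
class StridedView:
    """2-D window (block, element) over a flat buffer, similar in spirit to as_strided."""

    def __init__(self, backing, shape, strides, storage_offset):
        self.backing = backing
        self.shape = shape
        self.strides = strides
        self.storage_offset = storage_offset

    def _flat(self, block, i):
        n_blocks, page_len = self.shape
        if not (0 <= block < n_blocks and 0 <= i < page_len):
            raise IndexError((block, i))
        return self.storage_offset + block * self.strides[0] + i * self.strides[1]

    def __getitem__(self, idx):
        return self.backing[self._flat(*idx)]

    def __setitem__(self, idx, value):
        self.backing[self._flat(*idx)] = value


num_blocks = 3
page_lens = {"layer_a": 4, "layer_b": 2}  # hypothetical per-layer page lengths
block_stride = sum(page_lens.values())    # elements per packed block
backing = bytearray(num_blocks * block_stride)

# Every layer shares the same block stride and differs only in its storage offset.
views = {}
offset = 0
for name, page_len in page_lens.items():
    views[name] = StridedView(backing, (num_blocks, page_len), (block_stride, 1), offset)
    offset += page_len

views["layer_a"][1, 0] = 7
views["layer_b"][1, 1] = 9

# Block 1 of every layer is one contiguous slice of the backing buffer.
block1 = bytes(backing[block_stride:2 * block_stride])
print(list(block1))
```

Writes through either layer's view land in the same contiguous slice for block 1, which is the property a single-region registration relies on.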
Previously: 92 NIXL regions, 92 tiny P2P transfers per block (~16KB).
Now: 1 region, 1 large RDMA transfer per block (~1.48MB).
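As a rough sanity check on these figures, the sketch below counts transfer descriptors and the average transfer size before and after packing. `transfer_plan` is a hypothetical helper, the 100-block request is made up, and ~1.48MB is taken as 1,480,000 bytes, which is an assumption about the unit.

```python
def transfer_plan(num_blocks, regions_per_block, bytes_per_block):
    """Return (descriptor count, average bytes per transfer) for a KV transfer."""
    descriptors = num_blocks * regions_per_block
    avg_bytes = bytes_per_block / regions_per_block
    return descriptors, avg_bytes


BYTES_PER_BLOCK = 1_480_000  # assumption: ~1.48MB read as decimal megabytes
NUM_BLOCKS = 100             # hypothetical transfer size

before = transfer_plan(NUM_BLOCKS, regions_per_block=92, bytes_per_block=BYTES_PER_BLOCK)
after = transfer_plan(NUM_BLOCKS, regions_per_block=1, bytes_per_block=BYTES_PER_BLOCK)

print("before:", before[0], "descriptors, ~%d bytes each" % round(before[1]))
print("after: ", after[0], "descriptors, ~%d bytes each" % round(after[1]))
```

Under these assumptions the same payload moves in either case; the difference is the number of descriptors and the size of each transfer.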