[DSv4] Pack KV caches into contiguous per-block allocations for DeepSeek V4 by tlrmchlsmth · Pull Request #44577 · vllm-project/vllm

tlrmchlsmth · 2026-06-04T20:37:18Z

For DeepSeek V4, pack all layer data contiguously per block so that KV connectors can send/receive one region per block.

Full-attention MLA + SWA/compressor caches share one contiguous allocation per block. Each layer gets an as_strided view with storage_offset into the packed backing tensor.
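The offset arithmetic behind this layout can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper names `pack_layout` and `page_address` and the layer names are hypothetical, and assigning one page of each size quoted in the commit message to one layer is an assumption made only to get concrete numbers.

```python
from itertools import accumulate


def pack_layout(page_sizes: dict[str, int]) -> tuple[dict[str, int], int]:
    """Return (byte offset of each layer inside a block, block stride in bytes)."""
    names = list(page_sizes)
    sizes = [page_sizes[name] for name in names]
    # Prefix sums: starts[i] is where layer i begins, starts[-1] is the total.
    starts = [0, *accumulate(sizes)]
    offsets = dict(zip(names, starts))  # zip stops before the trailing total
    return offsets, starts[-1]


def page_address(base: int, block_id: int, layer: str,
                 offsets: dict[str, int], block_stride: int) -> int:
    """Byte address of one layer's page for one block in the packed buffer."""
    return base + block_id * block_stride + offsets[layer]


# Hypothetical three-layer model; real DeepSeek V4 layer counts are not assumed.
page_sizes = {"layer_a": 37440, "layer_b": 8640, "layer_c": 1728}
offsets, block_stride = pack_layout(page_sizes)
addr = page_address(0, 2, "layer_b", offsets, block_stride)
```

Every layer shares the same `block_stride`, so a block's data across all layers occupies one contiguous span of `block_stride` bytes.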

The packed backing tensor and block_stride are passed through the cross-layer KV cache registration API so NIXL registers one region with block_len=block_stride instead of many separate regions.

Previously: 92 NIXL regions, 92 small P2P transfers per block (~16 KB each).
Now: 1 region, 1 large RDMA transfer per block (~1.48 MB).
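To make the before/after difference concrete, here is a hedged sketch of the (address, length) pairs a transfer would need for one block. The functions and base addresses are hypothetical and do not use the NIXL API; the toy model has three layers rather than 92.

```python
def per_layer_descriptors(layer_bases: dict[str, int],
                          page_sizes: dict[str, int],
                          block_id: int) -> list[tuple[int, int]]:
    """Separate allocation per layer: one (address, length) pair per layer."""
    return [
        (layer_bases[name] + block_id * size, size)
        for name, size in page_sizes.items()
    ]


def packed_descriptor(base: int, block_stride: int,
                      block_id: int) -> list[tuple[int, int]]:
    """Packed allocation: a single (address, length) pair covers every layer."""
    return [(base + block_id * block_stride, block_stride)]


# Hypothetical layers and base addresses, chosen only for illustration.
page_sizes = {"layer_a": 37440, "layer_b": 8640, "layer_c": 1728}
layer_bases = {"layer_a": 0x1000_0000, "layer_b": 0x2000_0000, "layer_c": 0x3000_0000}
block_stride = sum(page_sizes.values())

before = per_layer_descriptors(layer_bases, page_sizes, block_id=5)
after = packed_descriptor(0x4000_0000, block_stride, block_id=5)
```

Both variants move the same number of bytes per block; only the number of descriptors changes.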

…eek V4

DeepSeek V4 uses mixed page sizes across layers (37440 B, 8640 B, 1728 B), which previously resulted in separate GPU allocations per layer. This created ~92 distinct memory regions that NIXL had to register for P2P KV transfer.

Pack all DSv4 KV cache layers into a single contiguous backing tensor with per-layer views at computed offsets. This reduces the number of P2P memory regions from ~92 to ~31, improving KV transfer setup time and reducing RDMA registration overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Interleave all DSv4 KV cache layers per block so each block's data across all layers is contiguous in memory. Attention kernels use torch.as_strided views to see per-layer slices. NIXL registers the packed allocation as 1 region instead of ~31, reducing scatter/gather entries per block transfer from 31 to 1.

Changes:
- KVCacheTensor: backing_size → block_stride (intra-block offset + stride)
- _get_kv_cache_config_deepseek_v4: interleaved layout with block_stride
- _allocate_kv_cache / _reshape_kv_cache: strided views via as_strided
- register_cross_layers_kv_cache: add block_stride param
- NIXL worker: _register_packed_kv_cache for single-region registration
- MRv2 model runner + kv_connector: pass packed backing through

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
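The index arithmetic behind a per-layer strided view over an interleaved buffer can be sketched in plain Python. This is a toy illustration under stated assumptions: `strided_view` is a hypothetical helper that copies values into nested lists, whereas `torch.as_strided(input, size, stride, storage_offset)` returns a view that aliases the backing storage. The page sizes (4 and 2 bytes) and the block count (3) are made up.

```python
def strided_view(backing: bytearray, shape: tuple[int, int],
                 strides: tuple[int, int], storage_offset: int) -> list[list[int]]:
    """Read a 2-D strided window: element (r, c) lives at
    storage_offset + r * strides[0] + c * strides[1]."""
    rows, cols = shape
    return [
        [backing[storage_offset + r * strides[0] + c * strides[1]] for c in range(cols)]
        for r in range(rows)
    ]


# 3 blocks, each holding a 4-byte page for layer 0 followed by a 2-byte page
# for layer 1, so block_stride = 6. Byte i of the buffer stores the value i.
block_stride = 6
backing = bytearray(range(3 * block_stride))

# Both layers step by block_stride between blocks; only the offset differs.
layer0 = strided_view(backing, shape=(3, 4), strides=(block_stride, 1), storage_offset=0)
layer1 = strided_view(backing, shape=(3, 2), strides=(block_stride, 1), storage_offset=4)
```

Each row of `layer0` and `layer1` is one block's page for that layer, while the bytes of any single block stay adjacent in `backing`.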

mergify · 2026-06-07T15:08:52Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

tlrmchlsmth · 2026-06-18T13:33:45Z

@ivanium could you take a look at the latest revision? @MatthewBonanni and I were looking yesterday, and my understanding is we were using not only the same backing tensor for the KV cache, but for each individual block as well. So the SWA state was growing linearly with the seq len.

~~I was trying to fix it in 64b7c89 but obviously botched~~

edit: We're looking into reverting back to dd9063b now

mergify · 2026-06-18T13:56:39Z

Hi @tlrmchlsmth, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

MatthewBonanni · 2026-06-18T14:34:12Z

-    # buckets = {page_size: [[layer_names], [layer_names], ...]}
-    buckets = _bucket_layers_by_page_size(kv_cache_groups)
-    total_num_bytes_per_block = sum(ps * len(slots) for ps, slots in buckets.items())
+    full_mla_spec = kv_cache_groups[0].kv_cache_spec


This means page_sizes only has the sizes from group 0. Shouldn't we consider all groups?
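The concern can be illustrated with a small sketch. The `Group` dataclass and both functions are hypothetical stand-ins, not vLLM's types, and the sketch assumes every layer owns its own page in each block (no sharing of pages across groups).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Group:
    """Hypothetical stand-in for a KV cache group: its layers and page size."""
    layer_names: list[str]
    page_size: int


def bytes_per_block_all_groups(groups: list[Group]) -> int:
    """Bucket layers by page size across every group, then sum the bytes."""
    buckets: dict[int, list[str]] = defaultdict(list)
    for group in groups:
        buckets[group.page_size].extend(group.layer_names)
    return sum(size * len(names) for size, names in buckets.items())


def bytes_per_block_group0_only(groups: list[Group]) -> int:
    """What you get if only the first group's spec is consulted."""
    first = groups[0]
    return first.page_size * len(first.layer_names)


groups = [
    Group(["a0", "a1"], 37440),
    Group(["b0"], 8640),
    Group(["c0", "c1", "c2"], 1728),
]
total = bytes_per_block_all_groups(groups)
partial = bytes_per_block_group0_only(groups)
```

With mixed page sizes the two results differ, which is why reading only `kv_cache_groups[0]` prompts the question above.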

MatthewBonanni · 2026-06-18T14:35:31Z


                    raw_tensor = kv_cache_raw_tensors[layer_name].view(dtype)
-                    if kv_cache_spec.page_size_padded is not None:
+                    if packing is not None:


This logic is also used in https://github.com/vllm-project/vllm/pull/44577/changes#diff-9b864c13232e1f03b906ccc83311fa78d1c37988616ae88b4afdbb5c0d186a75R257, a helper would be good

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

LucasWilkinson · 2026-06-18T17:59:33Z

+                        layer_offset, blk_stride = packing
+                        dtype_size = get_dtype_size(dtype)
+                        page_stride = blk_stride // dtype_size
+                        strides = list(torch.empty(kv_cache_shape).stride())


maybe a meta tensor to avoid a real-allocation?

Alternative tlrmchlsmth#35
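For context, the suggestion is to obtain the contiguous strides without allocating real memory, for example via `torch.empty(kv_cache_shape, device="meta").stride()`. The same numbers can also be computed with no tensor at all, as in this sketch. `contiguous_strides` and `packed_strides` are hypothetical helpers, and replacing the leading (block) stride with `page_stride` is an assumption about how the packed view is built, not something shown in the diff excerpt.

```python
def contiguous_strides(shape: tuple[int, ...]) -> tuple[int, ...]:
    """Row-major (C-contiguous) strides in elements, without allocating a tensor."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= max(dim, 1)  # size-0 dims are treated as size 1
    return tuple(reversed(strides))


def packed_strides(shape: tuple[int, ...], blk_stride_bytes: int,
                   dtype_size: int) -> tuple[int, ...]:
    """Assumed packing rule: keep the inner strides, but step a whole packed
    block (in elements) along the leading block dimension."""
    page_stride = blk_stride_bytes // dtype_size
    inner = contiguous_strides(shape)[1:]
    return (page_stride, *inner)


plain = contiguous_strides((4, 16, 8))
packed = packed_strides((4, 16, 8), blk_stride_bytes=1024, dtype_size=2)
```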

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson

LGTM, thanks for doing this!

- Add cache_config.num_gpu_blocks_override to test mock (was only setting scheduler_config, causing MagicMock to corrupt num_blocks)
- Guard packed KV cache detection against non-tensor values in register_kv_caches (hybrid SSM models pass lists)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
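The first fix reflects a general property of `unittest.mock.MagicMock`: any attribute that was never set is auto-created as another `MagicMock`, which is not `None`. The sketch below shows how that can leak into a block count. `resolve_num_blocks` is a hypothetical function written for this illustration, not the code under test.

```python
from unittest.mock import MagicMock


def resolve_num_blocks(config, computed: int):
    """Hypothetical consumer: prefer the override when one is configured."""
    override = config.cache_config.num_gpu_blocks_override
    return override if override is not None else computed


# Mock with the attribute left unset: the override is an auto-created
# MagicMock, so it wins over the computed value.
bare = MagicMock()
leaky = resolve_num_blocks(bare, 128)

# Mock with the attribute set explicitly: the computed value is used.
fixed = MagicMock()
fixed.cache_config.num_gpu_blocks_override = None
ok = resolve_num_blocks(fixed, 128)
```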

LucasWilkinson · 2026-06-18T21:55:18Z

-                self.device_kv_caches = kv_caches
-                return
+            first_val = next(iter(kv_caches.values()))
+            if isinstance(first_val, torch.Tensor):


why is this if needed? seems like a violation of the type signature kv_caches: dict[str, torch.Tensor]

looking - I'm a little suspicious as well

_reshape_kv_cache_tensors is lying about its return type here?

vllm/vllm/v1/worker/gpu_model_runner.py

Line 7215 in 9bebb92

kv_caches[layer_name] = state_tensors

It's been that way for a while: #19327

mergify · 2026-06-18T22:51:29Z

Hi @tlrmchlsmth, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Address review feedback: the isinstance(first_val, torch.Tensor) check was a type-safety violation. Use the existing _has_mamba flag instead, which correctly skips packed detection for hybrid SSM models whose kv_caches values are lists of tensors.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

…eek V4 (vllm-project#44577) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: OpenAI Codex <codex@openai.com>

…eek V4 (vllm-project#44577) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

mergify Bot added deepseek Related to DeepSeek models v1 kv-connector labels Jun 4, 2026

tlrmchlsmth force-pushed the cross_layer_dsv4 branch 2 times, most recently from 728002d to 4e68ead Compare June 5, 2026 18:59

mergify Bot added ci/build nvidia labels Jun 5, 2026

github-project-automation Bot added this to NVIDIA Jun 5, 2026

tlrmchlsmth force-pushed the cross_layer_dsv4 branch 3 times, most recently from e92a466 to 83245d3 Compare June 5, 2026 20:07

tlrmchlsmth force-pushed the cross_layer_dsv4 branch from 83245d3 to 0a55d7e Compare June 5, 2026 20:21

tlrmchlsmth marked this pull request as ready for review June 5, 2026 21:14

tlrmchlsmth requested review from ApostaC, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat and ywang96 as code owners June 5, 2026 21:14

claude Bot reviewed Jun 5, 2026

LucasWilkinson reviewed Jun 6, 2026

Comment thread vllm/v1/worker/gpu_model_runner.py Outdated

Comment thread vllm/v1/worker/gpu_model_runner.py Outdated

Comment thread vllm/v1/worker/gpu/attn_utils.py Outdated

tlrmchlsmth force-pushed the cross_layer_dsv4 branch from 7368635 to ff89a84 Compare June 6, 2026 19:38

tlrmchlsmth requested review from NickLucche and xuechendi as code owners June 6, 2026 19:38

tlrmchlsmth force-pushed the cross_layer_dsv4 branch from a435b71 to 4f0f271 Compare June 6, 2026 20:49

MatthewBonanni reviewed Jun 18, 2026

ZhanqiuHu mentioned this pull request Jun 18, 2026

[RFC]: Kernel-agnostic constant-stride (layer-major) KV cache layout for KV connectors #45997

Open

tlrmchlsmth added 2 commits June 18, 2026 12:17

Merge remote-tracking branch 'origin/main' into cross_layer_dsv4

6111c4b

remove test

25e44d6

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

tlrmchlsmth force-pushed the cross_layer_dsv4 branch from 4ff2195 to 25e44d6 Compare June 18, 2026 16:19

LucasWilkinson reviewed Jun 18, 2026

LucasWilkinson mentioned this pull request Jun 18, 2026

Simplify packed KV cache handling tlrmchlsmth/vllm#35

Merged

Simplify packed KV cache handling (#35)

9bebb92

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson approved these changes Jun 18, 2026

github-project-automation Bot moved this to Ready in NVIDIA Jun 18, 2026

LucasWilkinson reviewed Jun 18, 2026

tlrmchlsmth force-pushed the cross_layer_dsv4 branch from 60aaa7f to 2d2cd3b Compare June 18, 2026 23:10

Fix contiguous KV packing view test

56c4718

Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

tlrmchlsmth merged commit 0119213 into vllm-project:main Jun 19, 2026
96 checks passed

github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 19, 2026

LucasWilkinson mentioned this pull request Jun 20, 2026

[KV Offload] Support packed HMA KV cache layout #46205

Merged

tjtanaa mentioned this pull request Jun 20, 2026

[ROCm] [Bugfix] Bugfix ROCm Sparse Indexer #46222

Merged

4 tasks

majian4work mentioned this pull request Jun 25, 2026

[XPU][Bugfix] Disable packed KV cache allocation on XPU for DeepSeek-V4 #46681

Closed

varun-sundar-rabindranath mentioned this pull request Jun 27, 2026

[KV-Offloading] Fix tensors_per_block stride #46888

Merged

majunze2001 mentioned this pull request Jul 3, 2026

[Bugfix] DSV4 TP16 garbage output #47493

Draft


tlrmchlsmth merged 15 commits into vllm-project:main from tlrmchlsmth:cross_layer_dsv4


5 participants
