[DSv4] Pack KV caches into contiguous per-block allocations for DeepSeek V4 #44577

Merged: tlrmchlsmth merged 15 commits into vllm-project:main from tlrmchlsmth:cross_layer_dsv4 on Jun 19, 2026
@tlrmchlsmth

Member

For DeepSeek V4, pack all layer data contiguously per block so that KV connectors can send/receive one region per block.

Full-attention MLA + SWA/compressor caches share one contiguous allocation per block. Each layer gets an as_strided view with storage_offset into the packed backing tensor.

The packed backing tensor and block_stride are passed through the cross-layer KV cache registration API so NIXL registers one region with block_len=block_stride instead of many separate regions.

Previously: 92 NIXL regions, 92 tiny P2P transfers per block (~16KB).
Now: 1 region, 1 large RDMA transfer per block (~1.48MB).
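To make the per-block packing concrete, here is a minimal sketch of the offset arithmetic, not the code in this PR. The names `PackedLayout`, `pack_layers`, and `page_address` and the layer names are hypothetical. The three page sizes are the ones quoted for DeepSeek V4 in this PR; using one layer per size and no alignment padding is an assumption made to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PackedLayout:
    layer_offsets: dict[str, int]  # byte offset of each layer inside one block
    block_stride: int              # bytes from the start of one block to the next


def pack_layers(page_sizes: dict[str, int]) -> PackedLayout:
    """Lay every layer's page back to back inside a block (no padding assumed)."""
    offsets: dict[str, int] = {}
    cursor = 0
    for name, size in page_sizes.items():
        offsets[name] = cursor
        cursor += size
    return PackedLayout(layer_offsets=offsets, block_stride=cursor)


def page_address(layout: PackedLayout, block_id: int, layer: str) -> int:
    """Byte offset of one layer's page for one block in the packed backing buffer."""
    return block_id * layout.block_stride + layout.layer_offsets[layer]


# Hypothetical layer names; one layer per page size is an assumption.
page_sizes = {"mla.0": 37440, "swa.0": 8640, "compressor.0": 1728}
layout = pack_layers(page_sizes)

# Unpacked: one region (and one transfer) per layer for each block.
regions_before = len(page_sizes)
# Packed: a single region whose block length equals block_stride.
regions_after = 1
swa_addr = page_address(layout, block_id=3, layer="swa.0")
```

With this layout a connector only needs the base address and `block_stride` to locate a whole block, while each layer still finds its own page through its fixed offset.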

@mergify mergify Bot added the deepseek (Related to DeepSeek models), v1, and kv-connector labels on Jun 4, 2026
@tlrmchlsmth tlrmchlsmth force-pushed the cross_layer_dsv4 branch 2 times, most recently from 728002d to 4e68ead Compare June 5, 2026 18:59
@tlrmchlsmth tlrmchlsmth force-pushed the cross_layer_dsv4 branch 3 times, most recently from e92a466 to 83245d3 Compare June 5, 2026 20:07
…eek V4

DeepSeek V4 uses mixed page sizes across layers (37440B, 8640B, 1728B),
which previously resulted in separate GPU allocations per layer. This
created ~92 distinct memory regions that NIXL had to register for P2P
KV transfer.

Pack all DSv4 KV cache layers into a single contiguous backing tensor
with per-layer views at computed offsets. This reduces the number of
P2P memory regions from ~92 to ~31, improving KV transfer setup time
and reducing RDMA registration overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Comment thread vllm/v1/worker/gpu_model_runner.py Outdated
Comment thread vllm/v1/worker/gpu_model_runner.py Outdated
Comment thread vllm/v1/worker/gpu/attn_utils.py Outdated
Interleave all DSv4 KV cache layers per block so each block's data
across all layers is contiguous in memory. Attention kernels use
torch.as_strided views to see per-layer slices. NIXL registers the
packed allocation as 1 region instead of ~31, reducing scatter/gather
entries per block transfer from 31 to 1.

Changes:
- KVCacheTensor: backing_size → block_stride (intra-block offset + stride)
- _get_kv_cache_config_deepseek_v4: interleaved layout with block_stride
- _allocate_kv_cache / _reshape_kv_cache: strided views via as_strided
- register_cross_layers_kv_cache: add block_stride param
- NIXL worker: _register_packed_kv_cache for single-region registration
- MRv2 model runner + kv_connector: pass packed backing through

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
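The interleaved layout described in this commit can be illustrated with a small PyTorch sketch. It is an illustration under assumptions, not the code from the PR: the sizes and layer names are toy values, and the buffer is kept as raw `uint8` bytes, whereas the real caches are viewed with their own dtypes and shapes.

```python
import torch

# Hypothetical toy sizes, not DeepSeek V4's real ones.
num_blocks = 4
layer_page_bytes = {"layer0": 32, "layer1": 16, "layer2": 16}

# Each block stores every layer's page back to back.
block_stride = sum(layer_page_bytes.values())  # bytes per packed block
backing = torch.zeros(num_blocks * block_stride, dtype=torch.uint8)

views = {}
offset = 0
for name, page_bytes in layer_page_bytes.items():
    # Shape (num_blocks, page_bytes): moving to the next block jumps
    # block_stride bytes, and storage_offset selects this layer's slice.
    views[name] = torch.as_strided(
        backing,
        size=(num_blocks, page_bytes),
        stride=(block_stride, 1),
        storage_offset=offset,
    )
    offset += page_bytes

# Writing through a layer view lands inside that layer's slice of the block.
views["layer1"][2].fill_(7)
start = 2 * block_stride + layer_page_bytes["layer0"]
assert backing[start:start + 16].eq(7).all()
```

Because every view aliases the same storage, registering `backing` once covers all layers, and one block occupies a single contiguous range of `block_stride` bytes.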
@mergify

mergify Bot commented Jun 7, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @tlrmchlsmth.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@tlrmchlsmth

tlrmchlsmth commented Jun 18, 2026

Member Author

@ivanium could you take a look at the latest revision? @MatthewBonanni and I were looking at this yesterday, and my understanding is that we were using the same backing tensor not only for the KV cache as a whole but for each individual block as well, so the SWA state was growing linearly with the sequence length.

I was trying to fix it in 64b7c89 but obviously botched it.

Edit: we're looking into reverting to dd9063b now.

@mergify

mergify Bot commented Jun 18, 2026

Contributor

Hi @tlrmchlsmth, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Comment thread vllm/v1/core/kv_cache_utils.py Outdated
# buckets = {page_size: [[layer_names], [layer_names], ...]}
buckets = _bucket_layers_by_page_size(kv_cache_groups)
total_num_bytes_per_block = sum(ps * len(slots) for ps, slots in buckets.items())
full_mla_spec = kv_cache_groups[0].kv_cache_spec

Member

This means page_sizes only has the sizes from group 0. Shouldn't we consider all groups?
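For readers following this thread, the sketch below shows one way the per-block byte total could be computed when every group contributes to the buckets. The function `bucket_layers_by_page_size` here is a hypothetical stand-in, not the PR's `_bucket_layers_by_page_size`, and the `(page_size, layer_names)` group format, the layer names, and the rule that layers in the same slot share one page are all assumptions.

```python
from collections import defaultdict


def bucket_layers_by_page_size(groups):
    """Hypothetical stand-in: groups is a list of (page_size_bytes, [layer_names]).

    Returns {page_size: [[layer_names], ...]}, where the i-th layer of every
    group with the same page size is placed in the i-th slot.
    """
    buckets = defaultdict(list)
    for page_size, layer_names in groups:
        slots = buckets[page_size]
        for i, name in enumerate(layer_names):
            if i == len(slots):
                slots.append([])  # open a new slot when this group is longer
            slots[i].append(name)
    return dict(buckets)


# Every group is visited, not only the first one.
groups = [
    (37440, ["mla.0", "mla.1"]),
    (8640, ["swa.0", "swa.1", "swa.2"]),
    (8640, ["swa.3", "swa.4"]),
    (1728, ["cmp.0"]),
]
buckets = bucket_layers_by_page_size(groups)
total_num_bytes_per_block = sum(ps * len(slots) for ps, slots in buckets.items())
```

If only the first group were bucketed, the 8640-byte and 1728-byte pages would be missing from the total, which is the situation the question above is probing.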


raw_tensor = kv_cache_raw_tensors[layer_name].view(dtype)
if kv_cache_spec.page_size_padded is not None:
if packing is not None:

Member

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Comment thread vllm/v1/worker/gpu_model_runner.py Outdated
layer_offset, blk_stride = packing
dtype_size = get_dtype_size(dtype)
page_stride = blk_stride // dtype_size
strides = list(torch.empty(kv_cache_shape).stride())

Collaborator

Maybe use a meta tensor here to avoid a real allocation?
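A minimal sketch of that suggestion, assuming the goal is only to read the contiguous strides for a given shape. The shape below is a hypothetical placeholder, not a real DeepSeek V4 cache shape.

```python
import torch

# Hypothetical (num_blocks, block_size, head_dim) shape for illustration.
kv_cache_shape = (1024, 64, 576)

# A meta tensor carries shape, dtype, and stride metadata but owns no storage,
# so nothing of this size is actually allocated.
meta = torch.empty(kv_cache_shape, device="meta")
strides = list(meta.stride())
```

The resulting `strides` list matches what a real contiguous tensor of the same shape would report, so it can be adjusted afterwards in the same way.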

Collaborator

Alternative: tlrmchlsmth#35

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

@LucasWilkinson LucasWilkinson left a comment

Collaborator

LGTM, thanks for doing this!

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA Jun 18, 2026
- Add cache_config.num_gpu_blocks_override to test mock (was only
  setting scheduler_config, causing MagicMock to corrupt num_blocks)
- Guard packed KV cache detection against non-tensor values in
  register_kv_caches (hybrid SSM models pass lists)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
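The first fix in this commit addresses a common `MagicMock` pitfall, sketched below with the standard library only. `resolve_num_blocks` is a hypothetical helper that mimics an "override if set" pattern; it is not a function from vLLM, and the attribute values are made up for the example.

```python
from unittest.mock import MagicMock


def resolve_num_blocks(vllm_config, computed_num_blocks):
    """Hypothetical helper: use the override when one is configured."""
    override = vllm_config.cache_config.num_gpu_blocks_override
    return computed_num_blocks if override is None else override


# Only scheduler_config is set up, so cache_config.num_gpu_blocks_override is
# an auto-created MagicMock, which is not None and leaks into num_blocks.
broken = MagicMock()
broken.scheduler_config.max_num_seqs = 8
leaked = resolve_num_blocks(broken, 128)

# Setting the attribute explicitly restores the intended behavior.
fixed = MagicMock()
fixed.cache_config.num_gpu_blocks_override = None
ok = resolve_num_blocks(fixed, 128)
```

Any attribute a test does not set on a `MagicMock` comes back as another mock, so `is None` checks silently take the wrong branch unless the mock is configured explicitly.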
self.device_kv_caches = kv_caches
return
first_val = next(iter(kv_caches.values()))
if isinstance(first_val, torch.Tensor):

Collaborator

Why is this if-check needed? It seems like a violation of the type signature kv_caches: dict[str, torch.Tensor].

Member Author

looking - I'm a little suspicious as well

@tlrmchlsmth tlrmchlsmth Jun 18, 2026

Member Author

_reshape_kv_cache_tensors is lying about its return type here?

kv_caches[layer_name] = state_tensors

Member Author

It's been that way for a while: #19327

Address review feedback: the isinstance(first_val, torch.Tensor) check
was a type-safety violation. Use the existing _has_mamba flag instead,
which correctly skips packed detection for hybrid SSM models whose
kv_caches values are lists of tensors.

Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@tlrmchlsmth tlrmchlsmth merged commit 0119213 into vllm-project:main Jun 19, 2026
96 checks passed
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 19, 2026
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…eek V4 (vllm-project#44577)

Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <lwilkins@redhat.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
Signed-off-by: Qiang Li <qiang.li2@amd.com>
Labels

ci/build, deepseek (Related to DeepSeek models), kv-connector, nvidia, ready (ONLY add when PR is ready to merge/full CI is needed), v1

5 participants