[KV Offload] Gate packed HMA KV cache on cross-layer config by LucasWilkinson · Pull Request #46252 · vllm-project/vllm

LucasWilkinson · 2026-06-20T21:50:22Z

Summary

Use kv_connector_extra_config["enable_cross_layers_blocks"] to opt multi-group HMA layouts into packed KV allocation as laid out in https://docs.vllm.ai/en/stable/features/nixl_connector_usage/#cross-layers-blocks
Keep DeepSeek V4-style UniformTypeKVCacheSpecs layouts on the packed path by default.
Remove the extra VLLM_USE_PACKED_HMA_KV_CACHE environment flag.
Canonicalize packed KV caches in the offloading worker as one full-row tensor/ref per KV group.
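
As a rough illustration of the opt-in described above, the sketch below builds a KV transfer config whose kv_connector_extra_config sets enable_cross_layers_blocks. The connector name, the role value, and the build_kv_transfer_config helper are assumptions made for this example and are not taken from this PR; see the linked documentation for the authoritative form.

```python
import json


def build_kv_transfer_config(enable_cross_layers: bool) -> str:
    """Return a JSON string for a KV transfer config (hypothetical helper)."""
    config = {
        # Assumption: connector and role are illustrative, not from this PR.
        "kv_connector": "NixlConnector",
        "kv_role": "kv_both",
        "kv_connector_extra_config": {
            # The flag is passed as a string here; the gating code in this PR
            # compares the lowercased string form against "true".
            "enable_cross_layers_blocks": str(enable_cross_layers),
        },
    }
    return json.dumps(config)


kv_transfer_config_json = build_kv_transfer_config(True)
print(kv_transfer_config_json)
```

The resulting string is the kind of value that would be handed to the engine as its KV transfer configuration; how it is passed depends on the deployment.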

Tests

UCX_TLS=cuda_ipc,cuda_copy,sm,tcp,self UCX_NET_DEVICES=all .venv/bin/python -m pytest tests/v1/core/test_contiguous_kv_packing.py tests/v1/kv_connector/unit/offloading_connector/test_worker.py tests/v1/kv_connector/unit/test_nixl_simple_cpu_offload.py -q
- 20 passed, 21 warnings
.venv/bin/pre-commit run ruff-check --files vllm/v1/core/kv_cache_utils.py vllm/envs.py tests/v1/core/test_contiguous_kv_packing.py vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py
- passed

Benchmarks / runtime validation

DeepSeek V4 P/D (nvidia/DeepSeek-V4-Flash-NVFP4), run:

Packed KV registered on all prefill/decode ranks with num_regions=1.
Benchmark: 16/16 successful requests.
NIXL: 17 transfers, 141 descriptors, 505,284,480 bytes transferred.

GPT-OSS P/D (openai/gpt-oss-20b) with TRITON_ATTN and cross-layer packed config:

Packed KV registered on prefill/decode with num_regions=1.
Benchmark: 32/32 successful requests.
Perf: 3.72 req/s, 59.66 output tok/s, mean TTFT 372.02 ms, mean TPOT 32.81 ms, mean ITL 32.72 ms.
NIXL: 33 transfers, 279 descriptors, 438,829,056 bytes transferred.
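
For readers comparing the two runs, the NIXL counters above can be reduced to per-transfer averages. The sketch below only does arithmetic on the numbers reported in this description; the NixlRun class is a hypothetical container, and the derived averages are not figures reported by the PR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NixlRun:
    """Reported NIXL counters for one benchmark run (hypothetical container)."""

    name: str
    transfers: int
    descriptors: int
    bytes_transferred: int

    @property
    def bytes_per_transfer(self) -> float:
        return self.bytes_transferred / self.transfers

    @property
    def descriptors_per_transfer(self) -> float:
        return self.descriptors / self.transfers


# Counters copied from the runtime validation section above.
runs = [
    NixlRun("DeepSeek V4 P/D", 17, 141, 505_284_480),
    NixlRun("GPT-OSS P/D", 33, 279, 438_829_056),
]

for run in runs:
    mib = run.bytes_per_transfer / (1024 * 1024)
    print(
        f"{run.name}: {mib:.2f} MiB per transfer, "
        f"{run.descriptors_per_transfer:.2f} descriptors per transfer"
    )
```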

AI assistance

AI assistance was used to prepare this change. This has been reviewed by the submitter.

Assisted-by: OpenAI Codex Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

orozery · 2026-06-22T12:04:15Z

+    enable_cross_layers = (
+        str(extra_config.get("enable_cross_layers_blocks", "False")).lower() == "true"
    )
+    return is_dsv4 or (enable_cross_layers and len(kv_cache_groups) > 1)


Can we apply the packed layout for dense models as well, i.e. remove the len(kv_cache_groups) > 1 check?

LucasWilkinson · Jun 22, 2026

IIUC, cross-layer KV cache is already supported for non-hybrid models (please correct me if I'm wrong). I'd prefer to keep this PR limited in scope, since it's really just a temporary solution while we complete #42082.

orozery · Jun 22, 2026

Cross layers de facto no longer works, since more and more uniform models (e.g. llama, qwen) use MRV2, which does not have it.

LucasWilkinson · Jun 22, 2026

Happy to do this as a follow-up, but I'm hoping to get this in for 0.24, so I'd rather not increase the scope of this PR (it was just intended to be a cleanup fast follow for #46205).
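
The gating condition under discussion can be exercised in isolation. The sketch below wraps the two lines from the diff in a standalone function; the function name, its signature, and passing is_dsv4 as a parameter are assumptions for illustration, since the surrounding code is not shown in this thread.

```python
from typing import Any, Mapping, Sequence


def should_use_packed_layout(
    extra_config: Mapping[str, Any],
    kv_cache_groups: Sequence[Any],
    is_dsv4: bool,
) -> bool:
    """Hypothetical wrapper around the gating expression from the diff."""
    # Accepts a bool or a string such as "True"/"true"; anything else is off.
    enable_cross_layers = (
        str(extra_config.get("enable_cross_layers_blocks", "False")).lower() == "true"
    )
    # DeepSeek V4-style layouts are always packed; other layouts must opt in
    # and must have more than one KV cache group.
    return is_dsv4 or (enable_cross_layers and len(kv_cache_groups) > 1)


opted_in = {"enable_cross_layers_blocks": "True"}

print(should_use_packed_layout({}, ["g0"], is_dsv4=True))
print(should_use_packed_layout(opted_in, ["g0", "g1"], is_dsv4=False))
print(should_use_packed_layout(opted_in, ["g0"], is_dsv4=False))
```

The last call is the dense-model case raised in the review: with a single KV cache group the flag alone does not enable the packed layout, which is the behaviour the len(kv_cache_groups) > 1 check preserves.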

Support packed HMA KV cache layout

28e4b3f

Assisted-by: OpenAI Codex Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson requested review from ApostaC, NickLucche, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat, xuechendi and ywang96 as code owners June 20, 2026 21:50

mergify Bot added v1 kv-connector labels Jun 20, 2026

LucasWilkinson added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 21, 2026

orozery reviewed Jun 22, 2026

LucasWilkinson and others added 2 commits June 22, 2026 17:27

add comment

8d0722b

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Merge branch 'main' into codex/packed-kv-hma

4c5c1e8

tlrmchlsmth approved these changes Jun 24, 2026

LucasWilkinson merged commit e7df232 into main Jun 24, 2026
110 checks passed

LucasWilkinson deleted the codex/packed-kv-hma branch June 24, 2026 15:55

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[KV Offload] Gate packed HMA KV cache on cross-layer config (vllm-pro…

9efa8bc

…ject#46252) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson added this to the v0.24.0 cherrypick milestone Jun 24, 2026

khluu pushed a commit that referenced this pull request Jun 25, 2026

[KV Offload] Gate packed HMA KV cache on cross-layer config (#46252)

b36db10

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> (cherry picked from commit e7df232)

qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026

[KV Offload] Gate packed HMA KV cache on cross-layer config (vllm-pro…

e1c3f95

…ject#46252) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026

[KV Offload] Gate packed HMA KV cache on cross-layer config (vllm-pro…

7db7667

…ject#46252) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

LucasWilkinson merged 3 commits into main from codex/packed-kv-hma