[KV Offload] Gate packed HMA KV cache on cross-layer config #46252

Merged
LucasWilkinson merged 3 commits into main from codex/packed-kv-hma on Jun 24, 2026

@LucasWilkinson LucasWilkinson commented Jun 20, 2026

Collaborator

Summary

  • Use kv_connector_extra_config["enable_cross_layers_blocks"] to opt multi-group HMA layouts into packed KV allocation, as described in https://docs.vllm.ai/en/stable/features/nixl_connector_usage/#cross-layers-blocks
  • Keep DeepSeek V4-style UniformTypeKVCacheSpecs layouts on the packed path by default.
  • Remove the extra VLLM_USE_PACKED_HMA_KV_CACHE environment flag.
  • Canonicalize packed KV caches in the offloading worker as one full-row tensor/ref per KV group.
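
For illustration, the opt-in from the first bullet could be passed through the KV transfer config roughly as follows. This is a minimal sketch: only the `enable_cross_layers_blocks` key comes from this PR; the `kv_connector` and `kv_role` values are assumptions chosen for the example, and the dict name `kv_transfer_config` is hypothetical.

```python
import json

# Hypothetical KV transfer config. Only "enable_cross_layers_blocks" is named
# in this PR; the connector name and role below are assumptions.
kv_transfer_config = {
    "kv_connector": "NixlConnector",  # assumption
    "kv_role": "kv_both",  # assumption
    "kv_connector_extra_config": {"enable_cross_layers_blocks": True},
}

# Serialize to the JSON string form typically passed on the command line.
cli_arg = json.dumps(kv_transfer_config)
print(f"--kv-transfer-config '{cli_arg}'")
```

With this PR, the flag only changes the layout for multi-group HMA models; DeepSeek V4-style layouts stay packed whether or not it is set.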

Tests

  • UCX_TLS=cuda_ipc,cuda_copy,sm,tcp,self UCX_NET_DEVICES=all .venv/bin/python -m pytest tests/v1/core/test_contiguous_kv_packing.py tests/v1/kv_connector/unit/offloading_connector/test_worker.py tests/v1/kv_connector/unit/test_nixl_simple_cpu_offload.py -q
    • 20 passed, 21 warnings
  • .venv/bin/pre-commit run ruff-check --files vllm/v1/core/kv_cache_utils.py vllm/envs.py tests/v1/core/test_contiguous_kv_packing.py vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py
    • passed

Benchmarks / runtime validation

DeepSeek V4 P/D run (nvidia/DeepSeek-V4-Flash-NVFP4):

  • Packed KV registered on all prefill/decode ranks with num_regions=1.
  • Benchmark: 16/16 successful requests.
  • NIXL: 17 transfers, 141 descriptors, 505,284,480 bytes transferred.

GPT-OSS P/D (openai/gpt-oss-20b) with TRITON_ATTN and cross-layer packed config:

  • Packed KV registered on prefill/decode with num_regions=1.
  • Benchmark: 32/32 successful requests.
  • Perf: 3.72 req/s, 59.66 output tok/s, mean TTFT 372.02 ms, mean TPOT 32.81 ms, mean ITL 32.72 ms.
  • NIXL: 33 transfers, 279 descriptors, 438,829,056 bytes transferred.
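
As a rough sanity check on the NIXL figures above, the averages per transfer and per descriptor can be derived directly from the reported totals. The sketch below only does arithmetic on the numbers already listed; it says nothing about how individual descriptors were actually sized, and the names `runs` and `summary` are hypothetical.

```python
# Figures copied from the NIXL lines above.
runs = {
    "DeepSeek V4": {"transfers": 17, "descriptors": 141, "bytes": 505_284_480},
    "GPT-OSS": {"transfers": 33, "descriptors": 279, "bytes": 438_829_056},
}

MIB = 1024 * 1024

summary = {}
for name, r in runs.items():
    # Plain averages over the whole run, expressed in MiB.
    summary[name] = {
        "mib_per_transfer": r["bytes"] / r["transfers"] / MIB,
        "mib_per_descriptor": r["bytes"] / r["descriptors"] / MIB,
    }
    print(name, summary[name])
```

For the GPT-OSS run the total divides evenly across descriptors (1.5 MiB each on average); the DeepSeek V4 total does not divide evenly, so its per-descriptor figure is only a mean.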

AI assistance

AI assistance was used to prepare this change. This has been reviewed by the submitter.

Assisted-by: OpenAI Codex

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

enable_cross_layers = (
    str(extra_config.get("enable_cross_layers_blocks", "False")).lower() == "true"
)
return is_dsv4 or (enable_cross_layers and len(kv_cache_groups) > 1)
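
To make the behavior of this gate easier to review, here is a standalone sketch of the same condition. The function name `use_packed_kv_layout` and its parameters are hypothetical; in the PR, `is_dsv4`, `extra_config`, and `kv_cache_groups` come from the surrounding code, which is not shown here.

```python
from typing import Any


def use_packed_kv_layout(
    extra_config: dict[str, Any],
    num_kv_cache_groups: int,
    is_dsv4: bool,
) -> bool:
    """Hypothetical standalone version of the gate shown above."""
    # str(...).lower() accepts the bool True as well as "True" / "true".
    enable_cross_layers = (
        str(extra_config.get("enable_cross_layers_blocks", "False")).lower() == "true"
    )
    # DeepSeek V4-style layouts are always packed; everything else must opt in
    # and have more than one KV cache group.
    return is_dsv4 or (enable_cross_layers and num_kv_cache_groups > 1)


# Multi-group model, opted in via a string or a bool value.
print(use_packed_kv_layout({"enable_cross_layers_blocks": "True"}, 2, False))
print(use_packed_kv_layout({"enable_cross_layers_blocks": True}, 2, False))
# Single-group (dense) model: the flag alone does not enable packing.
print(use_packed_kv_layout({"enable_cross_layers_blocks": True}, 1, False))
# DeepSeek V4-style layout: packed without the flag.
print(use_packed_kv_layout({}, 1, True))
```

The third call is the case raised in the review comment below: with the `> 1` check in place, a dense model stays on the unpacked path even when the flag is set.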

Collaborator

Can we apply the packed layout for dense models as well?
i.e. remove the len(kv_cache_groups) > 1 check

Collaborator Author

IIUC, cross-layer KV cache is already supported for non-hybrid models (please correct me if I'm wrong). I'd prefer to keep this PR limited in scope, since it's really just a temporary solution while we complete #42082.

Collaborator

Cross-layer blocks de facto no longer work, since more and more uniform models (e.g. llama, qwen) use MRV2, which does not support them.

Collaborator Author

Happy to do this as a follow-up, but I'm hoping to get this in for 0.24, so I'd rather not increase the scope of this PR (it was just intended as a fast-follow cleanup for #46205).

Comment thread vllm/v1/core/kv_cache_utils.py
LucasWilkinson and others added 2 commits June 22, 2026 17:27
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
@LucasWilkinson LucasWilkinson merged commit e7df232 into main Jun 24, 2026
110 checks passed
@LucasWilkinson LucasWilkinson deleted the codex/packed-kv-hma branch June 24, 2026 15:55
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
@LucasWilkinson LucasWilkinson added this to the v0.24.0 cherrypick milestone Jun 24, 2026
khluu pushed a commit that referenced this pull request Jun 25, 2026
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
(cherry picked from commit e7df232)
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…ject#46252)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026

Labels

kv-connector, ready (ONLY add when PR is ready to merge/full CI is needed), v1

3 participants