[KV Offload] Support packed HMA KV cache layout by LucasWilkinson · Pull Request #46205 · vllm-project/vllm

LucasWilkinson · 2026-06-20T01:53:53Z

Summary

- Add an opt-in VLLM_USE_PACKED_HMA_KV_CACHE path for multi-group HMA KV cache packing.
- Keep the existing DeepSeek V4 packed path unchanged.
- Register packed HMA offload as one canonical backing tensor with one full-row ref per KV group, preserving the packed topology for CPU offload.
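
To make the layout idea concrete, here is a minimal sketch in plain Python (stdlib only). It is not vLLM's implementation: `GroupRef`, `pack_groups`, `block_row`, and `group_slice` are hypothetical names, and a `bytearray` stands in for the canonical backing tensor. It assumes each block occupies one contiguous row that holds every KV group's slice back to back, which is one plausible reading of "full-row ref".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupRef:
    """Hypothetical per-group reference into the shared backing buffer."""
    group_id: int
    row_bytes: int  # size of one full packed row (all groups together)
    offset: int     # byte offset of this group's slice inside a row
    size: int       # byte size of this group's slice


def pack_groups(num_blocks, group_slice_bytes):
    """Allocate ONE backing buffer and one full-row ref per KV group."""
    row_bytes = sum(group_slice_bytes)
    backing = bytearray(num_blocks * row_bytes)
    refs, offset = [], 0
    for gid, size in enumerate(group_slice_bytes):
        refs.append(GroupRef(gid, row_bytes, offset, size))
        offset += size
    return backing, refs


def block_row(backing, ref, block_id):
    """Zero-copy view of the full packed row for a block (the unit to offload)."""
    start = block_id * ref.row_bytes
    return memoryview(backing)[start:start + ref.row_bytes]


def group_slice(backing, ref, block_id):
    """Zero-copy view of only this group's bytes inside the block's row."""
    row = block_row(backing, ref, block_id)
    return row[ref.offset:ref.offset + ref.size]


# Three groups with slice sizes 16, 32, and 16 bytes share one 4-block buffer.
backing, refs = pack_groups(num_blocks=4, group_slice_bytes=[16, 32, 16])

# Writing through group 1's slice is visible through any group's full-row view.
group_slice(backing, refs[1], block_id=2)[:] = b"\x01" * 32
row = block_row(backing, refs[0], block_id=2)
```

Because every ref spans the same full row, an offload step in this sketch would copy one contiguous region per block instead of one region per group slice, which mirrors the "1 CPU tensor" versus "12 CPU tensors" contrast reported in the benchmarks.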

Duplicate-work check

Checked open PRs for the following search terms: packed HMA KV cache, gpt-oss gemma packed kv cache, VLLM_USE_PACKED_HMA_KV_CACHE, packed KV cache offloading, and hybrid KV cache offload.
Related work exists, especially [KV Connector] Canonical KV Cache Allocation for HMA Models #37885 and [KV Offload] Reshape the transfer data model: per group specs and offloaded side alignment offset #44865. This PR is intentionally narrower: it is a flagged alternative to #37885 for gpt-oss/Gemma-style HMA packing, and it preserves the DeepSeek V4 behavior introduced in [DSv4] Pack KV caches into contiguous per-block allocations for DeepSeek V4 #44577.
#44577 is merged, and this PR builds on that packed layout support rather than duplicating the DSv4 change.

Benchmarks

openai/gpt-oss-20b, B300, 128K, OffloadingConnector, 2 CPU-hit iterations: packed HMA full-row refs used 1 CPU tensor and averaged ~124.95 ms vs per-slice registration with 12 CPU tensors at ~144.37 ms (~13.5% lower latency).
google/gemma-3-1b-it, 4K: packed HMA used 1 CPU tensor vs 4 and CPU-hit latency was effectively flat, ~12.23 ms vs ~12.32 ms.
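
As a quick check of the percentages, the short calculation below recomputes the relative latency reduction from the reported averages. The helper name `latency_reduction` is made up for illustration; only the four millisecond figures come from the benchmarks above.

```python
def latency_reduction(baseline_ms, new_ms):
    """Fraction by which new_ms is lower than baseline_ms."""
    return (baseline_ms - new_ms) / baseline_ms


# gpt-oss-20b: per-slice registration (baseline) vs packed HMA full-row refs.
gpt_oss = latency_reduction(144.37, 124.95)
# gemma-3-1b-it: 4 CPU tensors (baseline) vs 1 packed CPU tensor.
gemma = latency_reduction(12.32, 12.23)

print(f"gpt-oss-20b: {gpt_oss:.1%}")   # 13.5%
print(f"gemma-3-1b-it: {gemma:.1%}")   # 0.7%
```

The gpt-oss figure is measured relative to the slower per-slice baseline, so it describes a drop in latency rather than a throughput ratio. The Gemma difference is under 1%, consistent with the "effectively flat" description.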

Tests

.venv/bin/python -m pytest tests/v1/core/test_contiguous_kv_packing.py tests/v1/simple_kv_offload/test_scheduler.py tests/v1/kv_connector/unit/offloading_connector/test_worker.py tests/v1/kv_offload/cpu/test_gpu_worker.py -q
.venv/bin/pre-commit run ruff-check --files vllm/v1/kv_offload/base.py vllm/v1/kv_offload/cpu/gpu_worker.py vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py
The commit hook also ran ruff check, ruff format, typos, mypy py3.10, SPDX, config validation, and other repository hooks successfully.

AI Assistance

AI assistance was used to implement and iterate on this change. This PR has been reviewed by the author.

Add an opt-in packed KV cache layout for multi-group HMA models while preserving the existing DeepSeek V4 packed path. For HMA offloading, register the packed backing as one canonical tensor and use one full-row ref per KV group so CPU offload keeps the packed topology instead of allocating/copying per-slice tensors.

Benchmark notes:
- openai/gpt-oss-20b, B300, 128K, OffloadingConnector, 2 CPU-hit iterations: packed HMA full-row refs used 1 CPU tensor and averaged ~124.95 ms vs per-slice registration with 12 CPU tensors at ~144.37 ms (~13.5% faster).
- google/gemma-3-1b-it, 4K: packed HMA used 1 CPU tensor vs 4 and CPU-hit latency was effectively flat, ~12.23 ms vs ~12.32 ms.

Tests:
- .venv/bin/python -m pytest tests/v1/core/test_contiguous_kv_packing.py tests/v1/simple_kv_offload/test_scheduler.py tests/v1/kv_connector/unit/offloading_connector/test_worker.py tests/v1/kv_offload/cpu/test_gpu_worker.py -q
- .venv/bin/pre-commit run ruff-check --files vllm/v1/kv_offload/base.py vllm/v1/kv_offload/cpu/gpu_worker.py vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

mergify Bot added v1 kv-connector labels Jun 20, 2026

LucasWilkinson added the ready label (label description: ONLY add when PR is ready to merge/full CI is needed) Jun 20, 2026

LucasWilkinson marked this pull request as ready for review June 20, 2026 02:40

LucasWilkinson requested review from ApostaC, NickLucche, WoosukKwon, alexm-redhat, heheda12345, njhill, orozery, robertgshaw2-redhat, xuechendi and ywang96 as code owners June 20, 2026 02:40

tlrmchlsmth approved these changes Jun 20, 2026


Merge branch 'main' into codex/packed-kv-hma

83127f6

tlrmchlsmth enabled auto-merge (squash) June 20, 2026 19:28

tlrmchlsmth merged commit cc22621 into vllm-project:main Jun 20, 2026
91 checks passed

tlrmchlsmth merged 2 commits into vllm-project:main from neuralmagic:codex/packed-kv-hma
