
[KV Offload] Support packed HMA KV cache layout #46205

Merged: tlrmchlsmth merged 2 commits into vllm-project:main from neuralmagic:codex/packed-kv-hma on Jun 20, 2026

LucasWilkinson (Collaborator) commented Jun 20, 2026

Summary

  • add an opt-in VLLM_USE_PACKED_HMA_KV_CACHE path for multi-group HMA KV cache packing
  • keep the existing DeepSeek V4 packed path unchanged
  • register packed HMA offload as one canonical backing tensor with one full-row ref per KV group, preserving the packed topology for CPU offload
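
The registration difference described in the last bullet can be sketched in plain Python. This is an illustrative model only, not the vLLM implementation: the sizes, the function names, and the reading of "full-row ref" as "every KV group points at the same backing buffer with the full row width" are assumptions made for the example. It compares per-slice registration (one CPU buffer and one copy per slice) with packed registration (one CPU buffer and one contiguous copy per block).

```python
# Hypothetical sizes, chosen only to keep the example small.
NUM_BLOCKS, NUM_SLICES, SLICE_BYTES = 4, 12, 8
ROW_BYTES = NUM_SLICES * SLICE_BYTES  # one packed block row holds every slice


def make_gpu_backing():
    """Stand-in for the packed GPU backing tensor: NUM_BLOCKS rows of ROW_BYTES."""
    return bytearray(i % 251 for i in range(NUM_BLOCKS * ROW_BYTES))


def offload_per_slice(gpu, block_id):
    """Per-slice registration: one CPU buffer per slice, one copy per slice."""
    cpu_tensors = [bytearray(NUM_BLOCKS * SLICE_BYTES) for _ in range(NUM_SLICES)]
    copies = 0
    for s, cpu in enumerate(cpu_tensors):
        src = block_id * ROW_BYTES + s * SLICE_BYTES
        dst = block_id * SLICE_BYTES
        cpu[dst:dst + SLICE_BYTES] = gpu[src:src + SLICE_BYTES]
        copies += 1
    return cpu_tensors, copies


def offload_packed(gpu, block_id, num_groups=2):
    """Packed registration: one CPU buffer shared by all groups, one copy per block."""
    cpu = bytearray(len(gpu))
    # Assumption: each KV group holds a ref to the same buffer and the full row width.
    group_refs = [(cpu, ROW_BYTES)] * num_groups
    backing, row_bytes = group_refs[0]
    start = block_id * row_bytes
    backing[start:start + row_bytes] = gpu[start:start + row_bytes]
    return [cpu], 1


gpu = make_gpu_backing()
per_slice, per_slice_copies = offload_per_slice(gpu, block_id=2)
packed, packed_copies = offload_packed(gpu, block_id=2)
print(len(per_slice), per_slice_copies)  # CPU buffers and copies, per-slice path
print(len(packed), packed_copies)        # CPU buffers and copies, packed path
```

Both paths move the same bytes for a block; the packed path keeps them in the row order of the backing buffer, which is what "preserving the packed topology" refers to in this sketch.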

Duplicate-work check

Benchmarks

  • openai/gpt-oss-20b, B300, 128K, OffloadingConnector, 2 CPU-hit iterations: packed HMA full-row refs used 1 CPU tensor and averaged ~124.95 ms vs per-slice registration with 12 CPU tensors at ~144.37 ms (~13.5% faster).
  • google/gemma-3-1b-it, 4K: packed HMA used 1 CPU tensor vs 4 and CPU-hit latency was effectively flat, ~12.23 ms vs ~12.32 ms.
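
The percentages above follow from the reported averages. A quick check, assuming the first figure in each pair is the packed path and the second is per-slice registration, and measuring the reduction relative to the per-slice baseline (the helper name is made up for this example):

```python
def percent_faster(baseline_ms, candidate_ms):
    """Latency reduction of candidate relative to baseline, in percent."""
    return (baseline_ms - candidate_ms) / baseline_ms * 100.0


# Averages as reported in the benchmark bullets.
gpt_oss = percent_faster(baseline_ms=144.37, candidate_ms=124.95)
gemma = percent_faster(baseline_ms=12.32, candidate_ms=12.23)

print(f"gpt-oss-20b:   {gpt_oss:.1f}% lower CPU-hit latency")  # 13.5%
print(f"gemma-3-1b-it: {gemma:.2f}% lower CPU-hit latency")    # 0.73%
```

The gemma-3-1b-it difference is under 1%, consistent with calling it effectively flat.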

Tests

  • .venv/bin/python -m pytest tests/v1/core/test_contiguous_kv_packing.py tests/v1/simple_kv_offload/test_scheduler.py tests/v1/kv_connector/unit/offloading_connector/test_worker.py tests/v1/kv_offload/cpu/test_gpu_worker.py -q
  • .venv/bin/pre-commit run ruff-check --files vllm/v1/kv_offload/base.py vllm/v1/kv_offload/cpu/gpu_worker.py vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py
  • commit hook also ran ruff check, ruff format, typos, mypy py3.10, SPDX, config validation, and other repository hooks successfully.

AI Assistance

AI assistance was used to implement and iterate on this change. This PR has been reviewed by the author.

Add an opt-in packed KV cache layout for multi-group HMA models while preserving the existing DeepSeek V4 packed path. For HMA offloading, register the packed backing as one canonical tensor and use one full-row ref per KV group so CPU offload keeps the packed topology instead of allocating/copying per-slice tensors.

Benchmark notes:

- openai/gpt-oss-20b, B300, 128K, OffloadingConnector, 2 CPU-hit iterations: packed HMA full-row refs used 1 CPU tensor and averaged ~124.95 ms vs per-slice registration with 12 CPU tensors at ~144.37 ms (~13.5% faster).

- google/gemma-3-1b-it, 4K: packed HMA used 1 CPU tensor vs 4 and CPU-hit latency was effectively flat, ~12.23 ms vs ~12.32 ms.

Tests:

- .venv/bin/python -m pytest tests/v1/core/test_contiguous_kv_packing.py tests/v1/simple_kv_offload/test_scheduler.py tests/v1/kv_connector/unit/offloading_connector/test_worker.py tests/v1/kv_offload/cpu/test_gpu_worker.py -q

- .venv/bin/pre-commit run ruff-check --files vllm/v1/kv_offload/base.py vllm/v1/kv_offload/cpu/gpu_worker.py vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py

Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
LucasWilkinson added the ready label Jun 20, 2026
LucasWilkinson marked this pull request as ready for review June 20, 2026 02:40
tlrmchlsmth enabled auto-merge (squash) June 20, 2026 19:28
tlrmchlsmth merged commit cc22621 into vllm-project:main Jun 20, 2026
91 checks passed

Labels: kv-connector, ready, v1
