[KV Offload] Gate packed HMA KV cache on cross-layer config #46252
Assisted-by: OpenAI Codex
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
    enable_cross_layers = (
        str(extra_config.get("enable_cross_layers_blocks", "False")).lower() == "true"
    )
    return is_dsv4 or (enable_cross_layers and len(kv_cache_groups) > 1)
Can we apply the packed layout for dense models as well?
i.e. remove the len(kv_cache_groups) > 1 check
IIUC, cross-layer KV cache is already supported for non-hybrid models (please correct me if I'm wrong); I'd prefer to keep this PR limited in scope since it's really just a temporary solution while we complete #42082.
Cross layers de facto no longer works since more and more uniform models (e.g. llama, qwen) use MRV2, which does not have it.
Happy to do this as a follow-up, but I'm hoping to get this in for 0.24, so I'd rather not increase the scope of this PR (it was just intended to be a cleanup fast-follow for #46205).
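For reference, a minimal standalone sketch of the gate discussed in this thread. The function name `use_packed_kv_layout` and the list stand-ins for KV cache groups are hypothetical and only serve the illustration; the expression itself is copied from the diff above. It shows the behavior the first comment asks about: with the opt-in set, a single-group (dense) model still does not take the packed path, because of the `len(kv_cache_groups) > 1` check.

```python
from typing import Any, Mapping, Sequence


def use_packed_kv_layout(
    is_dsv4: bool,
    extra_config: Mapping[str, Any],
    kv_cache_groups: Sequence[Any],
) -> bool:
    """Hypothetical standalone wrapper around the gate from the diff."""
    # str(...).lower() lets the bool True and the strings "True"/"true"
    # all count as opting in; a missing key defaults to "False".
    enable_cross_layers = (
        str(extra_config.get("enable_cross_layers_blocks", "False")).lower() == "true"
    )
    # DSV4 is always packed; any other model needs the opt-in AND
    # more than one KV cache group.
    return is_dsv4 or (enable_cross_layers and len(kv_cache_groups) > 1)


opt_in = {"enable_cross_layers_blocks": "True"}
hybrid = ["group0", "group1"]  # stand-in for a multi-group (hybrid) model
dense = ["group0"]  # stand-in for a single-group (dense) model

print(use_packed_kv_layout(False, opt_in, hybrid))  # True
print(use_packed_kv_layout(False, opt_in, dense))   # False
print(use_packed_kv_layout(False, {}, hybrid))      # False
print(use_packed_kv_layout(True, {}, dense))        # True
```

Dropping the `len(kv_cache_groups) > 1` term, as suggested in the first comment, would turn the second case into `True`; that is the scope change deferred to a follow-up.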
…ject#46252) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> (cherry picked from commit e7df232)
…ject#46252) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
Summary
- kv_connector_extra_config["enable_cross_layers_blocks"] to opt multi-group HMA layouts into packed KV allocation as laid out in https://docs.vllm.ai/en/stable/features/nixl_connector_usage/#cross-layers-blocks
- UniformTypeKVCacheSpecs layouts on the packed path by default.
- VLLM_USE_PACKED_HMA_KV_CACHE environment flag.

Tests
- UCX_TLS=cuda_ipc,cuda_copy,sm,tcp,self UCX_NET_DEVICES=all .venv/bin/python -m pytest tests/v1/core/test_contiguous_kv_packing.py tests/v1/kv_connector/unit/offloading_connector/test_worker.py tests/v1/kv_connector/unit/test_nixl_simple_cpu_offload.py -q
  20 passed, 21 warnings
- .venv/bin/pre-commit run ruff-check --files vllm/v1/core/kv_cache_utils.py vllm/envs.py tests/v1/core/test_contiguous_kv_packing.py vllm/distributed/kv_transfer/kv_connector/v1/offloading/worker.py

Benchmarks / runtime validation
- DeepSeek V4 P/D (nvidia/DeepSeek-V4-Flash-NVFP4), run: num_regions=1.
- GPT-OSS P/D (openai/gpt-oss-20b) with TRITON_ATTN and cross-layer packed config: num_regions=1.

AI assistance
AI assistance was used to prepare this change. This has been reviewed by the submitter.