[KV offload] Parallel-agnostic fs-tier cache for single full-attention group#44733

Merged
orozery merged 9 commits into vllm-project:main from Etelis:kv-offload-parallel-agnostic-fs
Jun 11, 2026
Conversation

@Etelis Etelis commented Jun 6, 2026

Contributor

Tests

pytest tests/v1/kv_offload/tiering/test_fs_tier.py: adds 3 predicate cases (single full-attn → agnostic; multi-group → off; non-full-attn → off).

Validated end-to-end on 4×H100 (store TP=2 → load TP=4, same cache dir):

| Model | Result |
| --- | --- |
| Qwen2.5-7B (full attention) | 400/403 tokens loaded from the TP=2 cache; output identical to fresh TP=4 |
| DeepSeek-V2-Lite (MLA) | Short read → recompute → correct output |
| Qwen2.5-3B (GQA, replicated at TP=4) | Short read → recompute → correct output |
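
The "short read → recompute" rows can be pictured as a fail-closed size check on load. The following is a minimal sketch under stated assumptions, not the fs tier's actual code: `load_block` and `demo` are hypothetical names, and the 1024/2048 byte sizes are made up for illustration. The idea is only that a cached file whose size does not match what the loader expects is treated as a miss, so the block is recomputed instead of being reinterpreted.

```python
import os
import tempfile
from typing import Optional


def load_block(path: str, expected_nbytes: int) -> Optional[bytes]:
    """Return the block's bytes, or None on a missing or wrongly sized file.

    Hypothetical helper: any size mismatch is reported as a cache miss
    (fail closed), never as a partially usable block.
    """
    try:
        with open(path, "rb") as f:
            # Read one extra byte so an oversized file is also detected.
            data = f.read(expected_nbytes + 1)
    except FileNotFoundError:
        return None
    if len(data) != expected_nbytes:
        return None  # short or long read: caller recomputes the block
    return data


def demo() -> tuple:
    """Store a block under one layout, then load it with two expected sizes."""
    with tempfile.TemporaryDirectory() as cache_dir:
        path = os.path.join(cache_dir, "block.bin")
        with open(path, "wb") as f:
            f.write(b"\x00" * 1024)  # size written by the storing run (made up)
        same_layout = load_block(path, expected_nbytes=1024)
        other_layout = load_block(path, expected_nbytes=2048)
        return same_layout is not None, other_layout is None
```

In this sketch a caller that receives `None` falls back to recomputing the KV block, which matches the "correct output" outcome reported for the MLA and replicated-GQA runs.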
…n group

Enable parallel_agnostic=True only for a single full-attention KV-cache group so the fs-tier offload cache can be reused across tensor/pipeline parallel sizes.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis requested review from ApostaC and orozery as code owners June 6, 2026 12:49

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the v1 label Jun 6, 2026
Comment on lines +93 to +99
# A single full-attention group has a parallelism-invariant offloaded
# block, so share its cache across parallel sizes. Replicated KV (MLA,
# small-GQA) is world_size-scaled and fails closed on the load size check.
kv_cache_groups = offloading_spec.kv_cache_config.kv_cache_groups
parallel_agnostic = len(kv_cache_groups) == 1 and isinstance(
    kv_cache_groups[0].kv_cache_spec, FullAttentionSpec
)
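
For readers without the vLLM tree at hand, the predicate in the diff above can be exercised in isolation. This is a sketch only: the three dataclasses are hypothetical stand-ins defined here (they are not vLLM's `FullAttentionSpec` or its KV-cache group type), and `is_parallel_agnostic` is an illustrative name. It mirrors the three cases listed in the PR's tests: single full-attention group, multi-group, and non-full-attention.

```python
from dataclasses import dataclass


@dataclass
class FullAttentionSpec:
    """Hypothetical stand-in for a full-attention KV-cache spec."""
    block_size: int = 16


@dataclass
class SlidingWindowSpec:
    """Hypothetical stand-in for a non-full-attention KV-cache spec."""
    block_size: int = 16
    sliding_window: int = 4096


@dataclass
class KVCacheGroup:
    """Hypothetical stand-in for one KV-cache group."""
    kv_cache_spec: object


def is_parallel_agnostic(kv_cache_groups: list) -> bool:
    """True only for exactly one group whose spec is full attention."""
    return len(kv_cache_groups) == 1 and isinstance(
        kv_cache_groups[0].kv_cache_spec, FullAttentionSpec
    )


single_full = [KVCacheGroup(FullAttentionSpec())]
multi_group = [KVCacheGroup(FullAttentionSpec()), KVCacheGroup(SlidingWindowSpec())]
non_full = [KVCacheGroup(SlidingWindowSpec())]
```

Note that an `isinstance` check also accepts subclasses of the spec class, so any spec that inherits from the full-attention type would need its own exclusion if it should not share a cache.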

Collaborator

Let's move this logic inside the FileMapper (so other tiers can "enjoy" it as well).
And set parallel_agnostic=True here (and in the obj tier as well)

Contributor Author

You're right, sorry about that.

Tiers opt in with parallel_agnostic=True; FileMapper gates it on a single full-attention group.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
@Etelis Etelis requested a review from orozery June 9, 2026 06:05
Comment thread vllm/v1/kv_offload/file_mapper.py Outdated
Comment thread tests/v1/kv_offload/tiering/test_fs_tier.py Outdated
Comment thread vllm/v1/kv_offload/tiering/fs/manager.py
MLA latent KV is replicated per rank, never head-sharded.

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Comment thread tests/v1/kv_offload/tiering/test_fs_tier.py Outdated
@orozery orozery added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 10, 2026
@orozery orozery merged commit c3662b3 into vllm-project:main Jun 11, 2026
63 checks passed
ryttry pushed a commit to ryttry/vllm that referenced this pull request Jun 11, 2026
…n group (vllm-project#44733)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…n group (vllm-project#44733)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Etelis pushed a commit to Etelis/vllm that referenced this pull request Jun 18, 2026
The parallel-agnostic fs-tier cache (vllm-project#44733) collapses tp/pp/pcp/dcp and
rank out of the cache namespace for a single full-attention group, on the
assumption that the offloaded KV blocks are parallelism-invariant. That
invariant is not known to hold under the V2 model runner, whose KV layout
may differ, so sharing a cache directory across layouts there could alias
distinct blocks.

Gate the opt-in on the canonical vllm_config.use_v2_model_runner property
(not a raw env-var read, so config-driven V2 defaults such as diffusion
models are handled correctly). The check sits alongside the existing MLA
and multi-group exclusions in FileMapper, keeping the call sites unchanged.

Adds a regression test asserting tp/rank are not collapsed when the V2
model runner is active.

Co-authored-by: Claude
Signed-off-by: Itay Etelis <etelis2019@gmail.com>
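
The namespace collapse and its gates, as described in the commit message above, might look roughly like the sketch below. Everything here is an assumption made for illustration: `ParallelLayout`, `cache_namespace`, the boolean parameter names, and the directory naming scheme are hypothetical and are not taken from `FileMapper`. The sketch only shows the shape of the logic: tp/pp/pcp/dcp and rank are dropped from the namespace when the tier opts in and none of the exclusions (multi-group, MLA, V2 model runner) applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical bundle of the parallel sizes and rank named in the commit."""
    tp: int
    pp: int
    pcp: int
    dcp: int
    rank: int


def cache_namespace(
    model: str,
    layout: ParallelLayout,
    *,
    parallel_agnostic: bool = True,           # the tier's opt-in
    single_full_attention_group: bool = True,
    is_mla: bool = False,
    use_v2_model_runner: bool = False,
) -> str:
    """Build an illustrative cache namespace string (naming scheme is made up)."""
    collapse = (
        parallel_agnostic
        and single_full_attention_group
        and not is_mla
        and not use_v2_model_runner
    )
    if collapse:
        # Layout-independent: runs with different parallel sizes share entries.
        return f"{model}/agnostic"
    # Layout-specific: each parallel configuration and rank gets its own space.
    return (
        f"{model}/tp{layout.tp}-pp{layout.pp}-pcp{layout.pcp}"
        f"-dcp{layout.dcp}-rank{layout.rank}"
    )


tp2 = ParallelLayout(tp=2, pp=1, pcp=1, dcp=1, rank=0)
tp4 = ParallelLayout(tp=4, pp=1, pcp=1, dcp=1, rank=3)
```

With the defaults, `tp2` and `tp4` map to the same namespace; passing `use_v2_model_runner=True` keeps them apart, which is the behavior the regression test in the commit is described as asserting.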
Etelis pushed a commit to Etelis/vllm that referenced this pull request Jun 18, 2026
The parallel-agnostic fs-tier cache (vllm-project#44733) collapses tp/pp/pcp/dcp and rank
out of the cache namespace for a single full-attention group, assuming the
offloaded blocks are parallelism-invariant. That assumption is not known to
hold under the V2 model runner, so gate the opt-in on
vllm_config.use_v2_model_runner alongside the existing MLA and multi-group
exclusions.

Signed-off-by: Itay Etelis <etelis2019@gmail.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…n group (vllm-project#44733)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…n group (vllm-project#44733)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…n group (vllm-project#44733)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Labels

ready ONLY add when PR is ready to merge/full CI is needed v1

3 participants