[KV offload] Parallel-agnostic fs-tier cache for single full-attention group by Etelis · Pull Request #44733 · vllm-project/vllm

Etelis · 2026-06-06T12:49:36Z

Tests

pytest tests/v1/kv_offload/tiering/test_fs_tier.py — adds 3 predicate cases (single full-attn →
agnostic; multi-group → off; non-full-attn → off).

Validated end-to-end on 4×H100 (store TP=2 → load TP=4, same cache dir):

| Model | Result |
|---|---|
| Qwen2.5-7B (full attention) | 400/403 tokens loaded from the TP=2 cache; output identical to fresh TP=4 |
| DeepSeek-V2-Lite (MLA) | `Short read` → recompute → correct output |
| Qwen2.5-3B (GQA, replicated at TP=4) | `Short read` → recompute → correct output |
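
The two `Short read` rows suggest a fail-closed load path: when the stored block does not match the size the current layout expects, the load is treated as a miss and the tokens are recomputed. The following is only a minimal sketch of that pattern. `load_block` and its arguments are hypothetical names for illustration, not the fs-tier's actual API.

```python
from typing import Optional


def load_block(path: str, expected_nbytes: int) -> Optional[bytes]:
    """Return the block payload, or None so the caller falls back to recompute.

    Hypothetical helper: the real fs tier may check sizes differently.
    """
    try:
        with open(path, "rb") as f:
            # Read one extra byte so an oversized file is also detected.
            data = f.read(expected_nbytes + 1)
    except FileNotFoundError:
        return None  # plain cache miss
    if len(data) != expected_nbytes:
        return None  # size mismatch: fail closed instead of using the bytes
    return data
```

A caller would recompute the affected tokens whenever `load_block` returns `None`, which matches the "recompute → correct output" behaviour reported in the table.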

…n group Enable parallel_agnostic=True only for a single full-attention KV-cache group so the fs-tier offload cache can be reused across tensor/pipeline parallel sizes. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

orozery · 2026-06-08T05:55:05Z

+        # A single full-attention group has a parallelism-invariant offloaded
+        # block, so share its cache across parallel sizes. Replicated KV (MLA,
+        # small-GQA) is world_size-scaled and fails closed on the load size check.
+        kv_cache_groups = offloading_spec.kv_cache_config.kv_cache_groups
+        parallel_agnostic = len(kv_cache_groups) == 1 and isinstance(
+            kv_cache_groups[0].kv_cache_spec, FullAttentionSpec
+        )


Let's move this logic inside the FileMapper (so other tiers can "enjoy" it as well).
And set parallel_agnostic=True here (and in the obj tier as well)

You're right,
sorry about that.
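
A rough sketch of the arrangement the reviewer asks for: tiers pass an opt-in flag, and the mapper alone decides whether the KV-cache configuration qualifies. All classes below are stand-ins written for this example, not vLLM's real types. Treating the MLA spec as a subclass of the full-attention spec is an assumption made here, and the MLA exclusion reflects the later commit in this PR.

```python
from dataclasses import dataclass


@dataclass
class FullAttentionSpec:  # stand-in for the real spec class
    block_size: int = 16


@dataclass
class MLAAttentionSpec(FullAttentionSpec):  # stand-in; subclassing is assumed
    pass


@dataclass
class SlidingWindowSpec:  # stand-in for any non-full-attention spec
    block_size: int = 16
    sliding_window: int = 4096


@dataclass
class KVCacheGroup:  # stand-in for one entry of kv_cache_groups
    kv_cache_spec: object


def is_parallel_agnostic(opt_in: bool, groups: list) -> bool:
    """Hypothetical predicate a mapper could own.

    True only when the tier opted in and there is exactly one
    full-attention (non-MLA) KV-cache group.
    """
    if not opt_in or len(groups) != 1:
        return False
    spec = groups[0].kv_cache_spec
    return isinstance(spec, FullAttentionSpec) and not isinstance(
        spec, MLAAttentionSpec
    )
```

Keeping the predicate in one place means each tier only needs to pass `parallel_agnostic=True`, and any new exclusion is added once.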

…n group (vllm-project#44733) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com>

The parallel-agnostic fs-tier cache (vllm-project#44733) collapses tp/pp/pcp/dcp and rank out of the cache namespace for a single full-attention group, on the assumption that the offloaded KV blocks are parallelism-invariant. That invariant is not known to hold under the V2 model runner, whose KV layout may differ, so sharing a cache directory across layouts there could alias distinct blocks. Gate the opt-in on the canonical vllm_config.use_v2_model_runner property (not a raw env-var read, so config-driven V2 defaults such as diffusion models are handled correctly). The check sits alongside the existing MLA and multi-group exclusions in FileMapper, keeping the call sites unchanged. Adds a regression test asserting tp/rank are not collapsed when the V2 model runner is active. Co-authored-by: Claude Signed-off-by: Itay Etelis <etelis2019@gmail.com>

The parallel-agnostic fs-tier cache (vllm-project#44733) collapses tp/pp/pcp/dcp and rank out of the cache namespace for a single full-attention group, assuming the offloaded blocks are parallelism-invariant. That assumption is not known to hold under the V2 model runner, so gate the opt-in on vllm_config.use_v2_model_runner alongside the existing MLA and multi-group exclusions. Signed-off-by: Itay Etelis <etelis2019@gmail.com>
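
To make the "collapse out of the namespace" idea concrete, here is a small sketch of how a cache namespace might drop the parallel fields only when the configuration is eligible and the V2 model runner is off. The path format, `ParallelLayout`, and `cache_namespace` are hypothetical and chosen only for illustration. The real FileMapper may build its keys differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:  # hypothetical container for the parallel sizes
    tp: int
    pp: int
    pcp: int
    dcp: int
    rank: int


def cache_namespace(
    model: str,
    layout: ParallelLayout,
    *,
    agnostic_eligible: bool,
    use_v2_model_runner: bool,
) -> str:
    """Build an illustrative cache namespace.

    Parallel sizes and rank are dropped only when the config is eligible
    (single full-attention group, not MLA) and the V2 runner is inactive.
    """
    if agnostic_eligible and not use_v2_model_runner:
        return model  # shared across parallel layouts
    return (
        f"{model}/tp{layout.tp}_pp{layout.pp}"
        f"_pcp{layout.pcp}_dcp{layout.dcp}/rank{layout.rank}"
    )
```

Under this sketch a TP=2 writer and a TP=4 reader resolve to the same namespace on the V1 path, while under the V2 runner each layout and rank keeps its own namespace, so blocks with different layouts cannot alias.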

…n group (vllm-project#44733) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

Etelis requested review from ApostaC and orozery as code owners June 6, 2026 12:49

claude Bot reviewed Jun 6, 2026

mergify Bot added the v1 label Jun 6, 2026

orozery requested changes Jun 8, 2026

[KV offload] Move parallel-agnostic predicate into FileMapper

c89e547

Tiers opt in with parallel_agnostic=True; FileMapper gates it on a single full-attention group. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

Etelis requested a review from orozery June 9, 2026 06:05

orozery requested changes Jun 9, 2026

Comment thread vllm/v1/kv_offload/file_mapper.py Outdated

Comment thread tests/v1/kv_offload/tiering/test_fs_tier.py Outdated

Comment thread vllm/v1/kv_offload/tiering/fs/manager.py

EtelisIBM added 2 commits June 10, 2026 14:41

[KV offload] Exclude MLA from parallel-agnostic fs cache

784b39a

MLA latent KV is replicated per rank, never head-sharded. Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

[KV offload] Move parallel-agnostic predicate tests to test_file_mapper

105e214

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

orozery reviewed Jun 10, 2026

Comment thread tests/v1/kv_offload/tiering/test_fs_tier.py Outdated

EtelisIBM added 2 commits June 10, 2026 15:31

Merge branch 'main' into kv-offload-parallel-agnostic-fs

a05a5c4

[KV offload] Opt obj tier into parallel-agnostic cache sharing

33b7ad9

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>

orozery added the ready label (ONLY add when PR is ready to merge/full CI is needed) Jun 10, 2026

Etelis added 3 commits June 10, 2026 17:11

Merge branch 'main' into kv-offload-parallel-agnostic-fs

0de9314

Merge branch 'main' into kv-offload-parallel-agnostic-fs

290632d

Merge branch 'main' into kv-offload-parallel-agnostic-fs

ab9e80b

orozery approved these changes Jun 11, 2026

orozery merged commit c3662b3 into vllm-project:main Jun 11, 2026
63 checks passed

Etelis mentioned this pull request Jun 18, 2026

[KV offload] Disable parallel-agnostic fs-tier cache on V2 model runner Etelis/vllm#2

Closed

Etelis mentioned this pull request Jun 18, 2026

[KV Connector][Offloading] Disable parallel-agnostic fs-tier cache on V2 model runner #46044

Merged

orozery merged 9 commits into vllm-project:main from Etelis:kv-offload-parallel-agnostic-fs
