[v1][kvcache] Honor prefix-cache retention interval for Mamba/linear attention#45845

Merged
WoosukKwon merged 4 commits into
vllm-project:mainfrom
Dao007forever:mamba-retention-interval
Jun 23, 2026

Conversation

@Dao007forever

@Dao007forever Dao007forever commented Jun 16, 2026

Contributor

Purpose

Wire VLLM_PREFIX_CACHE_RETENTION_INTERVAL to Mamba/linear-attention KV-cache groups, completing the in-code # TODO: Support Mamba/linear attention left by #43447 (which added the mechanism for sliding-window attention only).

Background. Mamba/KDA prefix caching retains a full recurrent-state snapshot once per block_size-token boundary. At small attention block sizes (e.g. 128) each snapshot spans several base blocks, and dense per-boundary retention saturates the KV pool: at block_size 128 the Mamba snapshots occupy ~80% of the blocks, leaving no uncached headroom, so the allocator is forced to evict live attention prefixes. The prefix-cache hit rate then collapses late in long multi-turn runs, dragging down throughput and tail latency, while larger block sizes (256/512) are unaffected. (This was observed in a test setting that allows a shorter block size.)
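
To make the growth concrete, here is a back-of-the-envelope count of cached snapshots per sequence. This is only an illustrative model: the function name `retained_snapshots` and the 32K-token context are made up for the example, and real pool occupancy also depends on how many base blocks each snapshot spans.

```python
from typing import Optional


def retained_snapshots(
    context_tokens: int,
    block_size: int,
    retention_interval: Optional[int] = None,
) -> int:
    """Upper bound on cached state snapshots for one sequence (toy model)."""
    if retention_interval is None:
        # Dense retention: one snapshot per block_size-token boundary.
        return context_tokens // block_size
    # Sparse retention: one snapshot per retention_interval-sized segment,
    # plus at most one more for the latest replay boundary.
    return context_tokens // retention_interval + 1


# Hypothetical 32K-token conversation, chosen only for illustration.
dense = retained_snapshots(32_768, block_size=128)
sparse = retained_snapshots(32_768, block_size=128, retention_interval=512)
print(dense, sparse)  # 256 65
```

Under these assumptions the sparse policy keeps roughly block_size / retention_interval of the snapshots that dense retention would keep.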

Change. MambaManager.reachable_block_mask now sparsifies state-snapshot retention the same way SlidingWindowManager does: keep one cached state per retention_interval-sized segment (plus the latest replay boundary) instead of one per block. A hit resumes from the nearest retained boundary (at most retention_interval tokens coarser), costing negligible extra prefill while freeing the intermediate snapshots for reuse. Also:

  • MambaManager.cache_blocks now tolerates sparse (unhashed) blocks in the cached range.
  • _validate_prefix_cache_retention_interval now accepts models with a Mamba group.

Default behavior is unchanged: with the interval unset, Mamba caches densely (every boundary), exactly as before.
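
For readers who want the shape of the idea without opening the diff, a minimal standalone sketch of the sparsified mask follows. It is not the actual `MambaManager.reachable_block_mask`: the function name `sparse_retention_mask` and its signature are hypothetical, and rule (1), keeping the block whose end lands on a segment boundary, is an assumption about which block in each segment is retained. Rule (2) mirrors the replay-boundary arithmetic quoted in the review thread.

```python
from typing import Optional


def sparse_retention_mask(
    start_block: int,
    end_block: int,
    block_size: int,
    retention_interval: int,
    num_prompt_tokens: Optional[int] = None,
    alignment_tokens: Optional[int] = None,
) -> list[bool]:
    """Keep/skip flags for blocks in [start_block, end_block) (hypothetical)."""
    if retention_interval % block_size != 0:
        raise ValueError("retention_interval must be a multiple of block_size")
    blocks_per_segment = retention_interval // block_size

    # (1) Assumed rule: keep the block whose end lands on a segment boundary.
    mask = [
        (i + 1) % blocks_per_segment == 0 for i in range(start_block, end_block)
    ]

    # (2) Replay boundary: also keep the latest aligned boundary before the
    # end of the prompt, so an exact prompt replay can still hit.
    if num_prompt_tokens is not None:
        align = alignment_tokens or block_size  # assumed default alignment
        latest = (num_prompt_tokens - 1) // align * align
        boundary_block = latest // block_size - 1
        if start_block <= boundary_block < end_block:
            mask[boundary_block - start_block] = True

    return mask


print(sparse_retention_mask(0, 8, block_size=128, retention_interval=512))
# [False, False, False, True, False, False, False, True]
```

With a 700-token prompt and 128-token alignment, the replay boundary falls at token 640, so block 4 is additionally kept.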

Why this is not a duplicate

Test Plan

.venv/bin/python -m pytest tests/v1/core/test_prefix_caching.py \
  -k "retention or reachable" -v

.venv/bin/python -m pytest tests/v1/core/ \
  -k "retention or mamba or prefix_cach" -q

E2E: Kimi-Linear-48B-A3B-Instruct (block_size 512, TP4, multi-turn prefix-on, 50 prompts × 60 turns), comparing default vs VLLM_PREFIX_CACHE_RETENTION_INTERVAL=2048.

Test Result

Unit tests (CPU; this change is pure-Python KV-cache scheduling logic):

  • test_prefix_caching.py -k "retention or reachable", including the new test_mamba_reachable_block_mask_sparsifies_retention: 10 passed.
  • tests/v1/core/ -k "retention or mamba or prefix_cach" — all change-relevant tests pass in a CPU-only env; the remaining e2e/flash-attn tests in this selection require a full CUDA build + GPU and were not run here.

E2E: validated on Kimi-Linear-48B-A3B-Instruct with the config above; with VLLM_PREFIX_CACHE_RETENTION_INTERVAL=2048


AI assistance disclosure: AI assistance (Claude) was used for this change. The human submitter has reviewed every changed line and run the tests above.

🤖 Generated with Claude Code

…attention

Wire VLLM_PREFIX_CACHE_RETENTION_INTERVAL to Mamba groups, completing the
existing `# TODO: Support Mamba/linear attention` (only sliding-window
attention honored it before).

Mamba/KDA prefix caching retains a full recurrent-state snapshot once per
block_size-token boundary. At small attention block sizes (e.g. 128 under
decoupled hybrid paging) each snapshot spans several base blocks, and dense
per-boundary retention saturates the KV pool — at block_size 128 the Mamba
snapshots occupy ~80% of the blocks — leaving no uncached headroom, so the
allocator is forced to evict live attention prefixes. The prefix-cache hit
rate then collapses late in long multi-turn runs (~85%, down to ~75% under
load) with ~18% lower throughput and ~3x worse p99, while larger block sizes
(256/512) are unaffected.

MambaManager.reachable_block_mask now sparsifies state-snapshot retention the
same way SlidingWindowManager does: keep one cached state per
retention_interval-sized segment (plus the latest replay boundary) instead of
one per block. A hit resumes from the nearest retained boundary (at most
retention_interval tokens coarser), costing negligible extra prefill while
freeing the intermediate snapshots for reuse. Also fixes
MambaManager.cache_blocks to tolerate sparse (unhashed) blocks in the cached
range, and relaxes _validate_prefix_cache_retention_interval to accept models
with a Mamba group.

Validated on Kimi-Linear-48B-A3B-Instruct (decoupled hybrid paging, block_size
128, TP4, multi-turn prefix-on, 50 prompts x 60 turns): with
VLLM_PREFIX_CACHE_RETENTION_INTERVAL=512 the prefix-cache hit rate recovers to
98.5% (parity with block_size 512), throughput to ~200K tok/s, with zero
failed requests. Default behavior (interval unset) is unchanged: Mamba caches
densely.

Test commands run:
  .venv/bin/python -m pytest tests/v1/core/test_prefix_caching.py \
    -k "retention or reachable" -v        # 10 passed
  .venv/bin/python -m pytest tests/v1/core/ \
    -k "retention or decoupled or buddy or mamba or prefix_cach" -q  # 136 passed

This is not a duplicate: no open PR wires retention-interval sparsification to
Mamba/linear-attention groups (the codebase carried it as a TODO).

AI assistance (Claude) was used for this change; the human submitter has
reviewed every changed line.

Signed-off-by: Dao Le <daole@inferact.ai>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
@Dao007forever Dao007forever force-pushed the mamba-retention-interval branch from 2065198 to e5df62d Compare June 16, 2026 17:34
Dao007forever and others added 2 commits June 16, 2026 10:48
Signed-off-by: Dao Le <Dao007forever@gmail.com>
@WoosukKwon WoosukKwon added the ready label Jun 16, 2026
@wzhao18

wzhao18 commented Jun 16, 2026

Contributor

Looks good to me. Thanks!

Just need to update the description for VLLM_PREFIX_CACHE_RETENTION_INTERVAL:

vllm/vllm/envs.py

Line 1055 in 475a6ad

# Applies to sliding-window attention for now but not yet Mamba/linear attention.
to include mamba.

@ivanium ivanium left a comment

Collaborator

Makes sense to me at a high level. Left one comment on mamba+spec decoding. Also cc @ZJY0516

Comment thread vllm/v1/core/single_type_kv_cache_manager.py
@QilaiZhang

Thanks for the PR. I have a small clarification question. My understanding is that Kimi-Linear-48B-A3B-Instruct prefix caching currently only supports align mode, and in align mode it does not save a recurrent-state snapshot at every block_size boundary. I may be missing something here, though. Could you explain how this applies to the Kimi-Linear benchmark case?

@Dao007forever

Dao007forever commented Jun 17, 2026

Contributor Author

Hi @QilaiZhang, in align mode it still deposits a snapshot every block_size tokens. Those linger as unreferenced-but-cached blocks and accumulate as context grows. This causes block reuse and a drop in the prefix-cache hit rate. Live footprint stays ~2 blocks; cached footprint grows ~1 per block_size tokens, which is what the retention interval thins.

(I was running in a test setting that works around the uniform block size of HMA.)

@QilaiZhang

@Dao007forever Thanks, that makes sense regarding freed state blocks remaining as unreferenced cached blocks.

One follow-up: in upstream align mode, _mamba_block_aligned_split() seems to align scheduled chunks to a multiple of block_size, but not necessarily force one scheduled chunk per block. If a scheduled chunk spans multiple blocks, intermediate entries are null_blocks and only the chunk-end state is cached.

Was your Kimi-Linear benchmark configured so that aligned prefill chunks are effectively one block_size each, perhaps due to the HMA workaround you mentioned? That would explain the “one snapshot per block_size tokens” behavior.

@Dao007forever

Contributor Author

You are right that in prefill we snapshot only at each chunk end, but in decode we still snapshot every block_size tokens.
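
A toy model of the two paths discussed above, in case it helps later readers. Everything here is an assumption for illustration: `snapshot_offsets` is a made-up helper, it treats each block-aligned prefill chunk end as the only prefill snapshot, and it treats decode as depositing a snapshot at every block_size boundary it crosses. It does not call any vLLM code.

```python
def snapshot_offsets(
    prefill_chunks: list[int], decode_tokens: int, block_size: int
) -> list[int]:
    """Token offsets that end up holding a cached state (toy model)."""
    offsets = []
    pos = 0
    for chunk in prefill_chunks:
        pos += chunk
        # Prefill: intermediate blocks inside a chunk stay null; only a
        # block-aligned chunk end holds a state snapshot.
        if pos % block_size == 0:
            offsets.append(pos)

    prompt_len = pos
    # Decode: one snapshot each time generation crosses a block boundary.
    first = (prompt_len // block_size + 1) * block_size
    for boundary in range(first, prompt_len + decode_tokens + 1, block_size):
        offsets.append(boundary)
    return offsets


# Two 512-token prefill chunks, then 300 decoded tokens, block_size 128.
print(snapshot_offsets([512, 512], decode_tokens=300, block_size=128))
# [512, 1024, 1152, 1280]
```

In this model the prefill side is already sparse when chunks span several blocks, while the decode side is dense, which is where the retention interval has the most to thin.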

@QilaiZhang

Thanks, that makes sense. I was only thinking about the prefill path and missed the decode-side behavior across block boundaries. The distinction between live footprint and unreferenced-but-cached footprint is helpful. Thanks for clarifying.

@ivanium ivanium left a comment

Collaborator

One quick comment

Comment on lines +1063 to +1073
# (2) Replay boundary. ``get_computed_blocks`` caps hits at
# ``num_prompt - 1``, so an exact prompt replay lands on the latest
# fine-aligned boundary. Sparse retention would otherwise skip its
# state, so keep it explicitly.
if num_prompt_tokens is not None:
    latest = (num_prompt_tokens - 1) // alignment_tokens * alignment_tokens
    boundary_block = latest // block_size - 1
    if start_block <= boundary_block < end_block:
        mask[boundary_block - start_block] = True

return mask

Collaborator

not sure if this part can work because for mamba, we need scheduler side changes to cache the end of the prompt. Maybe we can raise NotImplementedError when num_prompt_tokens is given?

@Dao007forever Dao007forever Jun 22, 2026

Contributor Author

Good catch to raise it! I think we're safe here though — the mask is purely subtractive. A True never forces a block to be cached, it just declines to skip it; the real gate is blk.is_null in cache_full_blocks:

if blk.is_null or (block_mask is not None and not block_mask[i]):
    continue

So the layers stay separate: the scheduler decides where a snapshot lands (which blocks are non-null), and the mask just picks which existing ones to keep. If there's no state at the boundary, it's a null_block and gets skipped regardless of the mask — so we can never fabricate a cache entry over a stateless block. Worst case without scheduler changes is that the branch is inert, not incorrect (and in decode we snapshot every block_size, so it usually does fire).
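
A standalone sketch of that argument, for anyone who wants to poke at it. The `Block` dataclass and `blocks_to_cache` helper are stand-ins invented for this example, not vLLM types; only the `if ... continue` condition is taken from the snippet above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Block:
    """Minimal stand-in for a KV-cache block (hypothetical)."""

    block_id: int
    is_null: bool = False


def blocks_to_cache(
    blocks: Sequence[Block], block_mask: Optional[Sequence[bool]] = None
) -> list[int]:
    """Return ids of blocks that would be cached under the given mask."""
    kept = []
    for i, blk in enumerate(blocks):
        # Same gate as quoted above: null blocks are always skipped, and the
        # mask can only remove further blocks, never add one.
        if blk.is_null or (block_mask is not None and not block_mask[i]):
            continue
        kept.append(blk.block_id)
    return kept


blocks = [Block(0, is_null=True), Block(1), Block(2, is_null=True), Block(3)]
print(blocks_to_cache(blocks))                             # [1, 3]
print(blocks_to_cache(blocks, [True, False, True, True]))  # [3]
```

Setting the mask to True for the null blocks 0 and 2 changes nothing, which is the "inert, not incorrect" case.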

Collaborator

Makes sense!

@ivanium ivanium left a comment

Collaborator

LGTM!

@WoosukKwon WoosukKwon merged commit 430a95a into vllm-project:main Jun 23, 2026
54 of 56 checks passed
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…attention (vllm-project#45845)

Signed-off-by: Dao Le <daole@inferact.ai>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…attention (vllm-project#45845)

Signed-off-by: Dao Le <daole@inferact.ai>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>

Labels

ready, v1

5 participants