[Perf][KVConnector][Mooncake] Compact chunk-hash keys and zero-copy lookup wire format #45969
Merged
Author (collaborator): cc @wzhao18 and @Dao007forever for comments
njhill reviewed on Jun 18, 2026
Comment on lines +1555 to +1560:
all_frames = [
    LOOKUP_MSG,
    token_len.to_bytes(4, byteorder="big"),
    hash_len.to_bytes(2, byteorder="big"),
    b"".join(block_hashes),
]
Member: nit: use a tuple.
Suggested change:
- all_frames = [
-     LOOKUP_MSG,
-     token_len.to_bytes(4, byteorder="big"),
-     hash_len.to_bytes(2, byteorder="big"),
-     b"".join(block_hashes),
- ]
+ all_frames = (
+     LOOKUP_MSG,
+     token_len.to_bytes(4, byteorder="big"),
+     hash_len.to_bytes(2, byteorder="big"),
+     b"".join(block_hashes),
+ )
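A minimal sketch of the frame layout in the snippet above, runnable without ZeroMQ. The PR does not show the value of `LOOKUP_MSG`, so the constant below is a hypothetical stand-in, and `encode_lookup` / `decode_lookup` are illustrative helpers rather than the connector's API. For simplicity this decoder materializes a list; the PR's server side instead wraps the blob in a lazy view.

```python
import hashlib

# Hypothetical stand-in: the PR does not show LOOKUP_MSG's actual value.
LOOKUP_MSG = b"lookup"


def encode_lookup(token_len: int, block_hashes: list[bytes]) -> tuple[bytes, ...]:
    """Build the frames: message type, u32 token length, u16 hash width, raw hashes."""
    hash_len = len(block_hashes[0]) if block_hashes else 0
    # The flat blob is only splittable if every hash has the same width.
    if any(len(h) != hash_len for h in block_hashes):
        raise ValueError("all block hashes must have the same length")
    return (
        LOOKUP_MSG,
        token_len.to_bytes(4, byteorder="big"),
        hash_len.to_bytes(2, byteorder="big"),
        b"".join(block_hashes),
    )


def decode_lookup(frames) -> tuple[int, list[bytes]]:
    """Split the frames back into (token_len, hashes); materializes a list."""
    token_len = int.from_bytes(frames[1], byteorder="big")
    hash_len = int.from_bytes(frames[2], byteorder="big")
    blob = memoryview(frames[3])
    if hash_len == 0:
        return token_len, []
    return token_len, [
        bytes(blob[i:i + hash_len]) for i in range(0, len(blob), hash_len)
    ]


hashes = [hashlib.sha256(str(i).encode()).digest() for i in range(4)]
frames = encode_lookup(1024, hashes)
assert decode_lookup(frames) == (1024, hashes)
# Raw digests take half the bytes of the hex strings they replace.
assert len(frames[3]) * 2 == sum(len(h.hex()) for h in hashes)
```

The fixed-width `hash_len` frame is what lets the receiver split the blob without any per-element framing.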
hashes_str = self.decoder.decode(hash_frames)
block_hashes = [BlockHash(bytes.fromhex(s)) for s in hashes_str]
hash_len = int.from_bytes(all_frames[2], byteorder="big")
blob = bytes(all_frames[3])
Member:
Suggested change:
- blob = bytes(all_frames[3])
+ blob = all_frames[3].buffer
This will make a copy otherwise.
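The copy the reviewer points out can be reproduced with plain Python buffers. In pyzmq, `recv_multipart(copy=False)` returns `zmq.Frame` objects, and `Frame.buffer` is a `memoryview` of the frame's contents. To stay self-contained, the sketch below substitutes a `bytearray` for the frame (an assumption) and shows that `bytes(...)` snapshots the data while a `memoryview` and its slices share it.

```python
def copy_vs_view() -> tuple[bytes, bytes, bytes]:
    # Hypothetical stand-in: a bytearray plays the role of a received frame's memory.
    backing = bytearray(8)
    view = memoryview(backing)   # no copy: shares memory with `backing`
    head = view[:2]              # slicing a memoryview does not copy either
    snapshot = bytes(view)       # bytes(...) allocates and copies all 8 bytes
    backing[0] = 0xFF            # write through the underlying buffer
    # The view and its slice see the write; the bytes snapshot does not.
    return bytes(view[:1]), bytes(head[:1]), snapshot[:1]


via_view, via_slice, via_copy = copy_vs_view()
assert via_view == b"\xff" and via_slice == b"\xff"
assert via_copy == b"\x00"
```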
    of materializing all hashes upfront.
    """

    def __init__(self, blob: bytes, hash_len: int):
Member: combined with the other suggestion
Suggested change:
- def __init__(self, blob: bytes, hash_len: int):
+ def __init__(self, blob: memoryview, hash_len: int):
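A minimal sketch of a lazy `Sequence` over a `memoryview`, the idea the PR describes for `BlobBlockHashes`. The class name `LazyBlockHashes` and the local `BlockHash` `NewType` are hypothetical stand-ins; the PR's actual class body is not shown in this thread. Only the indexed hash is copied out of the buffer, and the `Sequence` mixin derives iteration, `in`, `index`, and `count` from `__len__` and `__getitem__`.

```python
from collections.abc import Sequence
from typing import NewType

# Hypothetical stand-in for vLLM's BlockHash type.
BlockHash = NewType("BlockHash", bytes)


class LazyBlockHashes(Sequence):
    """Read-only sequence of fixed-width hashes backed by one flat buffer.

    Nothing is decoded up front; indexing copies out just the requested hash.
    """

    def __init__(self, blob: memoryview, hash_len: int):
        if hash_len <= 0 or len(blob) % hash_len != 0:
            raise ValueError("blob length must be a multiple of a positive hash_len")
        self._blob = blob
        self._hash_len = hash_len

    def __len__(self) -> int:
        return len(self._blob) // self._hash_len

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        n = len(self)
        if index < 0:
            index += n
        if not 0 <= index < n:
            raise IndexError("block hash index out of range")
        start = index * self._hash_len
        return BlockHash(bytes(self._blob[start:start + self._hash_len]))


raw = memoryview(b"".join(bytes([i]) * 4 for i in range(3)))
hashes = LazyBlockHashes(raw, hash_len=4)
assert len(hashes) == 3 and hashes[-1] == b"\x02" * 4
assert b"\x01" * 4 in hashes  # __contains__ comes from the Sequence mixin
```

Typing `blob` as `memoryview` matters because it keeps the slicing in `__getitem__` zero-copy until a single hash is requested.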
Contributor: This pull request has merge conflicts that must be resolved before it can be
Branch updated from 8342af1 to 39451ac
Author (collaborator): @njhill Thanks for the comments! I updated the PR with the suggestions.
njhill approved these changes on Jun 19, 2026
Contributor: Hi @ivanium, the pre-commit checks have failed. Please run:
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits,
…head of transferring block hashes Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Branch updated from f201656 to 4381743
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request on Jun 21, 2026
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request on Jun 22, 2026
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request on Jun 24, 2026
qli88 pushed a commit to qli88/vllm that referenced this pull request on Jun 26, 2026
…ookup wire format (vllm-project#45969) Signed-off-by: Qiang Li <qiang.li2@amd.com>
Purpose
Two related optimizations to the `MooncakeStoreConnector` prefix-lookup path. Both reduce the cost of moving block hashes around, which becomes acute when the model's `block_size` is much larger than the connector's `hash_block_size`.

1. Compact chunk-hash keys
When `block_size > hash_block_size`, the Mooncake key for a single `block_size` chunk was previously built by concatenating all of its fine-grained sub-hashes (`BlockHashListWithBlockSize` joined every sub-hash). The Mooncake key therefore grew linearly with the `block_size / hash_block_size` ratio.

This is especially costly for DeepSeek-V4-style configs, which run `block_size=256` with `hash_block_size=4`: a ratio of 64, so every Mooncake key became 64 hash digests concatenated together.

Because the engine chains block hashes (each block hash folds in its predecessor), a chunk's last sub-hash already uniquely identifies the whole chunk and its prefix. The new `chunk_hashes_for_block_size` / `_CompactChunkHashList` key each chunk by that single trailing digest, so the Mooncake key stays one digest regardless of the ratio (64x smaller keys for the DSv4 config above, and proportionally smaller anywhere `block_size != hash_block_size`).

2. Zero-copy lookup wire format
The lookup RPC previously msgpack-encoded a `list[str]` of hex digests: hex doubles the byte count, and msgpack adds per-element framing. The protocol now sends a `hash_len` frame (u16) followed by the raw fixed-size hashes concatenated back-to-back. The server reconstructs them through `BlobBlockHashes`, a lazy `Sequence[BlockHash]` view over the flat buffer, so it never materializes the full hash list upfront. This removes the hex inflation, the msgpack framing, and the eager allocation on the hot lookup path.

Test Plan
Tests are updated to assert the compact last-sub-hash keying and the new wire format.
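The compact last-sub-hash keying that these tests assert can be sketched with a toy chained hash. `chained_block_hashes`, `concatenated_chunk_keys`, and `compact_chunk_keys` are hypothetical helpers, and SHA-256 over `repr(block)` is an assumption rather than vLLM's hashing scheme; the only property relied on is that each hash folds in its predecessor. With `block_size=256` and `hash_block_size=4`, the old key is 64 digests long while the compact key is a single digest.

```python
import hashlib


def chained_block_hashes(tokens: list[int], hash_block_size: int) -> list[bytes]:
    """Toy chained hashing: each full block's digest folds in its predecessor's."""
    hashes: list[bytes] = []
    parent = b""
    for start in range(0, len(tokens) - hash_block_size + 1, hash_block_size):
        block = tokens[start:start + hash_block_size]
        parent = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(parent)
    return hashes


def concatenated_chunk_keys(sub_hashes: list[bytes], ratio: int) -> list[bytes]:
    """Old-style key: all `ratio` sub-hashes of a chunk joined together."""
    return [b"".join(sub_hashes[i:i + ratio])
            for i in range(0, len(sub_hashes) - ratio + 1, ratio)]


def compact_chunk_keys(sub_hashes: list[bytes], ratio: int) -> list[bytes]:
    """Compact key: only the last sub-hash of each full chunk."""
    return [sub_hashes[i + ratio - 1]
            for i in range(0, len(sub_hashes) - ratio + 1, ratio)]


ratio = 256 // 4  # block_size // hash_block_size for the DSv4-style config
subs = chained_block_hashes(list(range(512)), hash_block_size=4)
old, new = concatenated_chunk_keys(subs, ratio), compact_chunk_keys(subs, ratio)
assert [len(k) for k in old] == [64 * 32, 64 * 32]  # 64 SHA-256 digests per key
assert [len(k) for k in new] == [32, 32]            # one digest per key
assert [k[-32:] for k in old] == new                # compact key = old key's tail
```

Because each sub-hash already depends on every earlier token, changing any token in an earlier chunk changes every later compact key, which is why the trailing digest can stand in for the whole concatenation.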
Test Result
85 passed.

Notes

This is not a duplicate of any open PR. No other open PR touches the chunk-hash key construction or the lookup serialization format; the only adjacent work is my own #45659 (async lookup), which changes the lookup call site but not its payload encoding. This change is rebased to stand alone on `main`.

AI assistance was used in preparing this change; the author has reviewed every changed line.
🤖 Generated with Claude Code