Skip to content

[KV Connector][Mooncake] Async lookup to reduce scheduler overhead#45659

Merged
njhill merged 10 commits into
vllm-project:mainfrom
ivanium:mk-store/async-lookup
Jun 18, 2026
Merged

[KV Connector][Mooncake] Async lookup to reduce scheduler overhead#45659
njhill merged 10 commits into
vllm-project:mainfrom
ivanium:mk-store/async-lookup

Conversation

@ivanium

@ivanium ivanium commented Jun 15, 2026

Copy link
Copy Markdown
Collaborator

Purpose

Looking up keys in Mooncake currently happens synchronously inside get_num_new_matched_tokens, on the scheduler's critical path. Each lookup costs ~1–2 ms per request on average, and that latency is paid serially during scheduling. This PR adds an optional async lookup mode that offloads the lookup to a background thread so its latency overlaps with the current scheduling step; results are consumed on a later step.

When async mode is on, get_num_new_matched_tokens submits the lookup and returns (None, False). The V1 scheduler already understands a None external-token count and re-queues the request to retry on a later step (vllm/v1/core/sched/scheduler.py, the ext_tokens is None branch), so no scheduler change is required. Once the background result is ready, a subsequent call returns the hit length normally.

Changes

  • worker.pyLookupKeyClient gains a daemon background thread, a job_queue, and get_or_submit() / discard() / process_lookups(). lookup() and reset() now serialize socket access with a socket_lock because the ZMQ REQ socket is shared across threads and is not thread-safe. close() stops the thread via a sentinel before tearing the socket down.
  • scheduler.py — adds a lookup_async flag (default False, opt-in via kv_connector_extra_config). When enabled, get_num_new_matched_tokens defers via get_or_submit; finished/aborted requests call client.discard() so a stale result is never served.
  • connector.py — return type widened to tuple[int | None, bool] to carry the deferral signal.
  • Tests — 3 new unit tests covering the async submit/poll path, discard(), and the scheduler deferral-then-report behavior.

Default is False: existing deployments keep the current synchronous, deterministic-per-step behavior unless they explicitly set kv_connector_extra_config: {"lookup_async": true}.

Not a duplicate

No open PR addresses async Mooncake lookup. #45503 touches LookupKeyServer.close() and worker teardown wiring — a different class and concern from the LookupKeyClient async-thread path here; the two are orthogonal and do not conflict.

Test plan

.venv/bin/python -m pytest tests/v1/kv_connector/unit/test_mooncake_store_connector.py -v

Result: 27 passed (includes the 3 new async tests).

Notes

AI assistance (Claude Code) was used to prepare this change. The author has reviewed every line and is responsible for the change end-to-end.

@mergify

mergify Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor
@mergify mergify Bot added the documentation Improvements or additions to documentation label Jun 16, 2026
ivanium added 5 commits June 16, 2026 23:48
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
@ivanium ivanium force-pushed the mk-store/async-lookup branch from c366cb4 to 8b80fad Compare June 16, 2026 23:48
@ivanium ivanium marked this pull request as ready for review June 16, 2026 23:49
@ivanium

ivanium commented Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

@zhewenl zhewenl left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

result = int.from_bytes(resp, "big")
return result

def get_or_submit(

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rename to poll_lookup?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel poll_lookup indicates a blocking call, while here we want it to be non-blocking. But I also feel the current name is a bit weird. Will think a better name

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about try_lookup?

# Sentinel from close(): stop the background thread.
return
req_id, token_len, block_hashes = job
with self.state_lock:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this lock if it's running in the same scheduler process?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll still need it for multi-threaded case. Like even when later we move this part to the scheduler side, it can still run in a background thread.

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

@njhill njhill left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ivanium this looks mostly good to me. I think it may be better rather that using a lock for the socket, to just ensure that its only accessed from one thread, whether that's the calling thread or the async thread.

Also I think there is technically a correctness issue where in consecutive scheduler steps the get_num_matched_tokens() call for the same requests could have different num_computed_tokens values (if the gpu cache changed), so the num matched tokens returned could be incorrect in this case. Sorry ignore this, I made a incorrect assumption about the impl.

And it might be simpler overall to use a single-thread ThreadPoolExecutor with Futures which already handles most of the state management - I'll try this out and push to another branch for consideration.

@njhill njhill force-pushed the mk-store/async-lookup branch from 03328dd to cf1e6ea Compare June 18, 2026 02:09
@njhill

njhill commented Jun 18, 2026

Copy link
Copy Markdown
Member

@ivanium here's what I'm thinking of re using futures and having all socket access from single thread rather than using locks: njhill@03328dd

(sorry I accidentally pushed to this PR branch first, so force-pushed to revert it)

Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

@Dao007forever Dao007forever left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀

@njhill njhill added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 18, 2026
mergify Bot and others added 2 commits June 18, 2026 03:30
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
@njhill njhill enabled auto-merge (squash) June 18, 2026 20:09
@njhill njhill merged commit 35e4dd4 into vllm-project:main Jun 18, 2026
74 checks passed
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…llm-project#45659)

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…llm-project#45659)

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…llm-project#45659)

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…llm-project#45659)

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…llm-project#45659)

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation kv-connector ready ONLY add when PR is ready to merge/full CI is needed v1

4 participants