[KV Connector][Mooncake] Async lookup to reduce scheduler overhead by ivanium · Pull Request #45659 · vllm-project/vllm

ivanium · 2026-06-15T07:16:33Z

Purpose

Looking up keys in Mooncake currently happens synchronously inside get_num_new_matched_tokens, on the scheduler's critical path. Each lookup costs ~1–2 ms per request on average, and that latency is paid serially during scheduling. This PR adds an optional async lookup mode that offloads the lookup to a background thread so its latency overlaps with the current scheduling step; results are consumed on a later step.

When async mode is on, get_num_new_matched_tokens submits the lookup and returns (None, False). The V1 scheduler already understands a None external-token count and re-queues the request to retry on a later step (vllm/v1/core/sched/scheduler.py, the ext_tokens is None branch), so no scheduler change is required. Once the background result is ready, a subsequent call returns the hit length normally.
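The deferral contract described above can be sketched as follows. This is a minimal illustration, not the connector's actual code: `FakeAsyncLookupClient`, `try_lookup`, and `complete` are hypothetical stand-ins, and the second tuple element is simply held at `False` because its real value depends on the connector.

```python
from typing import Optional


class FakeAsyncLookupClient:
    """Hypothetical stand-in for the async client; a test completes lookups by hand."""

    def __init__(self) -> None:
        self._pending: set = set()
        self._done: dict = {}

    def try_lookup(self, req_id: str) -> Optional[int]:
        # Return a finished result if one exists, otherwise submit and return None.
        if req_id in self._done:
            return self._done.pop(req_id)
        self._pending.add(req_id)
        return None

    def complete(self, req_id: str, hit_tokens: int) -> None:
        # Simulates the background thread finishing the lookup.
        self._pending.discard(req_id)
        self._done[req_id] = hit_tokens


def get_num_new_matched_tokens(
    client: FakeAsyncLookupClient, req_id: str, num_computed_tokens: int
) -> tuple:
    hit = client.try_lookup(req_id)
    if hit is None:
        # Result not ready: the scheduler re-queues the request for a later step.
        return None, False
    # Only tokens beyond what is already computed locally count as new.
    new_tokens = max(hit - num_computed_tokens, 0)
    # The second flag is a placeholder in this sketch.
    return new_tokens, False
```

The first call for a request returns `(None, False)`; a later call, made after the lookup has completed, returns the hit length.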

Changes

worker.py — LookupKeyClient gains a daemon background thread, a job_queue, and get_or_submit() / discard() / process_lookups(). lookup() and reset() now serialize socket access with a socket_lock because the ZMQ REQ socket is shared across threads and is not thread-safe. close() stops the thread via a sentinel before tearing the socket down.
scheduler.py — adds a lookup_async flag (default False, opt-in via kv_connector_extra_config). When enabled, get_num_new_matched_tokens defers via get_or_submit; finished/aborted requests call client.discard() so a stale result is never served.
connector.py — return type widened to tuple[int | None, bool] to carry the deferral signal.
Tests — 3 new unit tests covering the async submit/poll path, discard(), and the scheduler deferral-then-report behavior.
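The worker-side pattern listed above (job queue, daemon thread, lock around the shared socket, sentinel shutdown, and discard) can be sketched roughly as below. This is an illustrative sketch under assumptions, not the PR's code: `AsyncLookupSketch` and `lookup_fn` are hypothetical, and `lookup_fn` stands in for the blocking ZMQ round trip.

```python
import queue
import threading
from typing import Callable, Optional

_STOP = object()  # sentinel that tells the worker thread to exit


class AsyncLookupSketch:
    def __init__(self, lookup_fn: Callable[[int, list], int]) -> None:
        self._lookup_fn = lookup_fn          # stands in for the blocking socket call
        self._io_lock = threading.Lock()     # serializes access to the shared "socket"
        self._state_lock = threading.Lock()  # guards _pending and _results
        self._pending: set = set()
        self._results: dict = {}
        self._jobs: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def get_or_submit(self, req_id, token_len, block_hashes) -> Optional[int]:
        with self._state_lock:
            if req_id in self._results:
                return self._results.pop(req_id)  # finished: hand back the hit length
            if req_id in self._pending:
                return None                       # still in flight
            self._pending.add(req_id)
        self._jobs.put((req_id, token_len, block_hashes))
        return None

    def discard(self, req_id) -> None:
        # Called for finished/aborted requests so a stale result is never served.
        with self._state_lock:
            self._pending.discard(req_id)
            self._results.pop(req_id, None)

    def _run(self) -> None:
        while True:
            job = self._jobs.get()
            if job is _STOP:
                return
            req_id, token_len, block_hashes = job
            with self._io_lock:
                hit = self._lookup_fn(token_len, block_hashes)
            with self._state_lock:
                # Drop the result if the request was discarded while in flight.
                if req_id in self._pending:
                    self._pending.discard(req_id)
                    self._results[req_id] = hit

    def close(self) -> None:
        # Stop the thread before any socket teardown would happen.
        self._jobs.put(_STOP)
        self._thread.join()
```

Because the queue is FIFO, jobs submitted before `close()` are processed before the sentinel is seen.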

Default is False: existing deployments keep the current synchronous, deterministic-per-step behavior unless they explicitly set kv_connector_extra_config: {"lookup_async": true}.
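As a rough illustration of the opt-in, the snippet below builds a KV-transfer configuration as JSON. Only `kv_connector_extra_config` and `lookup_async` come from this PR; the `kv_connector` and `kv_role` values are assumptions for the sketch and should be checked against the vLLM version in use.

```python
import json

kv_transfer_config = {
    "kv_connector": "MooncakeStoreConnector",  # assumed connector name, not confirmed here
    "kv_role": "kv_both",                      # assumed role for this sketch
    "kv_connector_extra_config": {"lookup_async": True},  # opt in to async lookup
}

# Serialized form, e.g. for passing as a JSON string in a launch configuration.
config_json = json.dumps(kv_transfer_config)
print(config_json)
```

Omitting `lookup_async` leaves the synchronous behavior in place.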

Not a duplicate

No open PR addresses async Mooncake lookup. #45503 touches LookupKeyServer.close() and worker teardown wiring — a different class and concern from the LookupKeyClient async-thread path here; the two are orthogonal and do not conflict.

Test plan

.venv/bin/python -m pytest tests/v1/kv_connector/unit/test_mooncake_store_connector.py -v

Result: 27 passed (includes the 3 new async tests).

Notes

AI assistance (Claude Code) was used to prepare this change. The author has reviewed every line and is responsible for the change end-to-end.

mergify · 2026-06-16T23:45:37Z

Documentation preview: https://vllm--45659.org.readthedocs.build/en/45659/

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

ivanium · 2026-06-16T23:50:23Z

cc @wzhao18 @Dao007forever @zhewenl

zhewenl

LGTM!

zhewenl · 2026-06-17T16:38:40Z

        result = int.from_bytes(resp, "big")
        return result

+    def get_or_submit(


rename to poll_lookup?

I feel poll_lookup indicates a blocking call, while here we want it to be non-blocking. But I also feel the current name is a bit weird. I'll think of a better name.

How about try_lookup?

Dao007forever · 2026-06-17T17:17:57Z

+                # Sentinel from close(): stop the background thread.
+                return
+            req_id, token_len, block_hashes = job
+            with self.state_lock:


Do we need this lock if it's running in the same scheduler process?

We'll still need it for the multi-threaded case. Even if we later move this part to the scheduler side, it can still run in a background thread.

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

njhill

Thanks @ivanium, this looks mostly good to me. I think it may be better, rather than using a lock for the socket, to just ensure that it's only accessed from one thread, whether that's the calling thread or the async thread.

Also I think there is technically a correctness issue where in consecutive scheduler steps the get_num_matched_tokens() call for the same requests could have different num_computed_tokens values (if the GPU cache changed), so the num matched tokens returned could be incorrect in this case. Sorry, ignore this; I made an incorrect assumption about the impl.

And it might be simpler overall to use a single-thread ThreadPoolExecutor with Futures, which already handles most of the state management. I'll try this out and push to another branch for consideration.
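A minimal sketch of the futures-based alternative suggested here, under assumptions: `FutureLookupSketch` and `lookup_fn` are hypothetical names, `lookup_fn` stands in for the socket round trip, and this is not the code from the linked commit. With one worker, every `lookup_fn` call runs on the same thread, so no socket lock is needed, and `_futures` is touched only by the calling thread.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


class FutureLookupSketch:
    def __init__(self, lookup_fn: Callable[[int, list], int]) -> None:
        self._lookup_fn = lookup_fn
        # A single worker keeps all lookup_fn calls on one thread.
        self._pool = ThreadPoolExecutor(max_workers=1)
        # Accessed only from the calling (scheduler) thread in this sketch.
        self._futures: dict = {}

    def try_lookup(self, req_id, token_len, block_hashes) -> Optional[int]:
        fut = self._futures.get(req_id)
        if fut is None:
            # First call for this request: submit and defer.
            self._futures[req_id] = self._pool.submit(
                self._lookup_fn, token_len, block_hashes
            )
            return None
        if not fut.done():
            return None  # still running: defer again
        del self._futures[req_id]
        return fut.result()

    def discard(self, req_id) -> None:
        fut = self._futures.pop(req_id, None)
        if fut is not None:
            fut.cancel()  # has no effect if the lookup already started or finished

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

The Future carries the pending/done state, so the explicit job queue, sentinel, and state lock of a hand-rolled thread are not needed.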

njhill · 2026-06-18T02:13:08Z

@ivanium here's what I'm thinking of re using futures and having all socket access from single thread rather than using locks: njhill@03328dd

(sorry I accidentally pushed to this PR branch first, so force-pushed to revert it)

Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

Dao007forever

🚀

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

…llm-project#45659) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

…llm-project#45659) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>

…llm-project#45659) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

mergify Bot added v1 kv-connector labels Jun 15, 2026

mergify Bot added the documentation label Jun 16, 2026

ivanium added 5 commits June 16, 2026 23:48

feat (mk store): async lookup to avoid any scheduler delay

4d6d727

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

chore: skip cancelled reqs

5dd02ad

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

chore: minor fixes

39c9581

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

chore: format

14210ce

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

chore: doc updates

8b80fad

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

ivanium force-pushed the mk-store/async-lookup branch from c366cb4 to 8b80fad June 16, 2026 23:48

ivanium marked this pull request as ready for review June 16, 2026 23:49

ivanium requested review from ApostaC, NickLucche, orozery and xuechendi as code owners June 16, 2026 23:49

zhewenl approved these changes Jun 17, 2026


Dao007forever reviewed Jun 17, 2026


This was referenced Jun 17, 2026

[Perf][KVConnector][Mooncake] Compact chunk-hash keys and zero-copy lookup wire format #45969

Merged

[Perf][KVConnector][Mooncake] Parallelize KV load with a receive-thread pool #45971

Merged

chore: rename func

cf1e6ea

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

njhill reviewed Jun 18, 2026


njhill force-pushed the mk-store/async-lookup branch from 03328dd to cf1e6ea June 18, 2026 02:09

use futures

7c43af8

Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

Dao007forever approved these changes Jun 18, 2026


njhill approved these changes Jun 18, 2026


njhill added the ready label Jun 18, 2026

mergify Bot and others added 2 commits June 18, 2026 03:30

Merge branch 'main' into mk-store/async-lookup

209aab0

fix: test cases

6ee9754

Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>

Merge branch 'main' into mk-store/async-lookup

ef1235a

njhill enabled auto-merge (squash) June 18, 2026 20:09

njhill merged commit 35e4dd4 into vllm-project:main Jun 18, 2026
74 checks passed

njhill merged 10 commits into vllm-project:main from ivanium:mk-store/async-lookup
