[KV Connector][Mooncake] Async lookup to reduce scheduler overhead#45659
Conversation
|
Documentation preview: https://vllm--45659.org.readthedocs.build/en/45659/ |
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
c366cb4 to
8b80fad
Compare
| result = int.from_bytes(resp, "big") | ||
| return result | ||
|
|
||
| def get_or_submit( |
There was a problem hiding this comment.
I feel poll_lookup indicates a blocking call, while here we want it to be non-blocking. But I also feel the current name is a bit weird. Will think a better name
There was a problem hiding this comment.
How about try_lookup?
| # Sentinel from close(): stop the background thread. | ||
| return | ||
| req_id, token_len, block_hashes = job | ||
| with self.state_lock: |
There was a problem hiding this comment.
Do we need this lock if it's running in the same scheduler process?
There was a problem hiding this comment.
We'll still need it for multi-threaded case. Like even when later we move this part to the scheduler side, it can still run in a background thread.
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
There was a problem hiding this comment.
Thanks @ivanium this looks mostly good to me. I think it may be better rather that using a lock for the socket, to just ensure that its only accessed from one thread, whether that's the calling thread or the async thread.
Also I think there is technically a correctness issue where in consecutive scheduler steps the Sorry ignore this, I made a incorrect assumption about the impl.get_num_matched_tokens() call for the same requests could have different num_computed_tokens values (if the gpu cache changed), so the num matched tokens returned could be incorrect in this case.
And it might be simpler overall to use a single-thread ThreadPoolExecutor with Futures which already handles most of the state management - I'll try this out and push to another branch for consideration.
03328dd to
cf1e6ea
Compare
|
@ivanium here's what I'm thinking of re using futures and having all socket access from single thread rather than using locks: njhill@03328dd (sorry I accidentally pushed to this PR branch first, so force-pushed to revert it) |
Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai>
…llm-project#45659) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
…llm-project#45659) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
…llm-project#45659) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
…llm-project#45659) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com>
…llm-project#45659) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
Purpose
Looking up keys in Mooncake currently happens synchronously inside
get_num_new_matched_tokens, on the scheduler's critical path. Each lookup costs ~1–2 ms per request on average, and that latency is paid serially during scheduling. This PR adds an optional async lookup mode that offloads the lookup to a background thread so its latency overlaps with the current scheduling step; results are consumed on a later step.When async mode is on,
get_num_new_matched_tokenssubmits the lookup and returns(None, False). The V1 scheduler already understands aNoneexternal-token count and re-queues the request to retry on a later step (vllm/v1/core/sched/scheduler.py, theext_tokens is Nonebranch), so no scheduler change is required. Once the background result is ready, a subsequent call returns the hit length normally.Changes
worker.py—LookupKeyClientgains a daemon background thread, ajob_queue, andget_or_submit()/discard()/process_lookups().lookup()andreset()now serialize socket access with asocket_lockbecause the ZMQREQsocket is shared across threads and is not thread-safe.close()stops the thread via a sentinel before tearing the socket down.scheduler.py— adds alookup_asyncflag (defaultFalse, opt-in viakv_connector_extra_config). When enabled,get_num_new_matched_tokensdefers viaget_or_submit; finished/aborted requests callclient.discard()so a stale result is never served.connector.py— return type widened totuple[int | None, bool]to carry the deferral signal.discard(), and the scheduler deferral-then-report behavior.Default is
False: existing deployments keep the current synchronous, deterministic-per-step behavior unless they explicitly setkv_connector_extra_config: {"lookup_async": true}.Not a duplicate
No open PR addresses async Mooncake lookup. #45503 touches
LookupKeyServer.close()and worker teardown wiring — a different class and concern from theLookupKeyClientasync-thread path here; the two are orthogonal and do not conflict.Test plan
Result: 27 passed (includes the 3 new async tests).
Notes
AI assistance (Claude Code) was used to prepare this change. The author has reviewed every line and is responsible for the change end-to-end.