[Bugfix] Defer offload reads while transfers are pending #46231
Thanks for reviewing. The pre-run-check is blocked because this account has fewer than 4 merged PRs and the PR does not yet have a ready/verified label. Could a maintainer please add the appropriate label if this fix is ready for CI?
orozery
left a comment
Thanks @Palaiologos1453.
Please see my comment on the issue:
#46014 (comment)
2114400 to ec2d969
I pushed an update with a scheduler-level regression test for this exact async batch-queue ordering. The new test first creates pending store jobs, then calls
Local checks I could run in this environment:
The targeted pytest still cannot run on this Windows checkout because importing vLLM tries to load the unbuilt
Hi @Palaiologos1453, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits,
Signed-off-by: test test <2260891073@qq.com>
ec2d969 to 9dcb4a8
orozery
left a comment
@Palaiologos1453 Thanks for this fix!
…t#46231) Signed-off-by: test test <2260891073@qq.com>
…t#46231) Signed-off-by: test test <2260891073@qq.com>
…t#46231) Signed-off-by: test test <2260891073@qq.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
Fixes #46014.
This makes the offloading scheduler defer prefix-cache lookup for a request while that request still has in-flight transfer jobs. In the preemption/re-admission race described in the issue, this prevents the scheduler from issuing a load while a previously flushed store is still tracked in transfer_jobs. The request is retried on a later scheduling step after the worker completion is consumed and the transfer set drains.
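A minimal sketch of the deferral described above, not the actual vLLM code. `ToyOffloadScheduler`, its fields, `complete_job`, and the shape of `lookup` are hypothetical stand-ins; only the method name `get_num_new_matched_tokens()`, the `transfer_jobs` name, and the `(None, False)` return value come from this PR's description.

```python
from typing import Callable, Optional


class ToyOffloadScheduler:
    """Hypothetical stand-in for the offloading scheduler (illustration only)."""

    def __init__(self, lookup: Callable[[str], int]) -> None:
        # Assumed shape: lookup(req_id) returns the number of offloaded tokens matched.
        self.lookup = lookup
        # Request id -> ids of transfer jobs that are still in flight.
        self.transfer_jobs: dict[str, set[int]] = {}
        # Request id -> block ids recorded by an earlier admission attempt.
        self.block_ids: dict[str, list[int]] = {}

    def get_num_new_matched_tokens(self, req_id: str) -> tuple[Optional[int], bool]:
        if self.transfer_jobs.get(req_id):
            # A transfer is still pending: skip the lookup, drop stale block
            # ids, and signal "not ready yet" so the request is retried later.
            self.block_ids.pop(req_id, None)
            return None, False
        hit = self.lookup(req_id)
        # Second element is assumed here to mean "a load is needed".
        return hit, hit > 0

    def complete_job(self, req_id: str, job_id: int) -> None:
        # Called when a worker completion is consumed; drains the transfer set.
        jobs = self.transfer_jobs.get(req_id)
        if jobs is not None:
            jobs.discard(job_id)
            if not jobs:
                del self.transfer_jobs[req_id]


sched = ToyOffloadScheduler(lookup=lambda req_id: 16)
sched.transfer_jobs["req-0"] = {1}      # a previously flushed store, still tracked
sched.block_ids["req-0"] = [7, 8]       # left over from the preempted admission

first = sched.get_num_new_matched_tokens("req-0")   # deferred, lookup not called
sched.complete_job("req-0", 1)                      # transfer set drains
second = sched.get_num_new_matched_tokens("req-0")  # lookup runs on the retry
```

The point of returning early is ordering: the lookup only runs once no store or load for that request is outstanding, so a load cannot be issued while an earlier store is still in flight.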
Test coverage:
get_num_new_matched_tokens() to verify that a pending transfer returns (None, False), does not call lookup, and clears stale block ids for the attempted admission.
Local verification:
python -m pytest --confcutdir=tests/v1/kv_connector/unit/offloading_connector tests/v1/kv_connector/unit/offloading_connector/test_scheduler.py -k pending_transfer_defers_prefix_lookup -q
(with a uvloop stub, because uvloop does not support Windows; the tested path does not use the event loop implementation)
python -m compileall -q vllm/distributed/kv_transfer/kv_connector/v1/offloading/scheduler.py tests/v1/kv_connector/unit/offloading_connector/test_scheduler.py
git diff --check
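The three checks listed under test coverage could be expressed roughly as below. This is a sketch under assumptions, not the test added in this PR: the free function is a hypothetical stand-in for the scheduler method, and only the test name (taken from the `-k` filter above) and the asserted behaviour come from the description.

```python
from unittest.mock import MagicMock


def get_num_new_matched_tokens(req_id, transfer_jobs, block_ids, lookup):
    """Hypothetical stand-in for the method under test, not vLLM's implementation."""
    if transfer_jobs.get(req_id):
        # Pending transfer: clear stale block ids and defer without a lookup.
        block_ids.pop(req_id, None)
        return None, False
    hit = lookup(req_id)
    return hit, hit > 0


def test_pending_transfer_defers_prefix_lookup():
    lookup = MagicMock(return_value=32)
    transfer_jobs = {"req-0": {1}}   # one store job still in flight
    block_ids = {"req-0": [3, 4]}    # stale ids from the attempted admission

    result = get_num_new_matched_tokens("req-0", transfer_jobs, block_ids, lookup)

    assert result == (None, False)   # admission is deferred
    lookup.assert_not_called()       # prefix-cache lookup was skipped
    assert "req-0" not in block_ids  # stale block ids were cleared


test_pending_transfer_defers_prefix_lookup()
```

Using a `MagicMock` for `lookup` lets the test assert that the lookup was never called, which a return-value check alone would not show.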