[EPLB] Enable nixl eplb communicator for elastic ep#45013
Conversation
Signed-off-by: Markov Ilya <markovilya197@gmail.com>
Signed-off-by: Markov Ilya <markovilya197@gmail.com>
Head branch was pushed to by a user without write access
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
|
Hi @ilmarkov, the pre-commit checks have failed. Please run: uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-filesThen, commit the changes and push to your branch. For future commits, |
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
|
Documentation preview: https://vllm--45013.org.readthedocs.build/en/45013/ |
There was a problem hiding this comment.
Shouldn't this be removed?
There was a problem hiding this comment.
Yes, missed during the merge. Thanks!
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
itayalroy
left a comment
There was a problem hiding this comment.
We’ve been investigating this internally in NIXL team, and I think you may also be missing an update to the EPLB group after scale-up/down, see this reference: rtourgeman@e0d4e1c.
Without it, EPLB after scale up will hit an assert and no longer run
|
@itayalroy The fix makes sense. We also need to update the tests to actually run EPLB, as they didn't catch the issue due to probably too high step_interval. |
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
|
Update of the tests reveals a bunch of bugs in elastic EP + async EPLB. For example, we don't track if the async EPLB transfer is finished and we can start scaling up/down while the transfers are still in fly. We probably need to update |
Right, elastic EP suppressed the sync EPLB by flipping some boolean, relying on that EPLB runs on the same thread as the elastic state machine. This is no longer true with async EPLB, I think we might need "suppress eplb" to flush and stop the EPLB thread, and "resume eplb" to restart it |
|
This pull request has merge conflicts that must be resolved before it can be |
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
|
@itayalroy Now async EPLB + elastic EP seems to work. PTAL |
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
|
@itayalroy The comments are addressed. Please, give another round of review. |
|
@ilmarkov thanks for the work! :) |
Signed-off-by: Markov Ilya <markovilya197@gmail.com> Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com>
Signed-off-by: Markov Ilya <markovilya197@gmail.com> Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>
Purpose
Enable
NixlEplbCommunicatorfor elastic EP, allowing async EPLB during elastic scale-up/down.Key changes:
exchange to first
set_transfer_context()to avoid deadlocks.monitored_barrierwithall_reduce+wait(timeout)for stateless groups in Nixl EPLB communicator.Test Plan
tests/distributed/test_elastic_ep.py. Test params tuned (step_interval=10,window_size=5) to actually trigger EPLB.tests/distributed/test_eplb_execute.py— newtest_nixl_deferred_initverifying the deferred-init path end-to-end.Test Result
EPLB and elastic EP tests pass (both sync and async variants).
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model.