[EPLB] Enable nixl eplb communicator for elastic ep #45013

Merged
pavanimajety merged 14 commits into vllm-project:main from neuralmagic:imarkov/eplb-nixl-with-elastic-ep
Jun 22, 2026

@ilmarkov ilmarkov commented Jun 9, 2026

Contributor

Purpose

Enable NixlEplbCommunicator for elastic EP, allowing async EPLB during elastic scale-up/down.

Key changes:

  1. Drain before scale. Consume all pending async transfers before the groups are replaced, waiting for every layer to finish rather than stopping midway, to avoid cross-rank and cross-thread races.
  2. Start the async thread on new workers after scale-up.
  3. Deferred NIXL remote setup. Postpone the collective metadata exchange to the first set_transfer_context() call to avoid deadlocks.
  4. Replace monitored_barrier with all_reduce + wait(timeout) for stateless groups in the NIXL EPLB communicator.
  5. Prefer NIXL in auto-selection for all EP modes.
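
The deferred remote setup in item 3 resembles a lazy-initialization pattern. The following is a minimal sketch under that assumption, not the code from this PR: `DeferredCommunicator`, `exchange_metadata`, and `remote_ready` are hypothetical names chosen for illustration, and only `set_transfer_context()` is taken from the description above.

```python
from typing import Callable, Optional


class DeferredCommunicator:
    """Hypothetical stand-in for a communicator with deferred remote setup."""

    def __init__(self, exchange_metadata: Callable[[], dict]) -> None:
        # Store the collective step instead of running it in the constructor,
        # so construction never blocks on peers that are not ready yet.
        self._exchange_metadata = exchange_metadata
        self._remote_metadata: Optional[dict] = None
        self.context: Optional[object] = None

    @property
    def remote_ready(self) -> bool:
        return self._remote_metadata is not None

    def set_transfer_context(self, context: object) -> None:
        # The first call performs the postponed exchange; later calls reuse it.
        if self._remote_metadata is None:
            self._remote_metadata = self._exchange_metadata()
        self.context = context


calls = []


def fake_exchange() -> dict:
    # Placeholder for the collective metadata exchange (assumption).
    calls.append("exchange")
    return {"rank0": "meta0", "rank1": "meta1"}


comm = DeferredCommunicator(fake_exchange)
ready_before = comm.remote_ready      # construction did not trigger the exchange
comm.set_transfer_context("layer-0")  # triggers the exchange once
comm.set_transfer_context("layer-1")  # reuses the cached metadata
```

The point of the pattern is that every rank reaches the collective step at the same logical point (the first transfer) rather than at construction time, when some ranks may not exist yet.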

Test Plan

  • tests/distributed/test_elastic_ep.py: test params tuned (step_interval=10, window_size=5) to actually trigger EPLB.
  • tests/distributed/test_eplb_execute.py: new test_nixl_deferred_init verifying the deferred-init path end-to-end.
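
To illustrate why the interval matters for a short test, here is a rough sketch that assumes rebalancing fires once every `step_interval` steps. That trigger model, the 50-step run length, and the larger interval of 3000 are assumptions made for illustration; `window_size` is not modeled.

```python
def count_rebalances(total_steps: int, step_interval: int) -> int:
    """Count how often a counter-based trigger fires within total_steps steps."""
    fired = 0
    steps_since_last = 0
    for _ in range(total_steps):
        steps_since_last += 1
        if steps_since_last >= step_interval:
            fired += 1
            steps_since_last = 0
    return fired


short_run = 50  # hypothetical number of engine steps in a test
with_tuned_interval = count_rebalances(short_run, step_interval=10)
with_large_interval = count_rebalances(short_run, step_interval=3000)
```

Under these assumptions the tuned interval fires several times in a short run, while the large interval never fires, so a test using it would never exercise the rebalancing path.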

Test Result

EPLB and elastic EP tests pass (both sync and async variants).


ilmarkov added 2 commits June 9, 2026 11:38
Signed-off-by: Markov Ilya <markovilya197@gmail.com>
Signed-off-by: Markov Ilya <markovilya197@gmail.com>
@tlrmchlsmth tlrmchlsmth added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 9, 2026
@tlrmchlsmth tlrmchlsmth enabled auto-merge (squash) June 9, 2026 13:04
Signed-off-by: Markov Ilya <markovilya197@gmail.com>
auto-merge was automatically disabled June 10, 2026 13:51

Head branch was pushed to by a user without write access


mergify Bot commented Jun 10, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 10, 2026
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
@mergify mergify Bot removed the needs-rebase label Jun 11, 2026

mergify Bot commented Jun 11, 2026

Contributor

Hi @ilmarkov, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Markov Ilya added 2 commits June 11, 2026 11:16

mergify Bot commented Jun 11, 2026

Contributor
@mergify mergify Bot added the documentation Improvements or additions to documentation label Jun 11, 2026
Comment thread vllm/config/parallel.py Outdated

Contributor

Shouldn't this be removed?

@ilmarkov ilmarkov Jun 11, 2026

Contributor Author

Yes, missed during the merge. Thanks!

Signed-off-by: Markov Ilya <markovilya19@gmail.com>

@itayalroy itayalroy left a comment

Contributor

We’ve been investigating this internally within the NIXL team, and I think you may also be missing an update to the EPLB group after scale-up/down; see this reference: rtourgeman@e0d4e1c.

Without it, EPLB will hit an assert after scale-up and no longer run.

@ilmarkov

Contributor Author

@itayalroy The fix makes sense. We also need to update the tests so that they actually run EPLB; they didn't catch the issue, probably because step_interval was too high.

Signed-off-by: Markov Ilya <markovilya19@gmail.com>
@ilmarkov

Contributor Author

Updating the tests reveals a bunch of bugs in elastic EP + async EPLB. For example, we don't track whether the async EPLB transfer has finished, so we can start scaling up/down while transfers are still in flight. We probably need to update AsyncEplbLayerResult.
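
One way to picture the missing tracking is a counter of in-flight transfers that the scaling path waits on before proceeding. This is a minimal sketch under that assumption; `TransferTracker` and its methods are hypothetical and do not reflect how AsyncEplbLayerResult is actually implemented.

```python
import threading


class TransferTracker:
    """Hypothetical counter of in-flight async transfers."""

    def __init__(self) -> None:
        self._cond = threading.Condition()
        self._in_flight = 0

    def start(self) -> None:
        with self._cond:
            self._in_flight += 1

    def finish(self) -> None:
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        # Block until every started transfer has finished or the timeout expires.
        with self._cond:
            return self._cond.wait_for(lambda: self._in_flight == 0, timeout=timeout)


tracker = TransferTracker()
for _ in range(3):
    tracker.start()


def complete_transfers() -> None:
    for _ in range(3):
        tracker.finish()


worker = threading.Thread(target=complete_transfers)
drained_early = tracker.drain(timeout=0.05)  # nothing has finished yet
worker.start()
drained = tracker.drain(timeout=5.0)         # safe to scale once this is True
worker.join()
```

A scaling step would call `drain()` first and only replace groups once it returns `True`, which matches the drain-before-scale idea in the PR description.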

@itayalroy

Contributor

Update of the tests reveals a bunch of bugs in elastic EP + async EPLB. For example, we don't track if the async EPLB transfer is finished and we can start scaling up/down while the transfers are still in flight. We probably need to update AsyncEplbLayerResult.

Right, elastic EP suppressed sync EPLB by flipping a boolean, relying on EPLB running on the same thread as the elastic state machine. This is no longer true with async EPLB. I think we might need "suppress eplb" to flush and stop the EPLB thread, and "resume eplb" to restart it.
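
The suppress/resume idea above can be sketched with a background worker that is flushed and joined on suppress, then restarted on resume. This is an illustrative sketch only; `AsyncRebalancer` and its methods are hypothetical names, not vLLM's API.

```python
import queue
import threading
from typing import List, Optional


class AsyncRebalancer:
    """Hypothetical background worker standing in for the async EPLB thread."""

    _STOP = object()

    def __init__(self) -> None:
        self._jobs: queue.Queue = queue.Queue()
        self._thread: Optional[threading.Thread] = None
        self.done: List[str] = []

    def _run(self) -> None:
        while True:
            job = self._jobs.get()
            if job is self._STOP:
                return
            self.done.append(job)  # stand-in for one layer's weight transfer

    def resume(self) -> None:
        if self._thread is None:
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

    def submit(self, job: str) -> None:
        self._jobs.put(job)

    def suppress(self) -> None:
        # The sentinel is queued behind pending jobs, so they are flushed first.
        if self._thread is not None:
            self._jobs.put(self._STOP)
            self._thread.join()
            self._thread = None

    @property
    def running(self) -> bool:
        return self._thread is not None


eplb = AsyncRebalancer()
eplb.resume()
eplb.submit("layer-0")
eplb.submit("layer-1")
eplb.suppress()            # returns only after both jobs ran and the thread exited
flushed = list(eplb.done)
eplb.resume()              # e.g. once scaling has finished
eplb.submit("layer-2")
eplb.suppress()
```

Because `suppress()` joins the thread, the caller knows no transfer is running when it returns, which a boolean flag cannot guarantee once the work runs on a separate thread.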


mergify Bot commented Jun 15, 2026

Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @ilmarkov.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label Jun 15, 2026
Markov Ilya added 2 commits June 15, 2026 09:38
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
@mergify mergify Bot removed the needs-rebase label Jun 15, 2026
@ilmarkov

Contributor Author

@itayalroy Now async EPLB + elastic EP seems to work. PTAL

@itayalroy itayalroy left a comment

Contributor

Solution looks good; left a few comments regarding the implementation.

Comment thread vllm/distributed/elastic_ep/elastic_execute.py Outdated
Comment thread vllm/distributed/elastic_ep/elastic_execute.py Outdated
Comment thread vllm/distributed/elastic_ep/elastic_execute.py Outdated
@ilmarkov

Contributor Author

@itayalroy The comments are addressed. Please give it another round of review.

Comment thread tests/distributed/test_elastic_ep.py
Comment thread tests/distributed/test_elastic_ep.py
@itayalroy

Contributor

@ilmarkov thanks for the work! :)

@pavanimajety pavanimajety merged commit ac61458 into vllm-project:main Jun 22, 2026
85 of 86 checks passed
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: Markov Ilya <markovilya197@gmail.com>
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Co-authored-by: Markov Ilya <markovilya19@gmail.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
Signed-off-by: Markov Ilya <markovilya197@gmail.com>
Signed-off-by: Markov Ilya <markovilya19@gmail.com>
Co-authored-by: Markov Ilya <markovilya19@gmail.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>

Labels

documentation (Improvements or additions to documentation)
kv-connector
ready (ONLY add when PR is ready to merge/full CI is needed)

5 participants