[Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup by vadiklyutiy · Pull Request #42585 · vllm-project/vllm

vadiklyutiy · 2026-05-14T01:36:06Z

Problem

Running multi-node DP on 2×(4×GB300) (data_parallel_size=8, data_parallel_size_local=4, api_server_count=8, NVFP4 MoE), 5 of 79 startups failed with the same error from ApiServer_0:

(ApiServer_0 pid=...) zmq.error.ZMQError: Address already in use (addr='tcp://<master>:39381')
...
RuntimeError: Process ApiServer_0 (PID: ...) died with exit code 1

Root cause

The rank-0 parent pre-picks ephemeral TCP ports via bind(("", 0)) → close() in get_open_port() and only later spawns each API server child, which does the real ZMQ bind — typically 5–25 s later after model config and scheduler setup. In that window any other process on the host can claim the freed port. The most likely culprit is vLLM itself: it allocates ~30+ TCP ports in the same startup burst (per-API-server inputs/outputs, DP-master pool, handshake, coordinator, stats), and the kernel can hand the same ephemeral port number back to two different get_open_port() calls inside the same parent before either child has had a chance to bind. Stale processes from prior crashed runs amplify the risk (the failing log also shows leaked-semaphore / leaked-shared-memory warnings at shutdown).

Fix

Apply the same bind-in-child + report-back-via-Pipe pattern that DPCoordinator already uses. The parent emits tcp://host:0 placeholders; each ApiServer_i binds its ROUTER/PULL with port 0 (kernel atomically picks a free port), reads the bound endpoints via getsockopt(zmq.LAST_ENDPOINT), and reports them to the parent over a per-child multiprocessing.Pipe. The parent collects the actual addresses before the engine startup handshake propagates them to every engine on every DP node. No protocol changes; opt-in everywhere; IPC paths and the single-client (engines-managed-by-this-client) path are unchanged.

Tests

Unit test

tests/entrypoints/test_api_server_process_manager.py:

End-to-end happy path with defer_addresses=True — 4 children, kernel-picked ports, addresses gathered and verified distinct & well-formed.
gather_actual_addresses() rejects use without defer_addresses=True.
Child exits before reporting → clear RuntimeError, no hang.

Stress testing

On a common case the bug fires ~1/100 runs (DP=8, DP-local=4, 2 nodes) — too rare for reliable verification.

To make it reproducible I ran each compute node with a small "port squatter" that pre-binds every port in /proc/sys/net/ipv4/ip_local_port_range except a small window at the top, leaving _get_open_port() only ~100 ports to choose from. Under that pressure the TOCTOU race fires almost every run.

Squatter (started in background on each node before vllm serve):

import os, socket, sys, time, signal
free_top = int(sys.argv[1])
with open("/proc/sys/net/ipv4/ip_local_port_range") as f:
    lo, hi = map(int, f.read().split())
squat_hi = hi - free_top + 1
held = []
for p in range(lo, squat_hi):
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.bind(("0.0.0.0", p)); s.listen(1)
        held.append(s)
    except OSError:
        pass
print(f"[squatter] holding {len(held)} ports", flush=True)
signal.signal(signal.SIGTERM, lambda *a: sys.exit(0))
while True: time.sleep(3600)

Driver per node:

ulimit -n 65535
python3 squatter.py 100 &        # leave only 100 ports free
exec vllm serve <args>

Results (same config, 2 nodes, DP=8, DP-local=4, squatter at free_top=100):

build	passed	failed
`main`	1	27
this PR	50	0

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

gemini-code-assist

Code Review

This pull request implements a deferred ZMQ port assignment mechanism where child processes bind to kernel-assigned ports and report them back to the parent via pipes. This avoids potential race conditions and port conflicts during initialization. The changes include updates to the APIServerProcessManager to manage these pipes and gather addresses, as well as integration into the CLI and engine client. Review feedback highlights the need for explicit resource management of the multiprocessing pipes, specifically recommending that they be closed immediately after use and during the manager's shutdown to prevent file descriptor leaks.

vadiklyutiy · 2026-05-14T14:26:41Z

/gemini review

gemini-code-assist

Code Review

This pull request implements a mechanism for API servers to defer ZMQ port assignment to the kernel, allowing them to bind to available ports dynamically and report these addresses back to the parent process via pipes. Key changes include the addition of the gather_actual_addresses method to APIServerProcessManager, logic in CoreClient to report bound endpoints, and integration into the CLI server startup process. Feedback points out a type hint inconsistency in client_addresses that requires a type ignore, suggesting a future refactor to dict[str, Any] to restore type safety.

gemini-code-assist · 2026-05-14T14:33:06Z

+                actual_address_pipe: Connection | None = client_addresses.get(  # type: ignore[assignment]
+                    "actual_address_pipe"
+                )


The use of # type: ignore[assignment] here is necessary because client_addresses is currently type-hinted as dict[str, str] | None in the __init__ signature, but it now contains a multiprocessing.connection.Connection object. While fixing the signature is outside the scope of this diff hunk, the current annotation is technically incorrect and forces this suppression. Consider updating the type hint of client_addresses to dict[str, Any] | None in a future refactor to restore type safety.

vadiklyutiy · 2026-05-14T22:33:20Z

I made stress testing and added corresponding section in PR description.

Before run, grabbed free ports and left ~100 free. Results:

build	passed	failed
`main`	1	27
this PR	50	0

vadiklyutiy · 2026-05-15T00:29:40Z

Just found that #40596 does the same

vadiklyutiy · 2026-05-17T16:00:24Z

and #41983 do the same

mergify · 2026-05-23T10:19:26Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @vadiklyutiy.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

robertgshaw2-redhat · 2026-05-24T15:32:12Z

Im wondering, should we just update all cases to launch like this rather than maintaining multiple special cases

robertgshaw2-redhat · 2026-05-24T15:44:57Z

+        # Collect actual ports the children bound and mutate `addresses`
+        # in place; the engine handshake (run on `with` exit) ships these
+        # to every engine on every DP node.
+        actual_inputs, actual_outputs = api_server_manager.gather_actual_addresses()


where are these actually used in the main process?

I don't quite follow where wait_for_completion_or_failure actually runs these

I see, its because this is part of the launch_core_engines context manager and these results are used after the yield in the context manager

this seems very brittle and hard to follow

robertgshaw2-redhat · 2026-05-24T15:54:55Z

            output_addresses=addresses.outputs,
            stats_update_address=stats_update_address,
            tensor_queue=tensor_queue,
+            defer_addresses=True,


IIUC, the APIServerProcessManager is not used anywhere else besides here. Why do we need to support multiple cases?

Agree we could always do this and avoid the arg.

It's also used in tests but AFAICT it shouldn't affect those, or if it does they could be adjusted accordingly.

robertgshaw2-redhat

Thank you for this PR, this fix is long overdue and the fix is solid

In terms of the details of this PR, we are maintaining an "old way" and a "new way". I think this makes the code hard to follow. Why can't we just get rid of the "old way"?

APIServerProcessManager is only used for multi-api server (https://github.com/search?q=repo%3Avllm-project%2Fvllm+APIServerProcessManager&type=code), we should only have once case
launch_core_engines is only used in 2 places (https://github.com/search?q=repo%3Avllm-project%2Fvllm%20launch_core_engines&type=code) --- this should probably not be a context manager and I think we can just use the same mechanism in both cases

njhill · 2026-05-24T18:23:21Z

I agree this fix and PR review is long overdue, really sorry I am not keeping up with my backlog lately.

@robertgshaw2-redhat I agree the way that a context manager is used here is a bit opaque / hard to follow.

It's organized this way so that the engine process management and api server process management can be encapsulated in their own classes, while also allowing the startups / engine handshake to be interleaved in such a way that we parallelize things as much as possible and minimize the startup critical path time (so that we don't wait for the engines to be started before the api servers finish starting up). There's metatdata that needs to be wired in both directions.

The context manager allows this to be done "cleanly" in some respects but we can hopefully think of a better design.

njhill · 2026-05-24T18:25:45Z

Im wondering, should we just update all cases to launch like this rather than maintaining multiple special cases

One difference that the current abstractions cover is whether you are using vllm serve with possibly multiple API server procs, or AsyncLLM directly. The latter case encapsulates the core engine proc handling with AsyncLLM so that it can be used standalone (there are no api servers involved in this case), whereas the former wires things up externally.

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>

ehfd · 2026-05-27T08:17:39Z

Fix for the above issue must be cherry-picked to v0.22.0 too. @khluu @mgoin

vadiklyutiy · 2026-05-27T08:27:05Z

@ehfd

do we need for reproduction --hf-overrides ?
--max-model-len 1010000 looks a bit strange. Is it correct value?

ehfd · 2026-05-27T08:57:37Z

I think it is reproducible with just 262144 context size and without the override.

The override and the context size is based on Qwen's documentation.

vadiklyutiy · 2026-05-27T09:36:13Z

@ehfd fix - #43768

hang Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

ehfd · 2026-05-29T07:29:48Z

#43768 Fixes the issue completely. Thank you!

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Xiaoran Chen <xiaoran@fb.com>

…ti-API-server DP startup (vllm-project#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>

* [MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attention (#42787) Signed-off-by: Hua Huang <huah@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Docs] Fix stale version number in token_embed.md (#43488) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [Docs] Fix stale version number in token_classify.md (#43489) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [MoE] Migrate W4A8 CT to oracle kernel setup (#42680) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [Mooncake] Add metrics for MooncakeStoreConnector operations (#43392) * [ROCm][Critical] Fix the GDN import bug (#43486) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Revert "[Misc] add humming to dependencies" (#43492) * [Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Model Runner v2] Force v1 runner for tests (#43233) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [KV Connector] Keep MooncakeStore full hits block-aligned (#43494) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [kv_offload]: Add DSv4 support (#43142) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [ROCm][CI] Stabilize 400 error return code for invalid schema inputs (#43016) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Tuning script and configs for Triton Mamba SSU kernel (#43083) Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> * File system secondary tier implemented in python (#41735) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> * [Kernel] Add mhc_pre_big_fuse_with_norm_tilelang (#43474) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * fix: MoE model using shared routed experts crashes on AMD GPUs (#42373) Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> * [Docs] Reorganize offline inference docs. (#43552) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target (#40275) Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com> * [Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Doc] Add section on escalating stalled contributions (#43568) Signed-off-by: esmeetu <jasonailu87@gmail.com> * Reduce memory usage for granite_speech. (#42933) Signed-off-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Connector] Handle Mooncake finish after preemption (#43281) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> * [Misc] Print accuracy value for PD tests even on success (#43583) Signed-off-by: NickLucche <nlucches@redhat.com> * [Kernel] Remove NormGateLinear (#43554) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [XPU] Ensure RNG offset alignment with PyTorch requirements in XPU sampler (#43028) Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [LoRA] Add one shot triton kernel For MoE LoRA (#42290) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [DeepSeek V4] Move MegaMoE input prep kernel to nvidia/ops (#43632) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [KV Connector][Bugfix] MooncakeStore: don't double-apply Eagle prune in load_mask (#43516) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Connector] Propagate MooncakeStore load failures (#42788) Signed-off-by: Dao Le <Dao007forever@gmail.com> * [Bugfix] fix device mismatch in MiniCPM-o-4_5 resampler (#43194) Signed-off-by: Yan Ma <yan.ma@intel.com> * [Frontend] Split the offline inference APIs and utils. (#43553) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Bugfix][Model] Fix GPT2ForSequenceClassification sub-module prefix (#43579) Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [GDN] GDN Prefill kernel for SM100 (#43273) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032) Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> * Upgrade tpu-inference to v0.20.0 (#43394) * Add CuTe DSL sparse compressor support (#43584) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [chores][log] change registry log from `warning` to `debug` (#43045) Signed-off-by: Hank <hcc.mayday@gmail.com> * [Bugfix] Apply fc_norm in Eagle3DeepseekV2 combine_hidden_states (#43482) Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Transfer] Enable HMA by default for connectors that support it (#41847) Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> * [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args (#43303) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [Misc] Support interleaved custom image benchmark datasets (#43636) Signed-off-by: ThibaultCastells <thib.castells@icloud.com> * [Reasoning] [Bugfix] Reject invalid thinking_token_budget values (#43402) Signed-off-by: linzm1007 <linzm1007@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model] Use AutoWeightsLoader for InternLM2 (#38278) Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [XPU] Fix fused MoE LoRA kernel crash on XPU by using platform-agnos num_compute_units (#43646) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> * Fix CuPy runtime deps and restore humming (#43530) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Docs][ROCm] MoRI-IO Connector Usage Guide (#43603) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ROCm][CI] Extend ROCm quick reduce coverage (#40990) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162) * [MoE Refactor] Migrate ModelOptMxFp8FusedMoE to oracle (#42768) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [MoE Refactor] W4a8 int8 oracle (#42789) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [ROCm] Remove MegaMoE integration in deepseek v4 (#43629) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Add LM head quantization support for ModelOpt (#42124) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * [Doc] Add line limit to AGENTS.md (#43635) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> * [DSv4] Drop _get_compressed_kv_buffer in DeepseekCompressor (#43690) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CI] Soft-fail AMD entrypoints mirror tests (#43709) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Porting fuse_minimax_qk_norm to manual fusion (#43410) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [KV Connector] MooncakeStore: drop dead discard_partial_chunks parameter (#43627) Signed-off-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup (#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Add arm64 ci image (#41303) Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bugfix] Split attention groups by num_heads_q for spec-decode drafts (#43543) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Rust Frontend] Add reasoning/tool parser & renderer roundtrip tests (#43582) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [ROCm][CI] Fix ROCm multimodal Qwen2.5-VL activation compile and Phi4MM ragged image mask handling (#43647) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Perf] Optimize Fp8BlockScaledMMLinearKernel input_scale tensor using new_empty() (#43677) Signed-off-by: Xin Yang <xyangx@amazon.com> * [Attention] Make FlexAttention and FlashAttention use num-blocks first layouts (#42095) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [MLA][Attention] Add OOT MLA prefill backend registration mechanism (#43325) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> * [Deprecation] Deprecate functions as scheduled for v0.21.0 (#43358) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSv4] Refactor compressor & Fix ROCm compatibility (#43710) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Fix test_aot_compile for torch 2.12 (#43695) Signed-off-by: Angela Yi <yiangela7@gmail.com> * [KVConnector][Mooncake] Wire reset_cache cascade end-to-end (#42694) Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [ROCm][Perf] Expose AITER MoE sorting dispatch policy via env var (#39177) Signed-off-by: nholmber <nholmber@users.noreply.github.com> * [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Frontend] Add MiniCPM5 XML tool call parser (#43175) Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][GPT-OSS] Avoid repeated compile-time `cos_sin_cache.to(bf16)` casts in rotary path (#42833) Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> * [Doc] Add Ascend NPU tab to the quickstart installation guide (#43550) Signed-off-by: Aditya Singh <adisin650@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Rust Frontend] Align tool parser fallback behavior between streaming & non-streaming paths (#43662) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Docs] Fix MLA prefill backend default docs (#43697) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marlin-aligned W4A16 shapes (#43731) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43401) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [misc] Bump cutedsl version to 4.5.2 (#43745) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLINE=1 (#39155) Signed-off-by: Injae Ryou <injaeryou@gmail.com> * [Docs] Fix the duplicate doc icon issue (#43546) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * Fix early CUDA init (#43791) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL (#41751) Signed-off-by: jatseng-ai <jatseng@amd.com> * [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Quantization] Fix Humming RoutedExperts import (#43540) Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * Remove Transformers forward/backward compatibility tests (#43785) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * Validate against some config fields being set to 0 (#43794) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix][DFlash]allocate the proper number of lookahead slots (#43733) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> * Fix Qwen3-VL and Qwen3-omni-thinker accuracy degradation from deepstack inputs under torch.compile (#43617) Signed-off-by: Dakai An <dakaian108@gmail.com> * Add @AndreasKaratzas to CODEOWNERS (#43740) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix][Kernel] TRTLLM NVFP4 MoE chunking (#43599) Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> * [ModelRunnerV2][Hybrid model] Support kernel block size in hybrid model (#38831) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Rust Frontend] Introduce mock engine for benchmark baseline (#43469) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * Fix RunAI streamer tensor buffer reuse during weight loading (#43464) Signed-off-by: bbartels <benjamin@bartels.dev> * [MoE] Remove inplace fused experts mechanism (#43727) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [Misc][Rocm] Remove redundant `AiterUnifiedAttentionBackend` block size log (#43664) Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm][CI] Stabilize Cargo cache and pre-test image checks (#43815) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * fix: parse Qwen3 XML JSON arguments first (#43243) Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> * [Bugfix] Pass `routed_scaling_factor` to FlashInfer TRTLLM BF16 MoE (#43769) * [BugFix] Fix blocked reasoning parsing with MRV2 (#43808) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Bugfix][Frontend] streaming tool-call serializer drops first args chunk when name and args share a DeltaMessage (#42683) Signed-off-by: ignaciosica <mignacio.sica@gmail.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * minor docs: fix incorrect example path (#43830) Signed-off-by: JINO-ROHIT <find.jinorohit@gmail.com> * [ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc (#43679) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * change name of fs_python secondary tier to fs. (#43600) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> * [BugFix] Fix hard-coded timeout for multi-API-server startup (#43768) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Kernel] Marlin MoE: include SM 12.x in default arch list (#40923) Signed-off-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [DSV4] Remove AMD/XPU path in deepseek_v4/nvidia (#43829) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Restore `Literal` for `WeightTransferConfig.backend` (#43183) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix] Stream DeepSeek DSML tool-call argument deltas incrementally (#42879) Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][CI] Move workload from MI300 to MI325 (#43824) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feature] Add support for timed trace replay in `vllm bench serve` to replay Moonshot and Alibaba workload traces (#39795) Signed-off-by: Animesh Trivedi <Animesh.Trivedi@ibm.com> * [UX] Increase DP Coordinator startup timeout from 30s to 120s (#42343) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> * [Model][Bugfix] Rename weight_mapper to hf_to_vllm_mapper in LlamaNemotronVL pooling models (#43581) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: opencode <noreply@opencode.ai> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> * [Bugfix][ROCm] Fix Accuracy Drop in Sparse Indexer on gfx950 (#43781) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> * [Bugfix] Fix HyperCLOVAX CI failure after upstream removed remote code (#43860) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [CI] Auto-apply `rust` label to relevant PRs (#43866) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Feature] Add structured output and effort support to Anthropic Messages API (#42396) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * Log dummy DP step in iteration details (#41406) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> * [EC Connector] Add shutdown API to EC Connector. (#42423) Signed-off-by: omerpaz95 <omerpaz95@gmail.com> * Fix `OlmoHybridForCausalLM` not initialising (#43846) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [BUGFIX] Multimodal benchmark with MistralTokenizer (#42965) Signed-off-by: juliendenize <julien.denize@mistral.ai> Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> * [Perf] Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement (#43014) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Perf][KDA] Fuse gate softplus, chunk-local cumsum, and RCP_LN2 scaling (#43667) Signed-off-by: haojiangzheng <justineric096@gmail.com> Co-authored-by: haojiangzheng <justineric096@gmail.com> * Add token-offset based selective offload in OffloadConnector (#39983) Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com> Co-authored-by: Or Ozeri <or@ozery.com> * [Model Refactoring] Remove torch compile dependency in DSv4 (#43746) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Bugfix][ROCm] Resolve MoRI connector hangs at high concurrency (#40344) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [CPU] Migrate cpu_awq into awq_marlin (#43841) Signed-off-by: jiang1.li <jiang1.li@intel.com> * [Rust Frontend] Add `hy_v3` tool parser (#43872) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Rust Frontend] Reduce Gemma4 tool parser args scan complexity (#43850) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [rust] fix: aggregate `is_sleeping` and `reset_prefix_cache` across DP engines (#43429) Signed-off-by: Will.hou <1205157517@qq.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bug] Fix `tests/distributed/test_elastic_ep.py - assert False` (#43813) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Perf] Add do_not_specialize to Mamba SSD chunk kernels (#43803) Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com> Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com> * [Bugfix] Exclude Ray DP from #42585's deferred port allocation (#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> * [KV Offload] Rename `SecondaryTierManager.get_finished()` to `get_finished_jobs()` (#43870) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> * [ROCm][Perf] Support N=5 in wvSplitK skinny GEMM kernels for speculative decoding (#40687) Signed-off-by: Matthias Gehre <matthias.gehre@amd.com> * [XPU][MoE] Add WNA16 oracle backend for GPTQ sym-int4 (xpu_fused_moe) (#41426) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [ROCm] Bump ROCm to 7.2.3 (#43136) Signed-off-by: Micah Williamson <micah.williamson@amd.com> * Add Cosmos3 Reasoner model (#43356) Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: MaciejBalaNV <mbala@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Roger Wang <hey@rogerw.io> * [Rust Frontend] Optimize multimodal prompt expansion (#43670) Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com> * Allow native KV cache dtype in Triton cache update (#43330) Signed-off-by: Michael Gschwind <mgschwind@nvidia.com> Co-authored-by: Michael Gschwind <mgschwind@nvidia.com> * [Attention][AMD] Standardize kv layout to blocks first for AMD (#43660) Signed-off-by: NickLucche <nlucches@redhat.com> * [ROCm] Enable the aiter top-k/top-p sampler by default (#43331) Signed-off-by: John Qin <yanyuan.qin@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [MM][CG] Avoid over-padding Qwen2.5-VL encoder cudagraph window metadata (#42796) Signed-off-by: Hua Huang <huah@nvidia.com> * Deprecate `JAISLMHeadModel` (#43784) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Feat] Add support for per GPU worker RDMA NIC selection (#42083) Signed-off-by: Raj Joshi <rajjoshi@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Core] Cleanup KVConnector handling with PP + fix MRV2 (#43732) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Offload] Add per-request offloading policy via `on_new_request` lifecycle hook (#43205) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model Refactoring] Remove unncessary torch op registration for DSv4 (#43891) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Spec Decode] Allow causal DFlash (#43445) * Refactor output filename handling in ci-fetch-log.sh (#43901) Signed-off-by: Michael Goin <mgoin64@gmail.com> * [AMD][CI][BugFix] Fix Distributed Compile Unit Tests (2xH100-2xMI300) group (#43120) Signed-off-by: Randall Smith <Randall.Smith@amd.com> * fix(frontend): Add multimodal placeholders to Gemma4 tool message template (#41459) Signed-off-by: Harshal Janjani <harshaljanjani@gmail.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> * [CI] Enable prefix caching in BFCL benchmark (#43925) Signed-off-by: Yifan Zong <yzong@redhat.com> * [Model]Support Step-3.7-Flash (#43859) Signed-off-by: luotingdan <luotingdan@stepfun.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Co-authored-by: luotingdan <luotingdan@stepfun.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Yu Huang <yuhuang@nvidia.com> Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai> * [Rust Frontend] Add `/version` endpoint using engine-reported value (#43854) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Misc][NUMA] Auto-bind to PCT priority cores on DGX B300 + widen EngineCore across shard NUMA nodes (#43270) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Co-authored-by: Cursor <noreply@cursor.com> * [DSv4] Move mHC tilelang kernels & Don't use CustomOP in dsv4/nvidia (#43905) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [feat] add GlmgaProcessor specific logits in `glm4_1v.py` (#43575) Signed-off-by: JaredforReal <w13431838023@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <Isotr0py@outlook.com> * Adjust design around encoder_cudagraph_forward (#42288) Signed-off-by: Weida Hong <wdhongtw@google.com> * [XPU] add scale transpose to prepare_fp8_moe_layer_for_xpu and bump up kernels (#43277) Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [kv_offload] Skip decode-phase blocks in CPU offload (#43797) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> * [Refactor] Remove dead code (#43234) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [9/n] Migrate attention and cache kernels to torch stable ABI (continued) (#43717) Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [CI] Separate non-root smoke tests from image build step (#43712) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [XPU] add gelu_tanh to xpu moe backend supported activations (#42822) Signed-off-by: yintong-lu <yintong.lu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CPU Backend] CPU top-k and top-p sampling kernels using Triton (#43633) Signed-off-by: Li, Tianmu <tianmu.li@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm][DSv4] Remove device pipeline stall in sparse attention (#43898) Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * [Frontend]Responses API supports chat_template_kwargs (#43761) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * [ROCm][CI] Fix AITER unified attention for encoder-decoder cross-attention (#43945) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [XPU] fix xpu install document triton-xpu version (#43947) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [CI][ROCm] Don't skip MoRI-IO Connector tests (#43703) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [XPU] support MTP of gdn attention (#43565) Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CI] Nixl+SimpleCPUOffloadingConnector unit tests (#43871) Signed-off-by: NickLucche <nlucches@redhat.com> * [Bugfix] Fix Step3 pipeline parallel KeyError for residual tensor (#37622) Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> * [Kernel][ROCm] Native W4A16 kernel for AMD RDNA3 (gfx1100) — fp16 + bf16 (#41394) Signed-off-by: JartX <sagformas@epdcenter.es> * [Bugfix] [ROCm] [DSV4] Fix AITER MXFP4 MoE weight loading and shuffle… (#42595) Co-authored-by: MHYangAMD <MHYangAMD@users.noreply.github.com> * [ROCm][Perf] DSv3.2 MI355X TP4 decode-step orchestration cleanup (3 micro-opts) (#42982) Signed-off-by: Frida Andersson <fanderss@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Bugfix] Corrupted MLA + linear attention (#43961) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * Skip docs build if PR doesn't affect docs (#43972) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix][CPU] Remove invalid extra deps (#43977) Signed-off-by: jiang1.li <jiang1.li@intel.com> * Add vLLM library info to Hugging Face Hub requests (#43857) Signed-off-by: Wauplin <lucainp@gmail.com> Signed-off-by: Lucain Pouget <lucain@huggingface.co> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: clarify ITL acronym in optimization docs (#43922) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * [Misc] added unit tests for the core pooling methods (#43818) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Bugfix] Disable allreduce_rms_fusion when pipeline_parallel_size > 1 (#43616) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude <noreply@anthropic.com> * [MoE Refactor] WNA16 MoE backend selection into oracle module (#42553) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Claude <noreply@anthropic.com> * [EPLB] Make async EPLB default (#43219) Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Bugfix] Use storage_block_size in KV cache reshape for compressed specs (DeepSeek V4) (#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [Bugfix] Fix Ray placement group allocation with grouped nodes (#43998) Signed-off-by: <conway.zhu@cohere.com> Signed-off-by: root <conway.zhu@cohere.com> * [Bug] Fix torch device issue for MOE permute (#44005) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [CI] Make Model Executor test hangs fail fast with a traceback (#43971) Signed-off-by: khluu <khluu000@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [CI] Remove redundant test_chat_with_tool_reasoning.py (#44011) Signed-off-by: sfeng33 <4florafeng@gmail.com> * Add @khluu to CODEOWNERS (#44019) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Feature] SSL support for dp supervisor (#43688) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Metrics] Exclude KV transfer tokens from iteration_tokens_total (#43346) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Fronten] Clean up stop_token_ids override for Harmony (#44009) Signed-off-by: Yifan Zong <yzong@redhat.com> * [MoE Refactor] Migrate MoeWNA16Method quantization to MK oracle (#42647) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Claude <noreply@anthropic.com> * [MoE Refactor] Remove supports_expert_map (#43108) Signed-off-by: Bill Nell <bnell@redhat.com> * [CI] Remove duplicate Harmony test coverage (#44023) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [CI] Fix smoke test step key to bypass block gate (#43974) Signed-off-by: khluu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Revert "[MoE Refactor] Migrate MoeWNA16Method quantization to MK orac… (#44033) Signed-off-by: Bill Nell <bnell@redhat.com> * [PERF]MiniMax-M2 gate kernel (#38445) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com> Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com> * offload prompt_embeds decode in render_prompts_async to avoid blocking (#43792) Signed-off-by: Gagan Dhakrey <gagandhakrey@gmail.com> * [Refactor] Remove dead current_tool_name_sent assignments from tool parsers (#43997) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [ROCm][CI] Fix failure in the Phi3V pooling test (#44028) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] cmake: support PYTORCH_FOUND_HIP for torch 2.13 native HIP language support (#43881) Signed-off-by: nemanjaudovic <nudovic@amd.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [BugFix][Platform] Fix import vllm.platforms.rocm error on non-CUDA test_gpt_oss.py (#43571) Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Bugfix] Fix RMSNorm kernels to multiply in weight's native dtype (#42379) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm] Add attention sink support to AITer flash attention backend (#43817) Signed-off-by: Xiaoran Chen <xiaoran@fb.com> Co-authored-by: Xiaoran Chen <xiaoran@fb.com> * [Governance] Add @BugenZhao as Rust frontend code owner (#44047) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Bug] Fix gemma4 MTP IMA issue when TP>1, `CUDA error: an illegal memory access was encountered` (#43909) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [MRV2] Support breakable CUDA graph (#44050) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CPU][Zen] Route W8A8 and W4A16 linear inference through zentorch on AMD Zen CPUs (#41813) Signed-off-by: R <Ganesh.R@amd.com> Signed-off-by: Harshal Adhav <harshal.adhav@amd.com> Signed-off-by: Aakar Dwivedi <aadwived@amd.com> Co-authored-by: R <Ganesh.R@amd.com> Co-authored-by: Harshal Adhav <harshal.adhav@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> * [CI/Build] Enable Step3p7ForConditionalGeneration testing (#43956) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * docs: fix MLA attention docstring examples (#44118) Co-authored-by: nightcityblade <nightcityblade@gmail.com> * [Misc] Use VLLMValidationError consistently in chat completion and completion protocol validators (#36254) Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com> * [MRV2] Remove Eagle's dedicated CUDA graph pool (#44078) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> * [BugFix] Fix `_has_module` to verify native deps via trial import (#44035) Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: esmeetu <jasonailu87@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Docs] Replace broken video url in examples (#44159) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [CPU][RISC-V] Add missing RVV cpu_types helpers for WNA16 (#42730) Signed-off-by: wcy <233313160abc@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> * fix: glm5.1 pp model loading (#42944) Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> * [Frontend] Resettle generative scoring entrypoint. (#44153) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> * [Rust Frontend] Add InternLM2 tool parser (#43481) Signed-off-by: Will.hou <1205157517@qq.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] fix wrong partial_rotary_factor calculation for bailing_moe model. (#43770) Signed-off-by: zzt <zengzetang.zzt@antgroup.com> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> * [XPU][CI] Fix test_audio_in_video flake by using module-scoped server fixture (#44146) Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> * [Perf] Optimize cutlass fp8 scaled mm bypassing padding, 20% kernel performance improvement (#43706) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Feature] Add support for JetBrains' Mellum v2 code generation model (#43992) Signed-off-by: Madeesh Kannan <madeeswaran.kannan@jetbrains.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [Kernel][DSv4] Optimize sparse FP8 compressor kernels (#44161) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Fix and stabilize EAGLE3 acceptance tests (#41294) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> * [Rust Frontend] Support streaming `generate` endpoint (#43779) Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Frontend][Core] Add sparse NCCL weight transfer support for in-place updates (#40096) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [BugFix][CI] Fix added `_has_module` tests (#44248) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Test][BugFix] Fix double-BOS in PD+specdec acceptance test (#44234) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [DSV4] Remove unncessary classes & functions (#44246) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [ROCm][CI] Skip unbacked dynamic shapes tests on PyTorch < 2.11 (#44256) Signed-off-by: JartX <sagformas@epdcenter.es> * [DSV4] Refactor RoPE initialization (#44262) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Bugfix][Mooncake] Release GPU pin on failed store in MooncakeStoreConnector (#43742) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [ROCm] Upgrade AITER to v0.1.13.post1 (#44265) Signed-off-by: Micah Williamson <micah.williamson@amd.com> * [Bugfix][CI] Normalize NIXL connector CUDA wheel installs (#44266) Signed-off-by: Alec Flowers <aflowers@nvidia.com> * [Refactor] Move unstreamed tool-arg flush from serving layer to parser (#44017) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [CI] Stabilize OpenAI schema fuzzing for malformed structural tags (#44131) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [BugFix] Fix TypeError in MiniCPM-O audio feature unpadding (#38053) Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com> Signed-off-by: wjinxu <1299461899@qq.com> Signed-off-by: Kc Balusu <kcbalusu@users.noreply.github.com> Co-authored-by: wjinxu <1299461899@qq.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Kc Balusu <kcbalusu@users.noreply.github.com> * [BugFix][kv_offload]: Prevent offloading stale sliding window blocks (#42959) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [XPU][Bugfix] Fix per_token_group_fp8_quant missing dummy args on XPU (#43930) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [MM][CG] Profile encoder CUDA graph pool memory (#41714) Signed-off-by: JooHo Lee <jooho414@gmail.com> * [Bugfix] Convert Gemma4-MM ViT linear layers to vllm native impl (#43798) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com> Co-authored-by: B-201 <Joy25810@foxmail.com> * [Model Runner V2] Support zeroing freshly allocated KV blocks for hybrid + fp8 KVCache (#43990) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> * [Model Runner V2] Use actual batch max_seq_len for attn metadata (#43991) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> * [Refactor] Unify reasoning + tool-call parsing behind Parser.parse() (#44267) Signed-off-by: sfeng33 <4florafeng@gmail.com> --------- Signed-off-by: Hua Huang <huah@nvidia.com> Signed-off-by: holegots <ikun3.1415927@gmail.com> Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Yihuki <wangbovbvb@gmail.com> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Yan Ma <yan.ma@intel.com> Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Hank <hcc.mayday@gmail.com> Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: ThibaultCastells <thib.castells@icloud.com> Signed-off-by: linzm1007 <linzm1007@126.com> Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Kevin Luu <kevin@inferact.ai> Signed-off-by: Zhewen Li <zhewen@inferact.ai> Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Angela Yi <yiangela7@gmail.com> Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> Signed-off-by: Aditya Singh <adisin650@gmail.com> Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: Injae Ryou <injaeryou@gmail.com> Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: jatseng-ai <jatseng@amd.com> Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Dakai An <dakaian108@gmail.com> Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Signed-off-by: bbartels <benjamin@bartels.dev> Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Signed-off-by: ignaciosica <mignacio.sica@gmail.com> Signed-off-by: JINO-ROHIT <find.jinorohit@gmail.com> Signed-off-by: Tony Liu <tonyliu0512@gmail.com> Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Signed-off-by: Animesh Trivedi <Animesh.Trivedi@ibm.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Signed-off-by: omerpaz95 <omerpaz95@gmail.com> Signed-off-by: juliendenize <julien.denize@mistral.ai> Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> Signed-off-by: haojiangzheng <justineric096@gmail.com> Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: Will.hou <1205157517@qq.com> Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com> Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Signed-off-by: Matthias Gehre <matthias.gehre@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: MaciejBalaNV <mbala@nvidia.com> Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com> Signed-off-by: Michael Gschwind <mgschwind@nvidia.com> Signed-off-by: John Qin <yanyuan.qin@amd.com> Signed-off-by: Raj Joshi <rajjoshi@redhat.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Randall Smith <Randall.Smith@amd.com> Signed-off-by: Harshal Janjani <harshaljanjani@gmail.com> Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: luotingdan <luotingdan@stepfun.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: JaredforReal <w13431838023@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Weida Hong <wdhongtw@google.com> Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Signed-off-by: yintong-lu <yintong.lu@intel.com> Signed-off-by: Li, Tianmu <tianmu.li@intel.com> Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: JartX <sagformas@epdcenter.es> Signed-off-by: Frida Andersson <fanderss@amd.com> Signed-off-by: Wauplin <lucainp@gmail.com> Signed-off-by: Lucain Pouget <lucain@huggingface.co> Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Signed-off-by: zixi-qi <zixi@inferact.ai> Signed-off-by: Markov Ilya <markovilya19@gmail.com> Signed-off-by: <conway.zhu@cohere.com> Signed-off-by: root <conway.zhu@cohere.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com> Signed-off-by: Gagan Dhakrey <gagandhakrey@gmail.com> Signed-off-by: nemanjaudovic <nudovic@amd.com> Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com> Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Xiaoran Chen <xiaoran@fb.com> Signed-off-by: R <Ganesh.R@amd.com> Signed-off-by: Harshal Adhav <harshal.adhav@amd.com> Signed-off-by: Aakar Dwivedi <aadwived@amd.com> Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: wcy <233313160abc@gmail.com> Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> Signed-off-by: zzt <zengzetang.zzt@antgroup.com> Signed-off-by: Madeesh Kannan <madeeswaran.kannan@jetbrains.com> Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai> Signed-off-by: Alec Flowers <aflowers@nvidia.com> Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com> Signed-off-by: wjinxu <1299461899@qq.com> Signed-off-by: Kc Balusu <kcbalusu@users.noreply.github.com> Signed-off-by: JooHo Lee <jooho414@gmail.com> Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Hynek Kydlicek <kydlicek.hynek@gmail.com> Co-authored-by: Hua Huang <huangh1994@outlook.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Holegots <fuergaosi@gmail.com> Co-authored-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Dao007forever <dao007forever@gmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Andreas Karatzas <akaratza@amd.com> Co-authored-by: danisereb <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> Co-authored-by: Rotem Shavitt <rshavitt@gmail.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: weizhoublue <45163302+weizhoublue@users.noreply.github.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nguyễn Thế Duy <dtnguyen@nvidia.com> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Roy Wang <jasonailu87@gmail.com> Co-authored-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Zhewen Li <zhewenli@meta.com> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: Huanyu Yang <20242081160@mail.dlut.edu.cn> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com> Co-authored-by: Sting Lin <sting.lin@cienet.com> Co-authored-by: Jie Fang <jief@nvidia.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Hank_ <37239608+ILikeIneine@users.noreply.github.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Ethan Feng <ethan.fengch@gmail.com> Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: Thibault Castells <38716394+ThibaultCastells@users.noreply.github.com> Co-authored-by: linzm1007 <96732179+linzm1007@users.noreply.github.com> Co-authored-by: Javier De Jesus <javier.dejesusj9@gmail.com> Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Luciano Martins <22145370+lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Angela Yi <yiangela7@gmail.com> Co-authored-by: aoshen02 <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Nico Holmberg <nico.holmberg@amd.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: zhangtao2-1 <478679312@qq.com> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: akii96 <aakif.nawaz@amd.com> Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com> Co-authored-by: Ashwin Giridharan <ashwing@users.noreply.github.com> Co-authored-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Chunyang Wen <chunyang.wen@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: jatseng-ai <jatseng@amd.com> Co-authored-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Co-authored-by: Dakai An <77474977+andakai@users.noreply.github.com> Co-authored-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Benjamin Bartels <benjamin@bartels.dev> Co-authored-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: Ignacio Sica <mignacio.sica@gmail.com> Co-authored-by: JINO ROHIT <find.jinorohit@gmail.com> Co-authored-by: tonyliu312 <56969792@qq.com> Co-authored-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: jack <QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: Animesh Trivedi <animesh.trivedi@gmail.com> Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: opencode <noreply@opencode.ai> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: omerpaz95 <73347585+omerpaz95@users.noreply.github.com> Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> Co-authored-by: zexplorerhj <zhjoneson@163.com> Co-authored-by: haojiangzheng <justineric096@gmail.com> Co-authored-by: Angelo Ruocco <angeloruocco90@gmail.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Will.hou <1205157517@qq.com> Co-authored-by: Majid <mjtaheri68@gmail.com> Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com> Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Matthias Gehre <matthias.gehre@amd.com> Co-authored-by: Jason Elie Bou Kheir <5115126+jasonboukheir@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: MaciejBalaNV <mbala@nvidia.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Chao-Ju Chen <ricky.chen@infinirc.com> Co-authored-by: Mike G <180722391+mikekg@users.noreply.github.com> Co-authored-by: Michael Gschwind <mgschwind@nvidia.com> Co-authored-by: JohnQinAMD <yanyuan.qin@amd.com> Co-authored-by: Hua Huang <huah@nvidia.com> Co-authored-by: Raj Joshi <rajjoshi@g.harvard.edu> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: rasmith <Randall.Smith@amd.com> Co-authored-by: Harshal Janjani <harshaljanjani@gmail.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: yzong-rh <yzong@redhat.com> Co-authored-by: ltd0924 <32387785+ltd0924@users.noreply.github.com> Co-authored-by: luotingdan <luotingdan@stepfun.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Yu Huang <yuhuang@nvidia.com> Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai> Co-authored-by: Cursor <noreply@cursor.com> Co-authored-by: Jared Wen <w13431838023@gmail.com> Co-authored-by: Weida Hong <wdhongtw@google.com> Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com> Co-authored-by: Itay Etelis <92247226+Etelis@users.noreply.github.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Yintong Lu <yintong.lu@intel.com> Co-authored-by: Tianmu Li <tianmu.li@intel.com> Co-authored-by: Joaquín Mondéjar <111321569+JMonde@users.noreply.github.com> Co-authored-by: JartX <sagformas@epdcenter.es> Co-authored-by: MHYangAMD <meng-hsuan.yang@amd.com> Co-authored-by: MHYangAMD <MHYangAMD@users.noreply.github.com> Co-authored-by: frida-andersson <fanderss@amd.com> Co-authored-by: Lucain <lucainp@gmail.com> Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com> Co-authored-by: Ilya Markov <markovilya197@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: czhu-cohere <conway.zhu@cohere.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com> Co-authored-by: Gagan Dhakrey <59848316+gagandhakrey@users.noreply.github.com> Co-authored-by: nemanjaudovic <152565955+nemanjaudovic@users.noreply.github.com> Co-authored-by: Liangliang Ma <liangliang.ma@intel.com> Co-authored-by: Lanze Liu <86434077+liulanze@users.noreply.github.com> Co-authored-by: Xiaoran <claire.rrchen@hotmail.com> Co-authored-by: Xiaoran Chen <xiaoran@fb.com> Co-authored-by: Aakar Dwivedi <82587125+aadwived@users.noreply.github.com> Co-authored-by: R <Ganesh.R@amd.com> Co-authored-by: Harshal Adhav <harshal.adhav@amd.com> Co-authored-by: nightcityblade <jackchen@haloailabs.com> Co-authored-by: nightcityblade <nightcityblade@gmail.com> Co-authored-by: Umut Polat <52835619+umut-polat@users.noreply.github.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: wcy <86111164+wcynb1023@users.noreply.github.com> Co-authored-by: Uranus <109661872+UranusSeven@users.noreply.github.com> Co-authored-by: zzt <mf1732009@smail.nju.edu.cn> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Xunzhuo <xunzhuo@vllm-semantic-router.ai> Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com> Co-authored-by: Krishna Chaitanya <krishnabkc15@gmail.com> Co-authored-by: wjinxu <1299461899@qq.com> Co-authored-by: Kc Balusu <kcbalusu@users.noreply.github.com> Co-authored-by: JooHo Lee <96564470+BWAAEEEK@users.noreply.github.com> Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com> Co-authored-by: B-201 <Joy25810@foxmail.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>

* [XPU] add gptq(int4) support (#37844) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [UX] Add a persistent cache for FlashInfer autotuning (#42537) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Bugfix][MRV2] Fix KVCache tensor explicit `kernel_block_size` dim (#42766) Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Model Refactoring] Move DeepSeek V4 layers to `models/deepseek_v4/` [2/N] (#43039) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * add cutedsl dsv4 indexer fp8 kernel (#42899) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> * [Bugfix][KV Connector] Fix SimpleCPUOffloadScheduler TOCTOU between Phase A and Phase B (#42289) Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com> * [ci] Route 28 gpu_1_queue tests to h200_35gb queue (#43030) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * fix: use keyword arguments for shard_id and expert_id in weight_loade… (#42671) Signed-off-by: junyanxu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Docs] Add SVG images for pooling models. (#42626) Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> * [XPU] Use custom op collective behavior (#41354) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Misc] Aligning tokwise pooler heads for consistency (#43041) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> * [Docs] Reorganize online serving docs. (#41907) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Frontend] Consolidate beam search by BeamSearchMixin. (#42946) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> * [Model Refactoring] Move deepseek_v4_ops to models/deepseek_v4 [3/N] (#43073) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [bug] AsyncScheduler drops first post-resume token after pause_generation + clear_cache (#42117) Signed-off-by: hao-aaron <ahao@anyscale.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KVConnector][DSV4] HMA support for Mooncake store connector (#42828) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Model Refactoring] Rename deepseek_v4.py to model.py [4/N] (#43077) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Misc][MM] Remove redundant code in CLIPAttention (#43046) Signed-off-by: shen-shanshan <467638484@qq.com> * [CI] Add MTP + PD disagg test for Qwen3.5 (#42677) Signed-off-by: ZhanqiuHu <zhu@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [Bugfix] Fix top logprobs token placeholders in `/inference/v1/generate` (#42887) Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> * [Perf][4/n] Eliminate various GPU<->CPU syncs (#42347) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [XPU] update xpu graph usage (#43043) Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com> * [Model] Openvla support (#42654) Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> * [Refactor] Extract extract_types_from_schema utility from Minimax M2 tool parser (#43025) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Misc] add humming to dependencies (#42540) Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> * [feat] Add FP8 per-tensor Q scale support to Triton attention backend (#42080) Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> * [Docs] Fix MooncakeStoreConnector role in disaggregated example (#42994) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [Bugfix][MoE] FlashInfer one-sided: workspace union across heterogeneous layers (#42976) Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> * [CI failure] Temporarily disable using persistent cache for flashinfer autotune (#43119) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Move language models tests (hybrid) back to L4 (#43129) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Model] Support post-norm architecture for EAGLE-3 supeculators (#42764) Signed-off-by: Doğaç Eldenk <dogacel@gmail.com> * Fix error in Dynamic NTK scaling (#41277) Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> * [CPU][DOC] Fix installation commands for Arm CPUs (#43115) Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com> * [bug] fix WeightTransferConfig.backend to allow for all strings (#43121) Signed-off-by: ahao-anyscale <ahao@anyscale.com> * [MRV2][BugFix] Fix default-stream CG capture in P/W LoRA case (#43160) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Cohere] Enable Cohere MoE (#43143) Signed-off-by: Terrencezzj <terrence@cohere.ai> * [Perf][Bugfix] Update dflash aux layer indexing (#40727) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> * add enqueue all option to throughput benchmark (#42975) Signed-off-by: Philip Maybank <pmaybank@amd.com> Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Perf] Avoid forward scan for async output placeholders (#42938) * [CI] Add DSV4-Flash to gsm8k moe-refactor/config-b200.txt (#42111) Signed-off-by: mgoin <mgoin64@gmail.com> * [KV Offload] Pass `OffloadingSpec` instead of `VllmConfig` to secondary tiers (#43076) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> * [ci] Revert model executor test back to L4 (#43188) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Docs][PD][NIXL] Lease extension mechanism for blocks on P (#43099) Signed-off-by: NickLucche <nlucches@redhat.com> * [Docs][PD][NIXL] Bidirectional kv-cache transfer (#43097) Signed-off-by: NickLucche <nlucches@redhat.com> * [6/n] Migrate activation kernels, gptq, gguf, non cutlass w8a8 to libtorch stable ABI (continued) (#42663) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * Enable mermaid diagrams in the docs (#43192) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [GDN] Enable FI Blackwell GDN prefill kernel (#40717) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> * [XPU][CI] Add 2 server model test files in Intel GPU CI (#42499) Signed-off-by: zengxian <xiangdong.zeng@intel.com> * [Frontend] Forward X-data-parallel-rank header on /inference/v1/generate (#42330) Signed-off-by: hallerite <git@hallerite.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Doc] Sync CLI guide with actual help modes and launch subcommand (#40326) Signed-off-by: Rui Wang <raygorous@gmail.com> Co-authored-by: Rui Wang <raygorous@gmail.com> * [Feature] Support manually enabling the cumem allocator (#33648) Signed-off-by: Kebe <mail@kebe7jun.com> * [Spec Decode] Support non-MTP speculation for NemotronH (#43130) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> * Remove additional dead code as a follow-up to #42889 (#43144) Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> * [Bug][Structured Outputs] Fix bug that leads to unconstrained generations with structural tags (#42452) Signed-off-by: rishitdholakia13 <rishit+github@cohere.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Bugfix] Use enable_sm120_family for per-tensor FP8 CUTLASS kernels on SM12.1 (#41215) Signed-off-by: j9smith <j.smith9103@outlook.com> Signed-off-by: Joel Smith <j.smith9103@outlook.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Bugfix] Use shared coerce_to_schema_type in DeepSeekV32 tool parser (#43019) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [MISC] Fix symm_mem cap-equal gate; log AR backend selection (#42993) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> * [R3] Add routed experts to openai entrypoint (#38939) Signed-off-by: ahao-anyscale <ahao@anyscale.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [CI] Lower granite-4.0-h-tiny gsm8k threshold for Hybrid SSM NixlConnector PD accuracy tests (4 GPUs) (#43186) Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: NickLucche <nlucches@redhat.com> * Integrate flashinfer b12x MoE and FP4 GEMM kernels for SM120/121 (#40082) Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com> Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com> * [Perf] Optimize `CutlassFP8ScaledMMLinearKernel` when padding needed by pre-weight processing, 13.5% TTFT improvement (#42651) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> * [Bugfix][CI] Add missing import of pad_nvfp4_activation_for_cutlass in flashinfer (#43237) Signed-off-by: sfeng33 <4florafeng@gmail.com> * Add dllehr-amd to CODEOWNERS and committers list (#42772) Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com> * [Perf][gpt-oss] Downgrade triton_kernels to v3.5.1 (#43135) Signed-off-by: mgoin <mgoin64@gmail.com> * [Misc] downgrade nvidia-cutlass-dsl to 4.5.0 (#43230) Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> * [ROCm] Add QuickReduce min-size override and codec threshold (#41675) Signed-off-by: <> * [CI] Add composed-schema regression tests for DeepSeek V3.2/V4 parsers (#43255) Signed-off-by: Ace Eldeib <aeldeib@coreweave.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> * [Model Runner V2] Fix lora `Triton Error [CUDA]: device-side assert triggered` (#43139) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * update GPU json file based on h200 recipes (#43262) Signed-off-by: louie-tsai <louie.tsai@intel.com> * [Minor] Bigger overlap for FI AR (#43103) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [Bugfix] Fix Qwen3.5 GatedDeltaNet in_proj_ba Marlin failure at TP>=2 (#36329) Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Perf][Gemma4] Batch vision encoder calls for image and video processing (#43169) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [CI] Fix "test_vit_cudagraph_[image|video][step3_vl]" failure (#43082) Signed-off-by: haosdent <haosdent@gmail.com> * [Frontend] Normalize reasoning_content to reasoning for client compatibility (#42664) Signed-off-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Refactor] Use shared coerce_to_schema_type in Seed-OSS tool parser (#43140) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [ToolParser][Bugfix] Re-land: Fix anyOf/oneOf/$ref type resolution in Qwen3CoderToolParser (#37831) (#38973) Signed-off-by: AAISSJ <maze0717@g.skku.edu> Signed-off-by: <> Signed-off-by: sejung-son <sejung.son@nhn.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local> Co-authored-by: sejung-son <sejung.son@nhn.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * [Frontend][RFC] Rust front-end integration (#40848) Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] Warn when renderer_num_workers has no effect on offline LLM (#42905) Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> * [Benchmark] Add num-warmup to vllm bench throughput (#43245) Signed-off-by: Yifan Zong <yzong@redhat.com> * [Bugfix] Fix glm4_moe_tool_parser._is_string_type for /v1/responses FunctionTool format (#39601) Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * [CI] De-flake test_models for bigscience/bloom-560m (#43197) Signed-off-by: haosdent <haosdent@gmail.com> * [XPU] add setuptools-rust for xpu dependency (#43287) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * Update KDA chunk prefill decay to use exp2 semantics (#43195) Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> * Fix FlashInfer TRTLLM NvFP4 monolithic MoE routing (#43223) Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> * [Test] Replace zephyr-7b-beta (7B) with SmolLM2-135M in tokenization test (#43085) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bug] Fix ci issue `assert output_size is not None` AssertionError (#43261) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> * [CI] Pin protoc binary in rust-build stages (#43292) Signed-off-by: haosdent <haosdent@gmail.com> * [XPU][CI]Fix Docker image pull-to-run race in Intel GPU CI (#43266) Signed-off-by: zengxian <xiangdong.zeng@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CPU][RISC-V] Add VLEN=256 support to RVV attention kernels (#42943) Signed-off-by: velonica0 <like@mail.nankai.edu.cn> Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> * [Perf] [Hybrid] Fused Triton kernel for GPU-side Mamba state postprocessing (#40172) Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [CI] Fix CPU tests failing on `tl.exp2` import (#43311) Signed-off-by: haosdent <haosdent@gmail.com> * [Bugfix] Add early validation to reject incompatible runner types for embedding models (#43079) Signed-off-by: anish <anishesg@users.noreply.github.com> Signed-off-by: Your Name <ak8686@princeton.edu> Signed-off-by: anish <145943060+anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Deprecation] Mark env vars covered by --moe-backend / --linear-backend (#43148) Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> * [Perf] `zeros` -> `empty` to remove additional fill (#42988) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Core] Add native ModelExpress load format (#43105) Signed-off-by: Zheng Luo <zheluo@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * Disable build isolation to bypass CUDA related deps for vllm-tpu (#43038) Signed-off-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Ylang Tsou <ylangt@google.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> * [Frontend] Rework fastokens integration (#43168) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Feature] Add `--cpu-distributed-timeout-seconds` CLI Option for CPU Process Group Timeout (#42968) Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: zWaNg3 <389750525@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [BugFix] Use correct logprobs for `logprob_token_ids` (#43125) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Zero stale is_prefilling in padded CUDA graph rows for Mamba (#41873) Signed-off-by: Lanze Liu <lanzetech@gmail.com> * [Rust Frontend] Move code from `vllm-frontend-rs` (#43283) Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Eric Curtin <eric.curtin@docker.com> Signed-off-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Signed-off-by: Will.hou <1205157517@qq.com> Signed-off-by: Will.hou <willamhou@ceresman.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Eric Curtin <eric.curtin@docker.com> Co-authored-by: Dev-X25874 <283057883+Dev-X25874@users.noreply.github.com> Co-authored-by: Will.hou <1205157517@qq.com> Co-authored-by: Will.hou <willamhou@ceresman.com> Please see https://github.com/Inferact/vllm-frontend-rs for full original commit history. * [CI] Fix dockerfile dependency graph failure for pre-commit (#43378) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Bugfix] Fix DSV4 Base model swiglu limit issue in FP8 path (#42855) Signed-off-by: Chengze Fan <chengze@meta.com> Signed-off-by: Chengze Fan <fancz2002@gmail.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> * [ROCm] Add XGMI backend for MoRI Connector (#41753) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [ROCm][CI] add warmup to mem_util test before measurement (#43236) Signed-off-by: Divakar Verma <divakar.verma@amd.com> * [Frontend] Add truncation side to OpenAI endpoints (#43260) Signed-off-by: Rui Zhang <rza21.bc@gmail.com> Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net> Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net> * [Frontend] DP Supervisor (#40841) Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Make CuMemAllocator free callback stream-aware (#43020) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude <noreply@anthropic.com> * [XPU] Enable multiple key kernels for sparse attention (#37888) Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com> Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CI] De-flake renderers/test_hf.py::test_resolve_content_format_fallbacks[Qwen/Qwen-VL-string] (#43064) Signed-off-by: haosdent <haosdent@gmail.com> * [Model] Use `AutoWeightsLoader` for Voyage (#42972) Signed-off-by: Furkan Fidan <dev@yufufi.com> * [Model] Fix MiniCPM-V 4.6 vit_merger qkv weight loading (#43213) Signed-off-by: tc-mb <tianchi_cai@icloud.com> * [CI] Fix test_lora_with_spec_decode on V2 model runner (#43314) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [CI] Fix "test_awq_load[gemma4-moe-*]" failure (#43296) Signed-off-by: haosdent <haosdent@gmail.com> * Correcting the mock classes for MM GC tests (#43321) Signed-off-by: Weida Hong <wdhongtw@google.com> * [BugFix] Fix setuptools-rust dep in requirements files (#43377) Signed-off-by: Nick Hill <nickhill123@gmail.com> * Fix the docker build failure in tpu-inference (#43360) Signed-off-by: mrjunwan-lang <mrjunwan@google.com> * [Docs] Note image preprocessing difference between qwen_vl_utils and vllm. (#43393) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [CPU] Experimentally enable Triton and MRV2 (#43225) Signed-off-by: jiang1.li <jiang1.li@intel.com> * [Attention] Mamba attention module refactor (#41126) Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> * [XPU]feat: add XPU fallback for MoE topk routing and MXFP4 backend (#42951) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Misc] Replace assert with proper exceptions for security and validation in pooling (#43286) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Bugfix] Clear P0 mm sender cache on sleep/pause to fix mm_hash desync (#43001) Signed-off-by: Tobias Wasner <wasnertobias@gmail.com> * [BugFix] wire make_empty_intermediate_tensors on AyaVision and Voxtral (#43118) Signed-off-by: Keyi Li <likey6688@gmail.com> Co-authored-by: Keyi Li <likey6688@gmail.com> * [LoRA] Reduce memory of 2D weights when EP is set (#42737) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [EPLB] Change default EPLB communicator (#43110) Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> * [CI] Fix AMD docker build tests (#43329) Signed-off-by: haosdent <haosdent@gmail.com> * Add NVFP4 MOE support for Deepseek V4. (#42209) Signed-off-by: Shiyang Chen <shiychen@nvidia.com> * [Multimodal] Simplify ViT CUDA graph interfaces (#41234) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Rust Frontend] [Refactor] Extract a newtype for utility call ID (#43405) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] Source num_qo_heads from Attention layers in Flashinfer/Triton metadata builders (#42650) Signed-off-by: zhanda <zhandazhu@gmail.com> Co-authored-by: Shang Wang <shangw@nvidia.com> * [KV Connector] MooncakeStore: don't co-queue save with load to avoid double delayed-free (#43371) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Refactor] Extract DeepSeek V4 sparse MLA impl into model folder (#43149) * [Frontend] Simplify AuthenticationMiddleware path extraction (#43426) Signed-off-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [RFC][EPLB][#32028] Remove dead torch.accelerator.synchronize() from sync path (#40733) Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> * [Bugfix] Detect wrong libcute_dsl_runtime.so variant in FlashInfer GDN (#43427) Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> * [Bugfix] Clear error message for FP8 torchao quantization on unsupported GPUs (#36854) Signed-off-by: haosdent <haosdent@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * mhc_post - remove sts & add vectorized copies (#43437) Signed-off-by: george <george@inferact.ai> Co-authored-by: george <george@inferact.ai> * [Quantization][ModelOpt] W4A16 NVFP4 fused MoE + mixed-precision dispatch (#42566) Signed-off-by: Juhi Mittal <juhim@nvidia.com> * [Model Runner V2] Support sharing kv cache layers (#35045) Signed-off-by: Nick Hill <nickhill123@gmail.com> * DSv4 fused Q-norm kernel grid refactor (#42353) * [Perf] Optimize hidden state extraction logic (#37374) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [XPU]fix: add XPU platform guards to DeepSeek-V4 ops (#42950) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * elastic_ep: stage/commit MoE quant method on reconfigure (#40881) Signed-off-by: Itay Alroy <ialroy@nvidia.com> * [Attention] Add head_dim=512 support for FlashInfer trtllm attention backend (#38822) * Add `model` to `WeightTransferEngine.__init__` (#42922) Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSV4] More multi-stream enablement for c4a (#42925) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Stabilize runner teardown between sampler tests (#43023) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm][CI] Stabilize Granite tool-use and test URL construction (#43017) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix] Auto-raise max_num_batched_tokens for prefix-LM multimodal models (#43051) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com> * [ROCm][CI] Fix ROCm LoRA Transformers fallback with full CUDA graphs (#41577) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [XPU]feat: enable FP8 block-scaled quantization on XPU (#42952) Signed-off-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [XPU] reudce host overhead of XPU MOE (#42915) Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [7/n] Migrate pos_encoding and norm kernels to libtorch stable ABI (continued) (#43209) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Misc] Added missing return type annotations to improve mypy and IDE tooling (#43383) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> * [Bugfix] Fix native Triton top-k/top-p kernel assumes contiguous logi… (#42739) Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> * [ModelOpt] Support Qwen3.5/3.6 VLM quantized prefix mapping (#42546) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * Keep scheduler alive for delayed KV connector frees (#43433) Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> * fix(eagle3): read norm_before_fc from eagle_config for NVIDIA checkpoint (#42143) Signed-off-by: FERRARIZHENG <popkart06@gmail.com> * [Kernel] Batch invariant NVFP4 linear using cutlass (#39912) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Remove benchmarks test group and shard long test groups (#41669) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix][Frontend] Fix input_audio parsing when uuid is present (#43414) Signed-off-by: ffggs <314137448@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attention (#42787) Signed-off-by: Hua Huang <huah@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Docs] Fix stale version number in token_embed.md (#43488) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [Docs] Fix stale version number in token_classify.md (#43489) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [MoE] Migrate W4A8 CT to oracle kernel setup (#42680) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [Mooncake] Add metrics for MooncakeStoreConnector operations (#43392) * [ROCm][Critical] Fix the GDN import bug (#43486) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Revert "[Misc] add humming to dependencies" (#43492) * [Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Model Runner v2] Force v1 runner for tests (#43233) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [KV Connector] Keep MooncakeStore full hits block-aligned (#43494) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [kv_offload]: Add DSv4 support (#43142) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [ROCm][CI] Stabilize 400 error return code for invalid schema inputs (#43016) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Tuning script and configs for Triton Mamba SSU kernel (#43083) Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> * File system secondary tier implemented in python (#41735) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> * [Kernel] Add mhc_pre_big_fuse_with_norm_tilelang (#43474) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * fix: MoE model using shared routed experts crashes on AMD GPUs (#42373) Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> * [Docs] Reorganize offline inference docs. (#43552) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target (#40275) Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com> * [Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Doc] Add section on escalating stalled contributions (#43568) Signed-off-by: esmeetu <jasonailu87@gmail.com> * Reduce memory usage for granite_speech. (#42933) Signed-off-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Connector] Handle Mooncake finish after preemption (#43281) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> * [Misc] Print accuracy value for PD tests even on success (#43583) Signed-off-by: NickLucche <nlucches@redhat.com> * [Kernel] Remove NormGateLinear (#43554) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [XPU] Ensure RNG offset alignment with PyTorch requirements in XPU sampler (#43028) Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [LoRA] Add one shot triton kernel For MoE LoRA (#42290) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [DeepSeek V4] Move MegaMoE input prep kernel to nvidia/ops (#43632) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [KV Connector][Bugfix] MooncakeStore: don't double-apply Eagle prune in load_mask (#43516) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Connector] Propagate MooncakeStore load failures (#42788) Signed-off-by: Dao Le <Dao007forever@gmail.com> * [Bugfix] fix device mismatch in MiniCPM-o-4_5 resampler (#43194) Signed-off-by: Yan Ma <yan.ma@intel.com> * [Frontend] Split the offline inference APIs and utils. (#43553) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Bugfix][Model] Fix GPT2ForSequenceClassification sub-module prefix (#43579) Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [GDN] GDN Prefill kernel for SM100 (#43273) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032) Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> * Upgrade tpu-inference to v0.20.0 (#43394) * Add CuTe DSL sparse compressor support (#43584) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [chores][log] change registry log from `warning` to `debug` (#43045) Signed-off-by: Hank <hcc.mayday@gmail.com> * [Bugfix] Apply fc_norm in Eagle3DeepseekV2 combine_hidden_states (#43482) Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Transfer] Enable HMA by default for connectors that support it (#41847) Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> * [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args (#43303) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [Misc] Support interleaved custom image benchmark datasets (#43636) Signed-off-by: ThibaultCastells <thib.castells@icloud.com> * [Reasoning] [Bugfix] Reject invalid thinking_token_budget values (#43402) Signed-off-by: linzm1007 <linzm1007@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model] Use AutoWeightsLoader for InternLM2 (#38278) Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [XPU] Fix fused MoE LoRA kernel crash on XPU by using platform-agnos num_compute_units (#43646) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> * Fix CuPy runtime deps and restore humming (#43530) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Docs][ROCm] MoRI-IO Connector Usage Guide (#43603) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ROCm][CI] Extend ROCm quick reduce coverage (#40990) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162) * [MoE Refactor] Migrate ModelOptMxFp8FusedMoE to oracle (#42768) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [MoE Refactor] W4a8 int8 oracle (#42789) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [ROCm] Remove MegaMoE integration in deepseek v4 (#43629) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Add LM head quantization support for ModelOpt (#42124) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * [Doc] Add line limit to AGENTS.md (#43635) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> * [DSv4] Drop _get_compressed_kv_buffer in DeepseekCompressor (#43690) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CI] Soft-fail AMD entrypoints mirror tests (#43709) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Porting fuse_minimax_qk_norm to manual fusion (#43410) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [KV Connector] MooncakeStore: drop dead discard_partial_chunks parameter (#43627) Signed-off-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup (#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Add arm64 ci image (#41303) Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bugfix] Split attention groups by num_heads_q for spec-decode drafts (#43543) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Rust Frontend] Add reasoning/tool parser & renderer roundtrip tests (#43582) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [ROCm][CI] Fix ROCm multimodal Qwen2.5-VL activation compile and Phi4MM ragged image mask handling (#43647) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Perf] Optimize Fp8BlockScaledMMLinearKernel input_scale tensor using new_empty() (#43677) Signed-off-by: Xin Yang <xyangx@amazon.com> * [Attention] Make FlexAttention and FlashAttention use num-blocks first layouts (#42095) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [MLA][Attention] Add OOT MLA prefill backend registration mechanism (#43325) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> * [Deprecation] Deprecate functions as scheduled for v0.21.0 (#43358) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSv4] Refactor compressor & Fix ROCm compatibility (#43710) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Fix test_aot_compile for torch 2.12 (#43695) Signed-off-by: Angela Yi <yiangela7@gmail.com> * [KVConnector][Mooncake] Wire reset_cache cascade end-to-end (#42694) Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [ROCm][Perf] Expose AITER MoE sorting dispatch policy via env var (#39177) Signed-off-by: nholmber <nholmber@users.noreply.github.com> * [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Frontend] Add MiniCPM5 XML tool call parser (#43175) Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][GPT-OSS] Avoid repeated compile-time `cos_sin_cache.to(bf16)` casts in rotary path (#42833) Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> * [Doc] Add Ascend NPU tab to the quickstart installation guide (#43550) Signed-off-by: Aditya Singh <adisin650@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Rust Frontend] Align tool parser fallback behavior between streaming & non-streaming paths (#43662) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Docs] Fix MLA prefill backend default docs (#43697) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marlin-aligned W4A16 shapes (#43731) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43401) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [misc] Bump cutedsl version to 4.5.2 (#43745) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLINE=1 (#39155) Signed-off-by: Injae Ryou <injaeryou@gmail.com> * [Docs] Fix the duplicate doc icon issue (#43546) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * Fix early CUDA init (#43791) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL (#41751) Signed-off-by: jatseng-ai <jatseng@amd.com> * [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Quantization] Fix Humming RoutedExperts import (#43540) Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [CI] build-rocm-wheels.yml: reduce MAX_JOBS to prevent OOM Signed-off-by: <callumm@amd.com> --------- Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: george <george@inferact.ai> Signed-off-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Signed-off-by: junyanxu <junyanxu5513@gmail.com> Signed-off-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: hao-aaron <ahao@anyscale.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: shen-shanshan <467638484@qq.com> Signed-off-by: ZhanqiuHu <zhu@redhat.com> Signed-off-by: Sage Ahrac <sagiahrak@gmail.com> Signed-off-by: Xinyu Chen <xinyu1.chen@intel.com> Signed-off-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> Signed-off-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Signed-off-by: Dao Le <Dao007forever@gmail.com> Signed-off-by: Tomer Asida <57313761+tomeras91@users.noreply.github.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Signed-off-by: Doğaç Eldenk <dogacel@gmail.com> Signed-off-by: Max de Bayser <mbayser@br.ibm.com> Signed-off-by: Max de Bayser <maxdebayser@gmail.com> Signed-off-by: Fadi Arafeh <fadi.arafeh@arm.com> Signed-off-by: ahao-anyscale <ahao@anyscale.com> Signed-off-by: Terrencezzj <terrence@cohere.ai> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Philip Maybank <pmaybank@amd.com> Signed-off-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Signed-off-by: mgoin <mgoin64@gmail.com> Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: Artem Perevedentsev <aperevedents@nvidia.com> Signed-off-by: zengxian <xiangdong.zeng@intel.com> Signed-off-by: hallerite <git@hallerite.com> Signed-off-by: Rui Wang <raygorous@gmail.com> Signed-off-by: Kebe <mail@kebe7jun.com> Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com> Signed-off-by: rishitdholakia13 <rishit+github@cohere.com> Signed-off-by: j9smith <j.smith9103@outlook.com> Signed-off-by: Joel Smith <j.smith9103@outlook.com> Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: haosdent <haosdent@gmail.com> Signed-off-by: Meenakshi Venkataraman <meenakshiv@nvidia.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Douglas Lehr <Doug.Lehr@amd.com> Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: <> Signed-off-by: Ace Eldeib <aeldeib@coreweave.com> Signed-off-by: louie-tsai <louie.tsai@intel.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Signed-off-by: Adi McM Sonus Flow <biuro@sonusflow.pl> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Signed-off-by: Ben Browning <bbrownin@redhat.com> Signed-off-by: AAISSJ <maze0717@g.skku.edu> Signed-off-by: sejung-son <sejung.son@nhn.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: Yiyang Liu <37043548+ianliuy@users.noreply.github.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Signed-off-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: velonica0 <like@mail.nankai.edu.cn> Signed-off-by: velonica0 <47554626+velonica0@users.noreply.github.com> Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com> Signed-off-by: anish <anishesg@users.noreply.github.com> Signed-off-by: Your Name <ak8686@princeton.edu> Signed-off-by: anish <145943060+anishesg@users.noreply.github.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Zheng Luo <zheluo@nvidia.com> Signed-off-by: Ylang Tsou <ylangt@google.com> Signed-off-by: fangyuchu <fangyuchu@qq.com> Signed-off-by: zWaNg3 <389750525@qq.com> Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Chengze Fan <chengze@meta.com> Signed-off-by: Chengze Fan <fancz2002@gmail.com> Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Divakar Verma <divakar.verma@amd.com> Signed-off-by: Rui Zhang <rza21.bc@gmail.com> Signed-off-by: Rui Zhang <rui.zhang@globalrelay.net> Signed-off-by: Robert Shaw <robertgshaw2@gmail.com> Signed-off-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Signed-off-by: zixi-qi <zixi@inferact.ai> Signed-off-by: Xiaochang Wu <xiaochang.wu@intel.com> Signed-off-by: Wu, Xiaochang <xiaochang.wu@intel.com> Signed-off-by: Furkan Fidan <dev@yufufi.com> Signed-off-by: tc-mb <tianchi_cai@icloud.com> Signed-off-by: Weida Hong <wdhongtw@google.com> Signed-off-by: mrjunwan-lang <mrjunwan@google.com> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com> Signed-off-by: Ma Jian <jian1.ma@intel.com> Signed-off-by: Tobias Wasner <wasnertobias@gmail.com> Signed-off-by: Keyi Li <likey6688@gmail.com> Signed-off-by: Markov Ilya <markovilya19@gmail.com> Signed-off-by: Shiyang Chen <shiychen@nvidia.com> Signed-off-by: zhanda <zhandazhu@gmail.com> Signed-off-by: Russell Bryant <rbryant@redhat.com> Signed-off-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Signed-off-by: Juhi Mittal <juhim@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Itay Alroy <ialroy@nvidia.com> Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Signed-off-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> Signed-off-by: Zihua Wu <13583761+lucifer1004@users.noreply.github.com> Signed-off-by: FERRARIZHENG <popkart06@gmail.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: ffggs <314137448@qq.com> Signed-off-by: Hua Huang <huah@nvidia.com> Signed-off-by: holegots <ikun3.1415927@gmail.com> Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Yihuki <wangbovbvb@gmail.com> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Yan Ma <yan.ma@intel.com> Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> Signed-off-by: Hank <hcc.mayday@gmail.com> Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> Signed-off-by: ThibaultCastells <thib.castells@icloud.com> Signed-off-by: linzm1007 <linzm1007@126.com> Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Kevin Luu <kevin@inferact.ai> Signed-off-by: Zhewen Li <zhewen@inferact.ai> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Angela Yi <yiangela7@gmail.com> Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> Signed-off-by: Aditya Singh <adisin650@gmail.com> Signed-off-by: Injae Ryou <injaeryou@gmail.com> Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> Signed-off-by: jatseng-ai <jatseng@amd.com> Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Signed-off-by: <callumm@amd.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: gnovack <gnovack@amazon.com> Co-authored-by: george <george@inferact.ai> Co-authored-by: Qiuyang Yue <yueqiuyang1389@gmail.com> Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com> Co-authored-by: gemini-code-assist <noreply@google.com> Co-authored-by: Kevin H. Luu <khluu000@gmail.com> Co-authored-by: Junyan Xu <junyanxu5513@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Gracie Guo (UX) <114208705+gracie-guo@users.noreply.github.com> Co-authored-by: Gracie Guo <gracieguo@Gracies-MacBook-Pro.local> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cyrus Leung <cyrus.tl.leung@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Aaron Hao <ahao@anyscale.com> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Shanshan Shen <467638484@qq.com> Co-authored-by: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com> Co-authored-by: Sage <80211083+sagearc@users.noreply.github.com> Co-authored-by: Xinyu Chen <xinyu1.chen@intel.com> Co-authored-by: Wang Yiwen <121547057+yiwen101@users.noreply.github.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> Co-authored-by: Jinzhen Lin <jinzhen.ljz@antgroup.com> Co-authored-by: Dom Brown <3886319+DomBrown@users.noreply.github.com> Co-authored-by: Dao007forever <dao007forever@gmail.com> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: Doğaç Eldenk <dogacel@gmail.com> Co-authored-by: Max de Bayser <mbayser@br.ibm.com> Co-authored-by: Fadi Arafeh <115173828+fadara01@users.noreply.github.com> Co-authored-by: Terrence Zhao <32208165+Terrencezzj@users.noreply.github.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: pmaybank <113125070+pmaybank@users.noreply.github.com> Co-authored-by: Izik Golan <47969623+izikgo@users.noreply.github.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Artem Perevedentsev <aperevedents@nvidia.com> Co-authored-by: xiangdong <40376367+zxd1997066@users.noreply.github.com> Co-authored-by: hallerite <git@hallerite.com> Co-authored-by: Ray Wang <roguerui6@gmail.com> Co-authored-by: Rui Wang <raygorous@gmail.com> Co-authored-by: Kebe <mail@kebe7jun.com> Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com> Co-authored-by: rishitdholakia13 <123388671+rishitdholakia13@users.noreply.github.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Joel Smith <j.smith9103@outlook.com> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: haosdent <haosdent@gmail.com> Co-authored-by: meena-at-work <80416898+meena-at-work@users.noreply.github.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> Co-authored-by: akii96 <aakif.nawaz@amd.com> Co-authored-by: Ace Eldeib <alexeldeib@gmail.com> Co-authored-by: Louie Tsai <louie.tsai@intel.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: sonusflow <git@sonusflow.pl> Co-authored-by: Luciano Martins <22145370+lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: 손세정 <maze0717@g.skku.edu> Co-authored-by: 세덩 <saison@sedeong-ui-MacBookAir.local> Co-authored-by: sejung-son <sejung.son@nhn.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Daoyuan Li <94409450+DaoyuanLi2816@users.noreply.github.com> Co-authored-by: yzong-rh <yzong@redhat.com> Co-authored-by: Yiyang "Ian" Liu <yiyangliu@microsoft.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: zexplorerhj <zhjoneson@163.com> Co-authored-by: zexplorerhj <19794632+zexplorerhj@users.noreply.github.com> Co-authored-by: zhangxin81 <115389973+zhangxin81@users.noreply.github.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: velonica0 <47554626+velonica0@users.noreply.github.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Francesco Fusco <ffu@zurich.ibm.com> Co-authored-by: anish <145943060+anishesg@users.noreply.github.com> Co-authored-by: anish <anishesg@users.noreply.github.com> Co-authored-by: Zheng Luo <zheluo@nvidia.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: ylangtsou <149562838+ylangtsou@users.noreply.github.com> Co-authored-by: Ylang Tsou <ylangt@google.com> Co-authored-by: fangyuchu <fangyuchu@qq.com> Co-authored-by: zWaNg3 <389750525@qq.com> Co-authored-by: Lanze Liu <86434077+liulanze@users.noreply.github.com> Co-authored-by: Chengze Fan <fancz2002@gmail.com> Co-authored-by: Lu Fang <30275821+houseroad@users.noreply.github.com> Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com> Co-authored-by: ruizhang <rza21.bc@gmail.com> Co-authored-by: Rui Zhang <rui.zhang@globalrelay.net> Co-authored-by: robertgshaw2-redhat <robertgshaw2@gmail.com> Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com> Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com> Co-authored-by: Furkan F <id+git@yufufi.com> Co-authored-by: tc-mb <157115220+tc-mb@users.noreply.github.com> Co-authored-by: Weida Hong <wdhongtw@google.com> Co-authored-by: mrjunwan-lang <mrjunwan@google.com> Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com> Co-authored-by: Ma Jian <jian1.ma@intel.com> Co-authored-by: Tobias Wasner <wasnertobias@users.noreply.github.com> Co-authored-by: Keyi Li <94494390+JasonKeyiL@users.noreply.github.com> Co-authored-by: Keyi Li <likey6688@gmail.com> Co-authored-by: Ilya Markov <markovilya197@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: sychen52 <41452870+sychen52@users.noreply.github.com> Co-authored-by: Zhanda Zhu <49645678+zhandaz@users.noreply.github.com> Co-authored-by: Shang Wang <shangw@nvidia.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Russell Bryant <rbryant@redhat.com> Co-authored-by: SandishKumarHN <sandishkumarhn@gmail.com> Co-authored-by: SandishKumarHN <3078999+SandishKumarHN@users.noreply.github.com> Co-authored-by: Juhi Mittal <39641197+juhi10071998@users.noreply.github.com> Co-authored-by: Itay Alroy <75032521+itayalroy@users.noreply.github.com> Co-authored-by: Duncan Moss <djm.moss@gmail.com> Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com> Co-authored-by: Andreas Karatzas <akaratza@amd.com> Co-authored-by: Ashwin Giridharan <ashwing@users.noreply.github.com> Co-authored-by: abinggo <107740309+abinggo@users.noreply.github.com> Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com> Co-authored-by: Xiaogang Zhou <zhou16386@163.com> Co-authored-by: xiaogang.zhou <xiaogang.zhou@bytedance.com> Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com> Co-authored-by: Gabriel Wu <13583761+lucifer1004@users.noreply.github.com> Co-authored-by: GuangYaoZheng <popkart06@gmail.com> Co-authored-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: ffggs <314137448@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Hua Huang <huangh1994@outlook.com> Co-authored-by: Holegots <fuergaosi@gmail.com> Co-authored-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: danisereb <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> Co-authored-by: Rotem Shavitt <rshavitt@gmail.com> Co-authored-by: weizhoublue <45163302+weizhoublue@users.noreply.github.com> Co-authored-by: Nguyễn Thế Duy <dtnguyen@nvidia.com> Co-authored-by: Roy Wang <jasonailu87@gmail.com> Co-authored-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: Zhewen Li <zhewenli@meta.com> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: Huanyu Yang <20242081160@mail.dlut.edu.cn> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com> Co-authored-by: Sting Lin <sting.lin@cienet.com> Co-authored-by: Jie Fang <jief@nvidia.com> Co-authored-by: Hank_ <37239608+ILikeIneine@users.noreply.github.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Ethan Feng <ethan.fengch@gmail.com> Co-authored-by: Thibault Castells <38716394+ThibaultCastells@users.noreply.github.com> Co-authored-by: linzm1007 <96732179+linzm1007@users.noreply.github.com> Co-authored-by: Javier De Jesus <javier.dejesusj9@gmail.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Angela Yi <yiangela7@gmail.com> Co-authored-by: aoshen02 <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Nico Holmberg <nico.holmberg@amd.com> Co-authored-by: zhangtao2-1 <478679312@qq.com> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com> Co-authored-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Chunyang Wen <chunyang.wen@gmail.com> Co-authored-by: jatseng-ai <jatseng@amd.com> Co-authored-by: Minh Vu <vuhoangminh97@gmail.com>

…ti-API-server DP startup (vllm-project#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>

…ti-API-server DP startup (vllm-project#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

…ti-API-server DP startup (vllm-project#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

…ti-API-server DP startup (vllm-project#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>

…ti-API-server DP startup (vllm-project#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes. Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252 Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 vllm-project#40727 vllm-project#40737 vllm-project#40749 vllm-project#40961 vllm-project#41119 vllm-project#41133 vllm-project#41233 vllm-project#41237 vllm-project#41411 vllm-project#41496 vllm-project#41549 vllm-project#41674 vllm-project#41873 vllm-project#41895 vllm-project#42040 vllm-project#42112 vllm-project#42289 vllm-project#42479 vllm-project#42585 vllm-project#42692 vllm-project#42706 vllm-project#42709 vllm-project#42739 vllm-project#42967 vllm-project#43001 vllm-project#43079 vllm-project#43125 vllm-project#43160 vllm-project#43616 vllm-project#43669 vllm-project#43719 vllm-project#43768 vllm-project#43808 vllm-project#43961 vllm-project#43982 vllm-project#43988 vllm-project#43998 vllm-project#44057 vllm-project#44560 vllm-project#44574 vllm-project#44568 vllm-project#44603 vllm-project#44744 vllm-project#45195 vllm-project#45345 vllm-project#45383 vllm-project#45487 vllm-project#45564 vllm-project#45673 Runner fix (2): vllm-project#44568 vllm-project#44603 Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU) Conflict resolutions: - Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560 - Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195 - Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982 Co-authored-by: GitHub Copilot Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>

vadiklyutiy requested review from DarkLight1337, NickLucche, aarnphm, hmellor, mgoin, njhill, robertgshaw2-redhat and russellb as code owners May 14, 2026 01:36

claude Bot reviewed May 14, 2026

View reviewed changes

mergify Bot added frontend v1 bug Something isn't working labels May 14, 2026

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

Comment thread vllm/v1/utils.py

Comment thread vllm/v1/utils.py

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

vadiklyutiy mentioned this pull request May 17, 2026

[Bugfix] Close ApiServer ZMQ bind race with wildcard bind + pipe-back #40596

Closed

mergify Bot added the needs-rebase label May 23, 2026

robertgshaw2-redhat reviewed May 24, 2026

View reviewed changes

Comment thread vllm/entrypoints/cli/serve.py

robertgshaw2-redhat reviewed May 24, 2026

View reviewed changes

robertgshaw2-redhat previously requested changes May 24, 2026

View reviewed changes

vadiklyutiy and others added 2 commits May 25, 2026 13:58

fix race in socket allocation for zmq

9cfac41

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

Update vllm/v1/utils.py

07d2092

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com>

vadiklyutiy mentioned this pull request May 27, 2026

[BugFix] Fix hard-coded timeout for multi-API-server startup #43768

Merged

vadiklyutiy added a commit to vadiklyutiy/vllm that referenced this pull request May 28, 2026

[Bugfix][Ray DP] Pre-allocate API-server ports to fix vllm-project#42585

bdb6114

hang Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

vadiklyutiy added a commit to vadiklyutiy/vllm that referenced this pull request May 28, 2026

[Bugfix][Ray DP] Pre-allocate API-server ports to fix vllm-project#42585

934da40

hang Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

vadiklyutiy mentioned this pull request May 28, 2026

[Bugfix] Exclude Ray DP from #42585's deferred port allocation #43864

Merged

njhill pushed a commit that referenced this pull request May 28, 2026

[Bugfix] Exclude Ray DP from #42585's deferred port allocation (#43864)

5d126dd

Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

andakai pushed a commit to andakai/vllm that referenced this pull request Jun 4, 2026

[Bugfix] Exclude Ray DP from vllm-project#42585's deferred port alloc…

aa4fbf2

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

arbi-dev pushed a commit to arbicity/vllm-turboattn that referenced this pull request Jun 6, 2026

[Bugfix] Exclude Ray DP from vllm-project#42585's deferred port alloc…

1be7a57

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

knight0528 pushed a commit to knight0528/vllm that referenced this pull request Jun 8, 2026

[Bugfix] Exclude Ray DP from vllm-project#42585's deferred port alloc…

cfb31f9

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Bugfix] Exclude Ray DP from vllm-project#42585's deferred port alloc…

c1816fa

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Bugfix] Exclude Ray DP from vllm-project#42585's deferred port alloc…

c14cfaf

…ation (vllm-project#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com>

MingqiWang-coder mentioned this pull request Jul 1, 2026

[Sync] Upstream V1 engine core — 89 PRs (bugfix, scheduler, runner, worker, hardware) vLLM-HUST/vllm-hust#82

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup#42585

[Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup#42585
vllm-bot merged 7 commits into
vllm-project:mainfrom
vadiklyutiy:zmq-socket-fix

vadiklyutiy commented May 14, 2026 •

edited

Loading

claude Bot left a comment

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

vadiklyutiy commented May 14, 2026

gemini-code-assist Bot left a comment

gemini-code-assist Bot May 14, 2026

vadiklyutiy commented May 14, 2026

vadiklyutiy commented May 15, 2026

vadiklyutiy commented May 17, 2026

mergify Bot commented May 23, 2026

Uh oh!

robertgshaw2-redhat commented May 24, 2026

robertgshaw2-redhat May 24, 2026 •

edited

Loading

robertgshaw2-redhat May 24, 2026 •

edited

Loading

robertgshaw2-redhat May 24, 2026

njhill May 24, 2026

robertgshaw2-redhat left a comment •

edited

Loading

njhill commented May 24, 2026

njhill commented May 24, 2026 •

edited

Loading

ehfd commented May 27, 2026

vadiklyutiy commented May 27, 2026

ehfd commented May 27, 2026

vadiklyutiy commented May 27, 2026

ehfd commented May 29, 2026

Labels

6 participants

Uh oh!

Uh oh!

Conversation

vadiklyutiy commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Root cause

Fix

Tests

Unit test

Stress testing

claude Bot left a comment

Choose a reason for hiding this comment

Claude Code Review

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

vadiklyutiy commented May 14, 2026

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

vadiklyutiy commented May 14, 2026

vadiklyutiy commented May 15, 2026

vadiklyutiy commented May 17, 2026

mergify Bot commented May 23, 2026

Uh oh!

robertgshaw2-redhat commented May 24, 2026

robertgshaw2-redhat May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

robertgshaw2-redhat May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

robertgshaw2-redhat May 24, 2026

Choose a reason for hiding this comment

njhill May 24, 2026

Choose a reason for hiding this comment

robertgshaw2-redhat left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

njhill commented May 24, 2026

njhill commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

ehfd commented May 27, 2026

vadiklyutiy commented May 27, 2026

ehfd commented May 27, 2026

vadiklyutiy commented May 27, 2026

ehfd commented May 29, 2026

Labels

6 participants

vadiklyutiy commented May 14, 2026 •

edited

Loading

robertgshaw2-redhat May 24, 2026 •

edited

Loading

robertgshaw2-redhat May 24, 2026 •

edited

Loading

robertgshaw2-redhat left a comment •

edited

Loading

njhill commented May 24, 2026 •

edited

Loading