[ROCm][Quant][Perf] Minimax-M3: Enable fp8_per_channel for bf16 weights on mi300x#45854

Merged: tjtanaa merged 3 commits into vllm-project:main from hongxiayang:fp8-pc-rocm on Jun 17, 2026

@hongxiayang hongxiayang commented Jun 16, 2026

Purpose

Improve the performance of the MiniMax-M3 bf16 model on MI300X (gfx942).

The fp8 w8a8 MoE quant config dropped the SwiGLU-OAI alpha/beta that models
such as MiniMax-M3 pass to FusedMoE (swiglu_alpha=1.702, swiglu_beta=1.0).
Only swiglu_limit was forwarded, so the silu_and_mul_with_clamp kernel ran
with its default alpha=1.0/beta=0.0 and produced garbage (gsm8k 0.00) on both
the serialized (Fp8MoEMethod) and online (_Fp8OnlineMoEBase) fp8 MoE paths.
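
To make the failure mode concrete, here is a minimal scalar sketch of a SwiGLU-OAI style activation. It assumes the gpt-oss style definition out = (up + beta) * gate * sigmoid(alpha * gate), with the gate clamped from above and the up projection clamped symmetrically; the function name `swiglu_oai` and the clamp details are illustrative assumptions, and the real silu_and_mul_with_clamp kernel may differ.

```python
import math


def swiglu_oai(gate, up, alpha=1.0, beta=0.0, limit=None):
    """Scalar sketch of SwiGLU-OAI (assumed form, not the vLLM kernel)."""
    if limit is not None:
        # Assumed clamp convention: gate capped from above, up clamped symmetrically.
        gate = min(gate, limit)
        up = max(-limit, min(up, limit))
    glu = gate / (1.0 + math.exp(-alpha * gate))  # gate * sigmoid(alpha * gate)
    return (up + beta) * glu


gate, up = 1.0, 0.5
with_defaults = swiglu_oai(gate, up)                       # alpha=1.0, beta=0.0
with_model_params = swiglu_oai(gate, up, alpha=1.702, beta=1.0)
plain_silu_and_mul = gate / (1.0 + math.exp(-gate)) * up   # silu(gate) * up

print(with_defaults, with_model_params)
```

With the defaults the function reduces to plain silu(gate) * up, which is a different function from the one the model was trained with, so every MoE layer output is wrong rather than slightly off.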

Plumb gemm1_alpha/gemm1_beta through the fp8 w8a8 MoE config chain:

  • fp8_w8a8_moe_quant_config: add the params and forward them to FusedMoEQuantConfig.make
  • make_fp8_moe_quant_config: forward them in the default/TRITON branch
  • Fp8MoEMethod.get_fused_moe_quant_config: read layer.swiglu_alpha/beta
  • _Fp8OnlineMoEBase.get_fused_moe_quant_config: same, for the online path

Also add "fp8_per_channel" to the ROCm supported_quantization allowlist. The
PTPC methods (Fp8PtpcOnline{Linear,MoE}Method) already exist and the ROCm
rowwise fp8 scaled-MM supports per-channel weight scales; the method was simply
not allowlisted on ROCm.
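
The plumbing described above can be sketched roughly as follows. This is a standalone illustration, not the vLLM code: `QuantConfigSketch` and `build_config` are hypothetical stand-ins for FusedMoEQuantConfig and the get_fused_moe_quant_config methods, and the swiglu_limit value of 7.0 is made up. Only the attribute names (swiglu_limit, swiglu_alpha, swiglu_beta) and the parameter names (gemm1_alpha, gemm1_beta) come from the PR text.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Optional


@dataclass
class QuantConfigSketch:
    """Hypothetical stand-in for the real quant config object."""
    swiglu_limit: Optional[float] = None
    gemm1_alpha: Optional[float] = None
    gemm1_beta: Optional[float] = None


def build_config(layer) -> QuantConfigSketch:
    # Read the activation parameters off the layer; fall back to None so
    # models without SwiGLU-OAI parameters keep the default behaviour.
    return QuantConfigSketch(
        swiglu_limit=getattr(layer, "swiglu_limit", None),
        gemm1_alpha=getattr(layer, "swiglu_alpha", None),
        gemm1_beta=getattr(layer, "swiglu_beta", None),
    )


# A layer that carries SwiGLU-OAI parameters (limit value is illustrative).
oai_layer = SimpleNamespace(swiglu_limit=7.0, swiglu_alpha=1.702, swiglu_beta=1.0)
# A layer that does not define them at all.
plain_layer = SimpleNamespace()

oai_cfg = build_config(oai_layer)
plain_cfg = build_config(plain_layer)
print(oai_cfg)
print(plain_cfg)
```

Using getattr with a None default mirrors the existing swiglu_limit handling shown in the review diff, so layers that never set these attributes are unaffected.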

Verified on bf16 model with --quantization fp8_per_channel on MI300x.

For the 1k/1k config:

  • Weight bytes per token are halved, giving roughly 2× effective weight-read bandwidth in decode (batched decode is HBM-bandwidth-bound). This accounts for the +28% throughput at concurrency 64.
  • Weight memory drops by about 49%, which yields about 75% more KV capacity, so roughly 1.75× more concurrent requests (or roughly 1.75× longer context) fit before hitting the KV ceiling.
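
The percentages quoted for the 1k/1k config can be re-derived from the Test Result table in this description. The short script below is only a sanity check on that arithmetic; it compares the bf16 (no quant) row with the fp8_per_channel + AITER linear row, which is an assumption about which rows the +28% figure refers to.

```python
# Numbers copied from the Test Result table in this PR description.
bf16_weights_gib, fp8_weights_gib = 99.9, 50.6      # model GiB/GPU
bf16_kv_tokens, fp8_kv_tokens = 1.55e6, 2.72e6      # GPU KV tokens
bf16_conc64, fp8_aiter_conc64 = 1476.0, 1894.0      # conc64 tok/s

weight_reduction = 1.0 - fp8_weights_gib / bf16_weights_gib
kv_gain = fp8_kv_tokens / bf16_kv_tokens - 1.0
conc64_gain = fp8_aiter_conc64 / bf16_conc64 - 1.0

print(f"weight memory reduction: {weight_reduction:.0%}")  # 49%
print(f"KV capacity gain:        {kv_gain:.0%}")           # 75%
print(f"conc64 throughput gain:  {conc64_gain:.0%}")       # 28%
```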

Use cases that benefit:

  • Long output-len, high concurrency (decode-dominated)
  • Long total context × high concurrency

Aided with Claude.

Test Plan

Serve

VLLM_USE_BREAKABLE_CUDAGRAPH=0 VLLM_ROCM_USE_AITER=1 VLLM_ROCM_USE_AITER_MOE=0 \
vllm serve <WEIGHTS> \
  --served-model-name mm3 \
  --tensor-parallel-size 8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --block-size 128 \
  --attention-backend TRITON_ATTN \
  --no-enable-prefix-caching \
  --quantization fp8_per_channel \
  --compilation-config '{"cudagraph_mode":"FULL_DECODE_ONLY"}'

Throughput bench (vllm bench serve, random 1k-in/1k-out)

vllm bench serve --backend vllm \
  --model mm3 --tokenizer <WEIGHTS> \
  --base-url http://localhost:8000 \
  --dataset-name random --random-input-len 1024 --random-output-len 1024 \
  --ignore-eos --num-prompts <N> --max-concurrency <C>   # C=1 (N=8), C=64 (N=128)

Accuracy (gsm8k via lm_eval, full 1319 samples)

# 20-shot (server started with --max-model-len 8192):
lm_eval --model local-completions \
  --model_args "model=mm3,base_url=http://localhost:8000/v1/completions,\
tokenized_requests=False,tokenizer_backend=None,num_concurrent=200,\
timeout=5000,max_length=8192" \
  --tasks gsm8k --num_fewshot 20 --limit 1319

Test Result

| config | model GiB/GPU | GPU KV tok | conc1 tok/s (TPOT ms) | conc64 tok/s (TPOT ms) |
|---|---|---|---|---|
| bf16 (no quant) | 99.9 | 1.55M | 101.9 (9.72) | 1476 (41.8) |
| bf16 + AITER linear | 99.9 | 1.55M | 100.8 (9.76) | 1461 (42.4) |
| fp8_per_channel (PTPC) | 50.6 | 2.72M | 88.6 (11.04) | 1801 (34.14) |
| fp8_per_channel + AITER linear | 50.6 | 2.72M | 94.0 (10.53) | 1894 (32.42) |

| 20-shot, full-1319 gsm8k | flex / strict | stderr |
|---|---|---|
| bf16 (no quant) | 0.9424 / 0.9424 | ±0.0064 |
| fp8_per_channel + AITER linear | 0.9393 / 0.9393 | ±0.0066 |
| gap | 0.0031 (~0.3σ) | |
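
The "~0.3σ" figure for the accuracy gap can be reproduced with a quick calculation. The sketch below assumes the two reported standard errors are independent and combines them in quadrature; because both runs score the same 1319 questions, a paired comparison would be the more rigorous test, so treat this as a rough check only.

```python
import math

# Accuracy and standard error copied from the gsm8k results above.
acc_bf16, se_bf16 = 0.9424, 0.0064
acc_fp8, se_fp8 = 0.9393, 0.0066

gap = acc_bf16 - acc_fp8
# Assumption: independent errors, so the standard errors add in quadrature.
se_gap = math.hypot(se_bf16, se_fp8)
z = gap / se_gap

print(f"gap = {gap:.4f}, combined stderr = {se_gap:.4f}, z = {z:.2f}")
```

A gap this far below one combined standard error is consistent with sampling noise rather than a measurable accuracy loss from quantization.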

… for bf16 weights on mi300x

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
@mergify mergify Bot added the rocm Related to AMD ROCm label Jun 16, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Jun 16, 2026
@hongxiayang hongxiayang changed the title [ROCm][Quant] Fix SwiGLU-OAI fp8 MoE garbage + enable fp8_per_channel for bf16 weights on mi300x Jun 16, 2026

@tjtanaa tjtanaa left a comment

LGTM.

Comment thread vllm/platforms/rocm.py Outdated
"modelopt_mixed",
"fp8_per_tensor",
"fp8_per_block",
"fp8_per_channel", # PTPC: per-channel weight + per-token act fp8

@hongxiayang small nit: we don't need the comment.

per_act_token_quant=self.per_act_token_quant,
per_out_ch_quant=self.per_out_ch_quant,
swiglu_limit=getattr(layer, "swiglu_limit", None),
# SwiGLU-OAI alpha/beta (e.g. MiniMax-M3: 1.702/1.0). Without these

@hongxiayang nit: we can remove these comments because the codebase already knows that these arguments are needed by MiniMax-M3.

) -> FusedMoEQuantConfig:
"""
Construct a quant config for fp8 activations and fp8 weights.

@hongxiayang nit: we can remove these comments because the codebase already knows that these arguments are needed by MiniMax-M3.

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>

@tjtanaa tjtanaa left a comment

LGTM

@tjtanaa tjtanaa added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 17, 2026
mergify Bot commented Jun 17, 2026

Hi @hongxiayang, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@tjtanaa tjtanaa enabled auto-merge (squash) June 17, 2026 04:40
@tjtanaa tjtanaa merged commit e28e8c8 into vllm-project:main Jun 17, 2026
97 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Jun 17, 2026
NathanielMcVicar pushed a commit to NathanielMcVicar/vllm that referenced this pull request Jun 17, 2026
…mi300x (vllm-project#45854)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: Nathaniel McVicar <Nathaniel.McVicar@microsoft.com>
@hongxiayang hongxiayang changed the title [ROCm][Quant] Minimax-M3: Enable fp8_per_channel for bf16 weights on mi300x Jun 18, 2026
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
…mi300x (vllm-project#45854)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: vivek sharma <vivsharm@redhat.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…mi300x (vllm-project#45854)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
xuebwang-amd pushed a commit to xuebwang-amd/vllm that referenced this pull request Jun 21, 2026
…mi300x (vllm-project#45854)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…mi300x (vllm-project#45854)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
lcheng321 pushed a commit to lcheng321/vllm that referenced this pull request Jun 22, 2026
…mi300x (vllm-project#45854)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: lcheng <lcheng321@gatech.edu>
RhizoNymph added a commit to RhizoNymph/vllm that referenced this pull request Jun 22, 2026
* [Kernel][Helion][1/N] Add Helion kernel for per_token_group_fp8_quant (#36902)

Signed-off-by: Sean Chen <seachen@redhat.com>
Co-authored-by: Yanan Cao <gmagogsfm@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Bugfix] Restrict FlashInfer cuDNN FP8 ViT attention gate to Blackwell (SM 100) (#45251)

Signed-off-by: Wentian Byte <3400259131@qq.com>

* [Rust Frontend] Support continuous_usage_stats stream option (#43965)

Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* [Bugfix] Fix Anthropic tool_use content handling dropping args (#45287)

Signed-off-by: Ben Browning <bbrownin@redhat.com>

* [Model] Remove InternLMForCausalLM registry alias (#45128)

Signed-off-by: Xianbao QIAN <xianbao.qian@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

* [Bug] Fix test flashmla for DSv4 (#45052)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [Refactor] Chat Completions Harmony Refactor, non-streaming path. (#45171)

Signed-off-by: Yifan Zong <yzong@redhat.com>

* [Bugfix][KVConnector][Mooncake] Close MooncakeDistributedStore on connector teardown (#45206)

Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

* Make mistral_common optional by deferring MistralToolCall import (#45305)

Signed-off-by: Neil Schemenauer <nas@arctrix.com>

* [Bugfix] Initialize missing attributes in mistral eagle (#45217)

Signed-off-by: jpwang <jpwang@smail.nju.edu.cn>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Refactor] Chat Completions Streaming Harmony Refactor and Bugfixes (#45104)

Signed-off-by: Yifan Zong <yzong@redhat.com>

* [Bugfix] OffloadingConnector: respect skip_reading_prefix_cache flag (#44592)

Signed-off-by: Hsiao-Yuan Chen <hy.c@Hsiao-YuandeMacBook-Pro.local>
Signed-off-by: littlecircle0730 <littlecircle0730@gmail.com>
Signed-off-by: littlecircle0730 <43994952+littlecircle0730@users.noreply.github.com>
Co-authored-by: Hsiao-Yuan Chen <hy.c@Hsiao-YuandeMacBook-Pro.local>
Co-authored-by: Or Ozeri <or@ozery.com>

* [ROCm][DSv4][Perf] Flash-decode split-K decode attention kernel (#44899)

Co-authored-by: vLLM Contributor <contributor@vllm.ai>

* [Bugfix][Model] Pass revision by name in Run:ai and bitsandbytes index downloads (#45308)

Signed-off-by: Ting Sun <suntcrick@gmail.com>

* [CI][BugFix] Fix broken `test_mamba_prefix_cache.py` due to stale mock (#45345)

Signed-off-by: Nick Hill <nickhill123@gmail.com>

* [Bugfix] Fix --enable-prompt-tokens-details omitting zero cached tokens (#44383)

Signed-off-by: Sasindharan Sankar <sasindharansankar@email.com>
Co-authored-by: Sasindharan Sankar <sasindharansankar@email.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>

* [ASR] Optimize CPU preproc to get 2.5x RTFx via multi-threading (#44612)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Bugfix] Mamba CPU Offloading (#44599)

Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com>

* [ASR] Add Long Audio benchmark and correctness test (#44587)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>

* [11a/n]  Migrate Marlin kernels to torch stable ABI (#45176)

Signed-off-by: Chris Leonard <chleonar@redhat.com>

* [NIXL] Per-region KV transfer classification for mixed full-attn + MLA groups (#44583)

* [ROCm][CI] fix fp8 support for test_deepep_moe (#45302)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [Model] Add DiffusionGemma Support (#45163)

Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Martin Kukla <martin.kukla@cantab.net>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Dipika Sikka <dsikka@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Alec Kohlhoff <134344302+aleckohlhoff@users.noreply.github.com>
Co-authored-by: Porras Huang <20535584+porrashuang@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: scoootscooob <167050519+scoootscooob@users.noreply.github.com>

* [MM][Perf][CG] Support ViT full cudagraphs for mllama4 (#40660)

Signed-off-by: allgather <all2allops@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [ROCm][gpt-oss] Pass GateMode.INTERLEAVE for MXFP4 W4A16 fused MoE (#44893)

Signed-off-by: Rohan Potdar <rohan.potdar@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>

* [Bugfix] Fix Dockerfile dependency graph pre-commit error (#45374)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [CPU] Support CPU W4A16 INT4 MoE (#43409)

Signed-off-by: yuwenzho <yuwen.zhou@intel.com>

* [Rust Frontend][Bugfix] Forward --shutdown-timeout and --disable-log-stats to the managed Python engine (#45300)

Signed-off-by: Will Eaton <weaton@redhat.com>

* [XPU][DeepSeek-V4] Fix MTP: sync with upstream fixes #44821 and #43746 (#45240)

Signed-off-by: Ma Jian <jian1.ma@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [CI] ci-fetch-log.sh: fetch all failed jobs from a build URL or PR number (#45274)

Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* [Frontend]  Support strict mode for tool calling (#45003)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Bugfix][Rust Frontend] Return 400 for prompt-validation submit errors (#45286)

Signed-off-by: xiaguan <751080330@qq.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* Update hidden states extraction integration test triggers (#45294)

Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>

* Fix misleading error for audio duration limit rejection (#45113)

Signed-off-by: jperezde <jperezde@redhat.com>

* [Doc] AGENTS.md: add section about coding style (#45301)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [11b/n] Migrate Machete kernels to torch stable ABI (#45304)

Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>

* [KV Connector]: Support KV push from Prefill to Decode node using Nixl KV Connector (#35264)

Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com>

* [Model] Remove Mono-InternVL (InternLM2VEForCausalLM) (#45129)

Signed-off-by: Xianbao QIAN <xianbao.qian@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [BUGFIX][XPU] Update fa interface for compatibility (#45394)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

* [Metrics] Add group-aware KV cache capacity to vllm:cache_config_info (#42206)

The startup log already reports the correct group-aware KV cache capacity for
hybrid models, but Prometheus did not expose matching info in `vllm:cache_config_info`.

This PR adds kv_cache_size_tokens and kv_cache_max_concurrency.

Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>

* [V1][Metrics] Add MLA attention metrics for DeepSeek MFU estimation (#39457)

Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>

* [Bug] Migrate Reset cache for both v2 and v1 model runner (#42759)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Core] Support structured outputs for beam search (#35022)

Signed-off-by: Guan-Ming (Wesley) Chiu <guanmingchiu@gmail.com>
Signed-off-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>

* [Core][KV Connector] fix scheduler KV connector stats aggregation (#43877)

Fixes scheduler-side KV connector stats collection so that:

1. update_connector_output() runs before scheduler-side stats are collected.
2. worker-side and scheduler-side KV connector stats are aggregated when both are present.
3. scheduler-only KV connector stats are still emitted when no worker-side stats exist.

Signed-off-by: srinivas_oo7 <sklinkedin0120@gmail.com>
Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com>

* [Frontend] Support strict mode for tool calling with ResponsesAPI (#45396)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Docs][KV Connector][NIXL] document KV Transfer stat logging and Prometheus metrics (#44055)

Signed-off-by: Sai Sridhar <tarrasridhar1154@gmail.com>

* [Rust Frontend] Add standalone `granite4` tool parser (#45216)

Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Model] Add encoder CUDA graph support to Lfm2VL (#44930)

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>

* [Kernel][Helion][1/N] Add Helion kernel for dynamic_per_token_scaled_fp8_quant (#33790)

Signed-off-by: Sean Chen <seachen@redhat.com>
Co-authored-by: Yanan Cao <gmagogsfm@gmail.com>

* [Model][Dflash] Enable Dflash support for Qwen3NextForCausalLM targets (#45319)

Signed-off-by: Jonas I. Liechti <j-i-l@t4d.ch>

* [Migration] Migrate GGUF quantization support to plugin (#39612)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Perf] Use native DSA indexer decode path for next_n > 2 on SM100 (#45322)

Signed-off-by: zixi-qi <zixi@inferact.ai>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>

* [Core][AMD] Propagate shutdown timeout to MultiprocExecutor (#43154)

Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* [Refactor] Deprecate ResponsesParser wrapper, inline parsing into ParsableContext (#45431)

Signed-off-by: sfeng33 <4florafeng@gmail.com>

* [ROCm] Bump Torch to 2.11 (#45362)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

* [Attention] Improve attention benchmarks: configs and profiling (#39336)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>

* [Model Runner v2] Migration from v1 to v2, with Qwen and DSv2 MOE models [3/N] (#42667)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Kernel] Consolidate Marlin thread-tile padding across all dense Marlin paths (#45295)

Signed-off-by: mgoin <mgoin64@gmail.com>

* Add the QuantizedActivation linear-kernel contract (#44260)

Signed-off-by: mgoin <mgoin64@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

* [ROCm][DSV4][Perf] Fuse inverse-RoPE and cache bf16 wo_a in o-projection (#45103)

Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

* [Bugfix][CPU] Don't build triton-cpu on arm64 release image (#45401)

Signed-off-by: khluu <khluu000@gmail.com>

* [BugFix] Avoid prematurely freeing cached mm encoder outputs (#45347)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Nick Hill <nickhill123@gmail.com>

* [Bugfix] Set type/role explicitly in streaming message_start event (#45376)

Signed-off-by: Wayne Chiu <waynehacking8@gmail.com>

* [Bugfix] Replace deprecated Qwen2VLImageProcessorFast with Qwen2VLImageProcessor (#42700)

Signed-off-by: abinggo <107740309+abinggo@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Roger Wang <hey@rogerw.io>

* [CI] Wait for SSL cert refresher events in the test (#45489)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

* [Render] Add `/derender` endpoints for disaggregated postprocessing (#43606)

Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [Bugfix] Return the tokenizer from maybe_make_thread_pool so it survives pickling (#45460)

Signed-off-by: Wayne Chiu <waynehacking8@gmail.com>

* [Doc] Fix uv dependency resolution failure for setuptools during CPU source builds (x86 & ARM) (#45412)

Signed-off-by: midas <the.anon.github@gmail.com>

* [Model Runner V2] Fix `openai.InternalServerError: Error code: 500 - 'list index out of range'` (#45467)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* Treat null completion max_tokens like the default (#45491)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

* [CI Bug] Fix `ValueError: There is no module or parameter named 'model.vision_tower.vision_model'` (#45478)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [Security] Add timeout guard for regex compilation in structured outp… (#45118)

Signed-off-by: jperezde <jperezde@redhat.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Security] Fix DoS via prompt_embeds on M-RoPE models (#45252)

Signed-off-by: jperezde <jperezde@redhat.com>

* Fix docs build on `main` (#45536)

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* [Bugfix] Reject structured outputs for diffusion decoders with a clear error (#45468)

Signed-off-by: Wayne Chiu <waynehacking8@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

* [Perf] SM90 cutlass fp8 mm supports odd M by swap_ab, 180~290% kernel performance improvement (#44572)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Core] Simplify MRV2 async output handling (#45442)

* [Bugfix] nightly Docker images crash with ImportError: AnthropicOutputConfig since May 28 (#44795)

Signed-off-by: achyuthan.s <113010327+Achyuthan-S@users.noreply.github.com>
Signed-off-by: Achyuthan S <achyuthan.sivasankar@gmail.com>
Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>

* [Build] Fix CUDA arch build coverage gaps (#45277)

Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: Xin Li <xinli-sw@users.noreply.github.com>
Co-authored-by: ShawRong <ShawRong@users.noreply.github.com>
Co-authored-by: Change72 <Change72@users.noreply.github.com>

* [V1][Spec Decode] Add Dynamic SD (#32374)

Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>

* [Bugfix][DCP] Fix illegal memory access in DCP a2a decode under full CUDA graphs (#45487)

* [XPU] Support int4 group_size=32 W4A16 MoE (#45136)

Signed-off-by: Marceli Fylcek <marceli.fylcek@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

* [ROCm][Perf] Enable W4A16 FlyDSL MoE (#44400)

Signed-off-by: amd-asalykov <asalykov@amd.com>
Signed-off-by: Amanzhol Salykov <asalykov@amd.com>

* [Perf] Use bisect for mm feature lookup in model runner v2 (#45566)

Signed-off-by: Roger Wang <hey@rogerw.io>

* [BugFix] Fix prompt_embeds for multimodal models (#45383)

Signed-off-by: ruinan ma <r7ma3088@gmail.com>

* Added real  /v1/embeddings support for messages + chat_template_kw  (#45173)

Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>

* [Bugfix][Model] Validate runai_streamer model_loader_extra_config (#45291)

Signed-off-by: Ting Sun <suntcrick@gmail.com>

* [Bugfix] Stream Llama4 weight loading to avoid host-OOM with copy-returning loaders (#44645)

Signed-off-by: Noa Neria <nneria@nvidia.com>

* [XPU] Enable sequence parallel support for XPU (#38608)

Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com>
Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com>

* [Bugfix][CPU] Honor cgroup memory limit when computing KV cache size (#45086)

Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>

* [CPU] Refine CPU attention frontend (#45391)

Signed-off-by: jiang1.li <jiang1.li@intel.com>

* [Bugfix][CI] Update Dockerfile dependency graph PNG (#45602)

Signed-off-by: sfeng33 <4florafeng@gmail.com>

* [Frontend] Add Streaming Parser Engine and new Qwen3 Parser (#45413)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>

* Fix included router missing path for `FastAPI >=0.137` (#45629)

Signed-off-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>

* [Bugfix][V1] Split V2 model-runner attention groups on num_heads_q (#45564)

Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>

* [Model] Remove XverseForCausalLM (#45638)

Signed-off-by: Xianbao QIAN <xianbao.qian@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

* [Feature][Frontend] Report multimodal token counts in usage.prompt_tokens_details (#45458)

Signed-off-by: Ting Sun <suntcrick@gmail.com>

* [Bugfix] Reject out-of-range temperature values in SamplingParams (#44965)

Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>

* [Bugfix][Rust] Sync EngineCoreReadyResponse with the Python dataclass (#45557)

Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Will Eaton <weaton@redhat.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* [Rust Frontend] Add external→internal request-id map for abort() (#45137)

Signed-off-by: Sahil Singh <sahiilsiingh37@gmail.com>

* [Models] Fix MiMo v2.x QKV TP sharding + FP4 support (#45200)

Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Rust Frontend] Support `parallel_tool_calls = false` (#44760)

Signed-off-by: zhoujinyu <2319109590@qq.com>

* [Bugfix][Rust Frontend] Make metrics respect --served-model-name (#45465)

Signed-off-by: reidliu41 <reid201711@gmail.com>

* [XPU] skip UT test_with_ngram_gpu_spec_decoding (#44423)

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>

* [ROCm][Doc] Add installation notes about python version requirement (#45671)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* [Docs] Update the online serving docs. (#45676)

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

* [Bugfix] Unset HF's default max_new_tokens for DiffusionGemma (#45417)

Signed-off-by: Martin Kukla <martin.kukla@cantab.net>

* (security) Enforce audio upload size limit before full file materialization (#45510)

Signed-off-by: jperezde <jperezde@redhat.com>

* Fix the E8M0 scale computation in the MXFP4 (W4A4) MOE CUTLASS kernel (#43557)

Signed-off-by: Xin He <xin3.he@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

* Remove redundant Triton KV cache dtype asserts and enforce architectural support (fp8 >= sm89) (#43914)

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Co-authored-by: Michael Gschwind <mgschwind@nvidia.com>

* [Bugfix] Two-phase KV allocation for cross-group prefix cache hits (supersedes #33775) (#44409)

Signed-off-by: Saddss <2872669061@qq.com>

* [Chore] Consolidate reasoning/tool parser attributes into unified Parser in chat serving (#45548)

Signed-off-by: sfeng33 <4florafeng@gmail.com>

* [AMD][Bugfix][Quantization] Honor fused-name match in is_layer_skipped (#43981)

* [Model] Add MiniMax M3 support (#45381)

Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai>

* [KV Offloading] Implement `reset_cache` for `TieringOffloadingManager` (#44541)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Bugfix] Chat Completions Harmony Refactor Clean up (#45464)

Signed-off-by: Yifan Zong <yzong@redhat.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>

* [Perf] Optimize DSv4 prefill chunk planning, 4.0% E2E Throughput Improvement (#45061)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [Frontend] Skip structural tags for auto tool_choice without strict mode (#45600)

Signed-off-by: sfeng33 <4florafeng@gmail.com>

* [Model Runner V2][Bugfix] Fix MRV2 LoRA warmup (#35536)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>

* Fix parallel_tool_calls: null treated as false instead of default true (#44955)

Signed-off-by: factnn <166481866+factnn@users.noreply.github.com>

* [Frontend] Replace legacy Gemma4 parsers with engine-based implementation (#45588)

Signed-off-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>

* [Bugfix] Defer block freeing until in-flight steps finish under async scheduling + PD KV consumer (#45357)

Signed-off-by: llx-08 <2596671364@qq.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>

* nixl_ep: Skip post-receive quantization for NVFP4 (#45606)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

* [EP] Query NIXL EP top-k index dtype (#45298)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

* [EP] Enable DBO with NIXL EP (#45275)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>

* [DSV4][Minor] Fix supported KV cache dtypes (#44892)

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>

* [Misc][Model] add io processor for query/document embeddings from ColBERT (jinaai/jina-colbert-v2) (#45210)

Signed-off-by: thomas <thomas.varghese@columbia.edu>

* [Rust Frontend] Support `max_logprobs` validation (#45674)

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* [Rust Frontend] Lower out-of-vocab validation to `text` layer (#45685)

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* [Multimodal] Add Qwen3-VL video loader (#44412)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [BugFix] Support async scheduling with prompt embeds for multimodal models (#45673)

Signed-off-by: Ruinan Ma <r7ma3088@gmail.com>

* [XPU] Fix Triton attn fp8/bf16 check failing (#45758)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* [Bugfix][Gemma4] Fix offline parser truncation, adjust_request token leak, and chat template sync (#45553)

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>

* [Rust Frontend] Require `ModelConfig.vocab_size` to be present (#45696)

Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* [Frontend] [Parser] Migrate Nemotron V3 to streaming parser engine (#45755)

Signed-off-by: Ben Browning <bbrownin@redhat.com>

* [Core] Use fastsafetensors ParallelLoader for weight loading (#40183)

Signed-off-by: Git Bisector <gitbisector@gmail.com>
Signed-off-by: gitbisector <gitbisector@gmail.com>
Signed-off-by: git bisector <gitbisector@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

* Register parsed config classes before tokenizer init (#40299)

Signed-off-by: Bortlesboat <bortstheboat@gmail.com>
Co-authored-by: OpenAI Codex <codex@openai.com>

* [Misc] Added validation for Cohere /v2/embed input field exclusivity (#45640)

Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>

* [Cleanup] Remove dead env (#45777)

Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>

* [Bug Fix] Allow pinned memory for WSL2 (#41496)

Signed-off-by: Jimmy Lee <hirejimmylee@gmail.com>

* [CPU] Support Gemma Diffusion (#45690)

Signed-off-by: jiang1.li <jiang1.li@intel.com>

* [Bugfix] Prevent cuMemcpyBatchAsync segfault with MTP and KV offloading (#44784)

Signed-off-by: joshua <joshua.abraham@multicorewareinc.com>
Co-authored-by: joshua <joshua.abraham@multicorewareinc.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>

* [Frontend] Remove AsyncMicrobatchTokenizer. (#45759)

Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>

* [Bugfix] Fix trtllm fused allreduce+rms_norm for transformers backend (#45307)

Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>

* [XPU][CI] add intel xpu cases for nightly CI (#44372)

Signed-off-by: wenjun.liu <wenjun.liu@intel.com>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
Co-authored-by: zengxian <xiangdong.zeng@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

* [Misc] Clean up useless test (#45792)

Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>

* Add Triton recompile detection (#45631)

Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>

* [MM][Perf][CG] Support dual-path ViT full CUDA graph for DeepSeek-OCR (#43586)

Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [KV Connector][Mooncake] Pipeline-parallel support for PD-disaggregated serving with Mooncake connector (#44528)

Signed-off-by: hanhan.hank <hanhan.hank@bytedance.com>
Signed-off-by: Hank Han <hanhan7630@outlook.com>

* [Refactor] Remove `Fp8OnlineLinearMethod` as scheduled (#45463)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [ZenCPU] Add zencpu Platform Runtime Logging and Docs (#42726)

Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>

* [ROCm][CI] Gate incompatible HF references on Transformers v5 (#41532)

Signed-off-by: Andreas Karatzas <akaratza@amd.com>

* [Quant] Support modelopt_mixed on Ampere (SM80/SM86) (#45306)

Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>

* [Bugfix][MoE] Restore routed output unpadding before shared expert add (#45707)

Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Perf] Add VLLM_TRITON_FORCE_FIRST_CONFIG to skip Triton autotuning (#42425)

Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
Co-authored-by: Claude <noreply@anthropic.com>

* [CI] Fix attention benchmark smoke test (#45728)

Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Claude <noreply@anthropic.com>

* [Rust Frontend] Add CORS support (#45753)

Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>

* [Bugfix] Fix FlashMLA sparse accuracy with topk_length and zero-init padding (#36616)

Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>

* [Kernel][Helion][1/N] Add Helion kernel for rms_norm_per_block_quant (#36895)

Signed-off-by: Sean Chen <seachen@redhat.com>
Co-authored-by: Yanan Cao <gmagogsfm@gmail.com>

* feat: MLA prefill enable FA4 fp8 output (#43050)

Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com>

* [ROCm][Cleanup] Remove stale AITER FA hybrid KV-cache TODO (#44178)

Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>

* [Model] Add HrmTextForCausalLM (Hierarchical Reasoning Model — Text) (#43098)

Signed-off-by: Wuyifei <wuyifei@me.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>

* Upgrade tpu-inference to v0.22.1 (#45793)

* [ROCm][CI] Patch conftest to resolve occasional OOMs (#45722)

Signed-off-by: Micah Williamson <micah.williamson@amd.com>

* [Model Runner V2] Enable GraniteMOE for MRv2 by default (#45461)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [Model] Remove Dots1ForCausalLM (#45637)

Signed-off-by: Xianbao QIAN <xianbao.qian@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>

* [Bugfix][Core] Fall back when numactl --membind is blocked in constrained containers (#45438)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>

* [KVConnector][MoRIIO] Allow overriding the advertised host IP (#45488)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [KV Connector][Mooncake] Add cache_prefix to namespace store keys (#45767)

Signed-off-by: Dao Le <Dao007forever@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [Frontend] Add Streaming Parser Engine and new MinimaxM2 Parser (#45701)

Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>

* [Bugfix] Fix Qwen3 prompt tool-call reasoning false positive (#45763)

Signed-off-by: Alex Bilichenko <alexbi29@users.noreply.github.com>
Co-authored-by: Alex Bilichenko <alexbi29@users.noreply.github.com>

* [PERF] Fuse multi-group block table staged writes (#44944)

Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>

* [ROCm][Quant] mxfp8 moe/linear gfx950 tuning for MiniMax-M3 (#45725)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>

* [Misc] Update Mergify tool-calling label (#45853)

Signed-off-by: sfeng33 <4florafeng@gmail.com>

* [Core] Add prefill step cadence for better non-PD DP balancing (#44558)

Signed-off-by: Nick Hill <nickhill123@gmail.com>

* [ROCm][CI] fix multimodel run cmds (#45858)

Signed-off-by: Divakar Verma <divakar.verma@amd.com>

* [Bugfix] Gemma4: skip forced JSON for required/named tool choice (#45795)

Signed-off-by: Federico Iezzi <fiezzi@google.com>

* [Kernel] Support GLM-5 dimensions for TRT-LLM ragged MLA prefill (#43525)

Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>

* Apply LRU policy only to proper cache entries (#42656)

Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>

* [Kernel] Support DS Mamba tail copy for MTP align mode (#45473)

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com>

* [XPU][CI] fix server test file path (#45870)

Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>

* [Bugfix] Fix MoE model load OOM in FlashInfer_TRTLLM backend with sleep mode (#45589)

Signed-off-by: Dakai An <dakaian108@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Bugfix][Gemma4] Fix parsing when thinking is disabled (#45832)

Signed-off-by: Federico Iezzi <fiezzi@google.com>

* [CI] Run pre-commit on self-hosted vllm-runners (#45865)

Signed-off-by: khluu <khluu000@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [XPU] Fix test_spec_decode_logprobs: use FLASH_ATTN for XPU in GPU_DETERMINISM_KWARGS (#44468)

Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

* [Bugfix][ROCm] Fix MiniMax-M3 FP8 KV cache dtype (#45720)

Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Signed-off-by: Cameron Quilici <cjquilici@gmail.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>

* [Bugfix][ROCm] Fix FP8 per-tensor scale rank mismatch causing Inductor assertion failure (#44912)

Signed-off-by: nehmathe2 <nehmathe2@gmail.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: nehmathe <nehmathe@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: Divakar Verma <divakar.verma@amd.com>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>

* [ModelRunnerV2] Various model/config compatibility fixes (#45868)

Signed-off-by: Nick Hill <nickhill123@gmail.com>

* [Bugfix][V1] Clean up compiled-model bytecode hooks on VllmRunner exit (#45195)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [FlexAttention] make custom mask mods fully cudagraphable (#45232)

Signed-off-by: Angel Li <liangel@meta.com>

* [M3] Tune Triton indexer score decode for spec-decode (#45743)

Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [CI][NIXL] Pin NIXL to 1.2.0 (#45843)

Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: Itay Alroy <75032521+itayalroy@users.noreply.github.com>
Co-authored-by: ovidiusm <ovidium@nvidia.com>

* [M3] Enable FP8 sparse GQA (#45744)

Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>

* [Bugfix][Quantization] Reject unsupported compressed tensors KV cache schemes (#45312)

Signed-off-by: Ting Sun <suntcrick@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [BugFix][CI] Fix scheduler plugin test (#45897)

Signed-off-by: Nick Hill <nickhill123@gmail.com>

* [Rust Frontend] Support prompt-only completions (#44938)

Signed-off-by: reidliu41 <reid201711@gmail.com>

* [Rust Frontend] Add /abort_requests endpoint (#44382)

Signed-off-by: Sahil Singh <sahiilsiingh37@gmail.com>

* [Rust Frontend] Add serde defaults for omit_defaults fields in `EngineCoreSamplingParams` (#45848)

Signed-off-by: Will Eaton <weaton@redhat.com>

* [Kernel] Add weightless RMSNorm CUDA kernels for has_weight=False (#41430) (#44109)

Signed-off-by: hello-args <args.sarkar@gmail.com>

* [Misc] Validate Cohere Embed Mixed Content Payloads (#45873)

Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>

* [Rust Frontend] Support hybrid/external DP LB in Python supervised bootstrap (#45805)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* [KV Connector][Offloading] Avoid blocking the engine to flush offloads on idle (#45595)

Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: Itay Etelis <Itay.etelis@gmail.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: Itay Etelis <Itay.etelis@gmail.com>

* [Bugfix] Fixes MiniCPM-O resampler device placement to avoid tensor device mismatch (#42332)

Signed-off-by: j9smith <j.smith9103@outlook.com>

* [Bugfix][Gemma4] Pre-initialise streaming reasoning state when prompt ends inside an open `<|channel>` (fixes #45834) (#45852)

Signed-off-by: nikhilesh-csa <nchhetri@csa1.com>

* [Bugfix][test] Use Salesforce/wikitext for ppl tests (#45913)

Co-authored-by: wentian-byte <192079369+wentian-byte@users.noreply.github.com>

* fix(security): enforce audio decode duration limit in chat completions path (#45908)

Signed-off-by: jperezde <jperezde@redhat.com>

* [ROCm][Bugfix]: Fallback GFX942 sparse MLA ops to Triton (#45782)

Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>

* docs, kv_offloading: add docs for selective offload (#45279)

Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com>

* [ROCm][Quant] Minimax-M3: Enable fp8_per_channel for bf16 weights on mi300x (#45854)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>

* [MM][Perf][CG] Support ViT full CUDA graph for Kimi-VL (#41992)

Signed-off-by: oguz <oguzhankir17@gmail.com>

* [CI/Build] Avoid duplicate ViT CG test introduced by accident (#45654)

Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>

* [XPU] Fix test_logprobs_e2e import error: pin lm-eval[api]>=0.4.12 (#44469)

Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com>

* [quant][autoround] Refactor INC quantization into package with INCScheme orchestrator (#40601)

Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: Zhenzhong1 <zhenzhong.xu@intel.com>
Signed-off-by: Zhenzhong Xu <zhenzhong.xu@intel.com>
Co-authored-by: n1ck-guo <heng.guo@intel.com>
Co-authored-by: Zhenzhong1 <zhenzhong.xu@intel.com>

* [ROCm][AITER][Quark] Tag per-channel FP8 weights as PER_CHANNEL so AITER pre-shuffled GEMM is selected (#44626)

Signed-off-by: Xavier Aguilar <xavier.aguilarfruto@amd.com>

* Feature: Enable Flashinfer non-gated MoE bf16 (#43853)

Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>

* [DSv4 Perf] DSv4 flashinfer sparse index cache for metadata, 2%~4% TTFT improvement (#45863)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [Kernel][Helion][1/N] Add Helion kernel for rms_norm_dynamic_per_token_quant (#34432)

Signed-off-by: Sean Chen <seachen@redhat.com>
Co-authored-by: Yanan Cao <gmagogsfm@gmail.com>

* [Bugfix][PD] Fix DSV4 disaggregated serving (#45831)

Signed-off-by: ZhanqiuHu <zhu@redhat.com>

* [Bugfix] Pass TP group to FlashInfer all-reduce fusion (#45917)

Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>

* [Log] Update deepgemm log (#45857)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [DSV4 Perf] Optimize dsv4 cudagraph by reducing `eager_break_during_capture`, 26.8% ~ 27.9% E2E TTFT improvement (#45309)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [feature] MiniMax-M3-MXFP4 support added (#45896)

Signed-off-by: Qiang Li <qiang.li2@amd.com>

* [Bugfix] MiniMax-M3 (AMD): add packed_modules_mapping and pass swiglu… (#45794)

Signed-off-by: wangjiaxin99 <jiaxwang@amd.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>

* [Refactor] Remove dead quantization code and tests (#45454)

Signed-off-by: yewentao256 <zhyanwentao@126.com>

* [Bugfix][Gemma4] Render reasoning on assistant turns without tool_calls (#45867)

Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>

* [Bugfix][Model] Validate DefaultModelLoader / LoadConfig and fail with clear errors (#45196)

Signed-off-by: Ting Sun <suntcrick@gmail.com>

* [BUG] fix hidden states nan for hybrid attention models (#45849)

Signed-off-by: shanjiaz <hezhao@redhat.com>
Co-authored-by: shanjiaz <hezhao@redhat.com>

* [Bugfix] Fix NixlConnector handshake block_len validation for GQA-replicated KV heads (#45879)

Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
Co-authored-by: waynehacking8 <waynehacking8@gmail.com>

* Revert "[DSV4 Perf] Optimize dsv4 cudagraph by reducing `eager_break_during_capture`" (#45309) (#45972)

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* [XPU][CI] add model runner v2 into CI (#44650)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>

* [CI/Build][Bugfix] Fix SD LoRA (#45941)

Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>

* [Bugfix] Complete one-shot fused all-reduce PDL at end to avoid NaN (#45448)

* [Rust Frontend][Perf] O(n) argument scan in tool parser (#45826)

Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>

* [XPU] Fix FP8 block-scaled scheme selection on non-CUDA platforms (#43958)

Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>

* [Rust Frontend] Validate tokenized bad_words vocabulary range (#45876)

Signed-off-by: reidliu41 <reid201711@gmail.com>

* [CPUOffloading] Guard CPU eviction check (#45757)

Signed-off-by: Varun Sundar Rabindranath <varun-sundar-rabindranath@h100-01.nemg-001.lab.rdu2.dc.redhat.com>
Co-authored-by: Varun Sundar Rabindranath <varun-sundar-rabindranath@h100-01.nemg-001.lab.rdu2.dc.redhat.com>

* [SimpleCPUOffloadConnector]: Add support for reset_cache() (#39726)

Signed-off-by: Jonathan Chen <chenleejonathan@gmail.com>
Signed-off-by: Jonathan <chenleejonathan@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* [Kernel] Add PDL support for DeepGEMM kernel (#42996)

Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>

* [Fix][KV offload] Defer `on_request_finished` until in-flight transfers drain (#45823)

Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>

* [Refactor] Remove dead cutlass mxfp8 code (#44681)

Signed-off-by: yewentao256 <zhyanwentao@126.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>

* [KV Offloading] Remove dummy worker-side stats from OffloadingConnector (#45905)

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: AlexHuang <jihuihuang@alexai.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>

* [Test][KV Connector] Add request_finished fence population tests for offloading scheduler (#45679)

Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: AlexHuang <jihuihuang@future.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>

* Revert "[Kernel] Add PDL support for DeepGEMM kernel" (#45999)

* [XPU] Update nixl to v0.10.1 in Dockerfile (#40287)

Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>

* fix(layernorm): route weightless RMSNorm to native impl

The vllm_c rms_norm/fused_add_rms_norm guards claimed support for
weight=None, but torch.ops._C.rms_norm cannot take a None/undefined weight
(fails with 'Not yet supported ScalarType'). Weightless norms (e.g. Gemma4
v_norm, has_weight=False) now correctly fall back to the native impl.
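
The fallback described above can be sketched roughly as follows. This is an illustration only, not the vLLM code: `native_rms_norm`, `fused_rms_norm`, and `rms_norm` are hypothetical names, plain Python lists stand in for tensors, and `fused_rms_norm` only imitates a compiled kernel that rejects a missing weight.

```python
import math
from typing import Optional, Sequence


def native_rms_norm(
    x: Sequence[float],
    weight: Optional[Sequence[float]] = None,
    eps: float = 1e-6,
) -> list[float]:
    """Reference RMSNorm: x / sqrt(mean(x^2) + eps), optionally scaled."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    normed = [v * inv_rms for v in x]
    if weight is None:
        # Weightless norm: no per-element scale is applied.
        return normed
    return [n * w for n, w in zip(normed, weight)]


def fused_rms_norm(
    x: Sequence[float],
    weight: Optional[Sequence[float]],
    eps: float = 1e-6,
) -> list[float]:
    """Hypothetical stand-in for a compiled kernel that needs a weight."""
    if weight is None:
        raise TypeError("fused kernel requires a weight")
    return native_rms_norm(x, weight, eps)


def rms_norm(
    x: Sequence[float],
    weight: Optional[Sequence[float]] = None,
    eps: float = 1e-6,
) -> list[float]:
    """Dispatch: weightless inputs go to the native path, others to the kernel."""
    if weight is None:
        return native_rms_norm(x, None, eps)
    return fused_rms_norm(x, weight, eps)
```

The point of the guard is that the decision is made before the kernel is called, so a weightless norm never reaches a code path that cannot represent a missing weight.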

* test(steering): retarget key-coercion test at coerce_steering_spec

The SetSteeringRequest.vectors field is intentionally dict[str, Any] (to
admit the packed wire form), so the model does not coerce inner layer keys;
coerce_steering_spec does. Test the actual coercion seam (which had no
direct coverage) instead of obsolete model-level behavior.

* fix(capture): skip broken consumer entry points instead of crashing

A single third-party capture-consumer plugin that fails to import (e.g.
one referencing a module not present in this build) previously crashed
_load_entry_points and took down all capture admission. Skip it with a
warning so other consumers keep working.

---------

Signed-off-by: Sean Chen <seachen@redhat.com>
Signed-off-by: Wentian Byte <3400259131@qq.com>
Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com>
Signed-off-by: Bugen Zhao <i@bugenzhao.com>
Signed-off-by: Ben Browning <bbrownin@redhat.com>
Signed-off-by: Xianbao QIAN <xianbao.qian@gmail.com>
Signed-off-by: yewentao256 <zhyanwentao@126.com>
Signed-off-by: Yifan Zong <yzong@redhat.com>
Signed-off-by: Dao Le <Dao007forever@gmail.com>
Signed-off-by: Neil Schemenauer <nas@arctrix.com>
Signed-off-by: jpwang <jpwang@smail.nju.edu.cn>
Signed-off-by: Hsiao-Yuan Chen <hy.c@Hsiao-YuandeMacBook-Pro.local>
Signed-off-by: littlecircle0730 <littlecircle0730@gmail.com>
Signed-off-by: littlecircle0730 <43994952+littlecircle0730@users.noreply.github.com>
Signed-off-by: Ting Sun <suntcrick@gmail.com>
Signed-off-by: Nick Hill <nickhill123@gmail.com>
Signed-off-by: Sasindharan Sankar <sasindharansankar@email.com>
Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: varun sundar rabindranath <vsundarr@redhat.com>
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Signed-off-by: Divakar Verma <divakar.verma@amd.com>
Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com>
Signed-off-by: Matthew Bonanni <mbonanni@redhat.com>
Signed-off-by: allgather <all2allops@gmail.com>
Signed-off-by: Rohan Potdar <rohan.potdar@amd.com>
Signed-off-by: Rohan138 <rohanpotdar138@gmail.com>
Signed-off-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Signed-off-by: yuwenzho <yuwen.zhou@intel.com>
Signed-off-by: Will Eaton <weaton@redhat.com>
Signed-off-by: Ma Jian <jian1.ma@intel.com>
Signed-off-by: mgoin <mgoin64@gmail.com>
Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com>
Signed-off-by: xiaguan <751080330@qq.com>
Signed-off-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
Signed-off-by: jperezde <jperezde@redhat.com>
Signed-off-by: Thomas Parnell <tpa@zurich.ibm.com>
Signed-off-by: Shengqi Chen <harry-chen@outlook.com>
Signed-off-by: Sunita Nadampalli <nadampal@amazon.com>
Signed-off-by: NickLucche <nlucches@redhat.com>
Signed-off-by: zhenwei-intel <zhenwei.liu@intel.com>
Signed-off-by: Kunshang Ji <kunshang.ji@intel.com>
Signed-off-by: Ethan Feng <ethan.fengch@gmail.com>
Signed-off-by: Thillai Chithambaram <thillaichithambaram.a@gmail.com>
Signed-off-by: Guan-Ming (Wesley) Chiu <guanmingchiu@gmail.com>
Signed-off-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>
Signed-off-by: srinivas_oo7 <sklinkedin0120@gmail.com>
Signed-off-by: Sai Sridhar <tarrasridhar1154@gmail.com>
Signed-off-by: Tahsin Tunan <tahsintunan@gmail.com>
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Signed-off-by: Jonas I. Liechti <j-i-l@t4d.ch>
Signed-off-by: zixi-qi <zixi@inferact.ai>
Signed-off-by: Ryan Rock <ryan.rock@amd.com>
Signed-off-by: sfeng33 <4florafeng@gmail.com>
Signed-off-by: Micah Williamson <micah.williamson@amd.com>
Signed-off-by: Fangzhou Ai <fangzhouai@gmail.com>
Signed-off-by: khluu <khluu000@gmail.com>
Signed-off-by: Roger Wang <hey@rogerw.io>
Signed-off-by: Wayne Chiu <waynehacking8@gmail.com>
Signed-off-by: abinggo <107740309+abinggo@users.noreply.github.com>
Signed-off-by: Andreas Karatzas <akaratza@amd.com>
Signed-off-by: Martin Hickey <martin.hickey@ie.ibm.com>
Signed-off-by: midas <the.anon.github@gmail.com>
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Signed-off-by: achyuthan.s <113010327+Achyuthan-S@users.noreply.github.com>
Signed-off-by: Achyuthan S <achyuthan.sivasankar@gmail.com>
Signed-off-by: Achyuthan Sivasankar <achyuthan.sivasankar@gmail.com>
Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com>
Signed-off-by: Marceli Fylcek <marceli.fylcek@intel.com>
Signed-off-by: amd-asalykov <asalykov@amd.com>
Signed-off-by: Amanzhol Salykov <asalykov@amd.com>
Signed-off-by: ruinan ma <r7ma3088@gmail.com>
Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Signed-off-by: Noa Neria <nneria@nvidia.com>
Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com>
Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com>
Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com>
Signed-off-by: baoloongmao <baoloongmao@tencent.com>
Signed-off-by: jiang1.li <jiang1.li@intel.com>
Signed-off-by: Peter Pan <Peter.Pan@daocloud.io>
Signed-off-by: Sahil Singh <sahiilsiingh37@gmail.com>
Signed-off-by: Giancarlo Delfin <gdelfin@inferact.ai>
Signed-off-by: zhoujinyu <2319109590@qq.com>
Signed-off-by: reidliu41 <reid201711@gmail.com>
Signed-off-by: Lai, Yejing <yejing.lai@intel.com>
Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com>
Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io>
Signed-off-by: Martin Kukla <martin.kukla@cantab.net>
Signed-off-by: Xin He <xin3.he@intel.com>
Signed-off-by: Mike G <180722391+mikekg@users.noreply.github.com>
Signed-off-by: Saddss <2872669061@qq.com>
Signed-off-by: youkaichao <youkaichao@gmail.com>
Signed-off-by: Jee Jee Li <pandaleefree@gmail.com>
Signed-off-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Signed-off-by: Yongye Zhu <zyy1102000@gmail.com>
Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai>
Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
Signed-off-by: factnn <166481866+factnn@users.noreply.github.com>
Signed-off-by: llx-08 <2596671364@qq.com>
Signed-off-by: Itay Alroy <ialroy@nvidia.com>
Signed-off-by: thomas <thomas.varghese@columbia.edu>
Signed-off-by: Ruinan Ma <r7ma3088@gmail.com>
Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Signed-off-by: Git Bisector <gitbisector@gmail.com>
Signed-off-by: gitbisector <gitbisector@gmail.com>
Signed-off-by: git bisector <gitbisector@gmail.com>
Signed-off-by: Bortlesboat <bortstheboat@gmail.com>
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Signed-off-by: Jimmy Lee <hirejimmylee@gmail.com>
Signed-off-by: joshua <joshua.abraham@multicorewareinc.com>
Signed-off-by: wenjun.liu <wenjun.liu@intel.com>
Signed-off-by: zengxian <xiangdong.zeng@intel.com>
Signed-off-by: wangxiyuan <wangxiyuan1007@gmail.com>
Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg>
Signed-off-by: shen-shanshan <467638484@qq.com>
Signed-off-by: hanhan.hank <hanhan.hank@bytedance.com>
Signed-off-by: Hank Han <hanhan7630@outlook.com>
Signed-off-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Signed-off-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Signed-off-by: Francesco Fusco <ffu@zurich.ibm.com>
Signed-off-by: AjAnubolu <anuboluajay@gmail.com>
Signed-off-by: Carl You <4531192+carlyou@users.noreply.github.com>
Signed-off-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Signed-off-by: Wuyifei <wuyifei@me.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Signed-off-by: Alex Bilichenko <alexbi29@users.noreply.github.com>
Signed-off-by: jesse <szxfml@gmail.com>
Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: Federico Iezzi <fiezzi@google.com>
Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Signed-off-by: Stanislaw Wozniak <stw@zurich.ibm.com>
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
Signed-off-by: Dakai An <dakaian108@gmail.com>
Signed-off-by: Cam Quilici <cjquilici@gmail.com>
Signed-off-by: Cameron Quilici <cjquilici@gmail.com>
Signed-off-by: nehmathe2 <nehmathe2@gmail.com>
Signed-off-by: nehmathe <nehmathe@amd.com>
Signed-off-by: Angel Li <liangel@meta.com>
Signed-off-by: Itay Alroy <75032521+itayalroy@users.noreply.github.com>
Signed-off-by: hello-args <args.sarkar@gmail.com>
Signed-off-by: Itay Etelis <itay.etelis@ibm.com>
Signed-off-by: Or Ozeri <oro@il.ibm.com>
Signed-off-by: Itay Etelis <Itay.etelis@gmail.com>
Signed-off-by: j9smith <j.smith9103@outlook.com>
Signed-off-by: nikhilesh-csa <nchhetri@csa1.com>
Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Signed-off-by: oguz <oguzhankir17@gmail.com>
Signed-off-by: yiliu30 <yi4.liu@intel.com>
Signed-off-by: Zhenzhong1 <zhenzhong.xu@intel.com>
Signed-off-by: Zhenzhong Xu <zhenzhong.xu@intel.com>
Signed-off-by: Xavier Aguilar <xavier.aguilarfruto@amd.com>
Signed-off-by: Amir Klein <203507526+amirkl94@users.noreply.github.com>
Signed-off-by: ZhanqiuHu <zhu@redhat.com>
Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
Signed-off-by: wangjiaxin99 <jiaxwang@amd.com>
Signed-off-by: shanjiaz <hezhao@redhat.com>
Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
Signed-off-by: Varun Sundar Rabindranath <varun-sundar-rabindranath@h100-01.nemg-001.lab.rdu2.dc.redhat.com>
Signed-off-by: Jonathan Chen <chenleejonathan@gmail.com>
Signed-off-by: Jonathan <chenleejonathan@gmail.com>
Signed-off-by: Alex <alex.tech.lab@outlook.com>
Signed-off-by: AlexHuang <jihuihuang@alexai.com>
Signed-off-by: AlexHuang <jihuihuang@future.com>
Co-authored-by: Xiaohong (Sean) Chen <seachen@redhat.com>
Co-authored-by: Yanan Cao <gmagogsfm@gmail.com>
Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com>
Co-authored-by: wentian-byte <3400259131@qq.com>
Co-authored-by: Chao-Ju Chen <ricky.chen@infinirc.com>
Co-authored-by: Bugen Zhao <i@bugenzhao.com>
Co-authored-by: Ben Browning <bbrownin@redhat.com>
Co-authored-by: Tiezhen WANG <38108242+xianbaoqian@users.noreply.github.com>
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Co-authored-by: yzong-rh <yzong@redhat.com>
Co-authored-by: Dao007forever <dao007forever@gmail.com>
Co-authored-by: Neil Schemenauer <nas-github@arctrix.com>
Co-authored-by: jpwang <jpwang@smail.nju.edu.cn>
Co-authored-by: littlecircle0730 <43994952+littlecircle0730@users.noreply.github.com>
Co-authored-by: Hsiao-Yuan Chen <hy.c@Hsiao-YuandeMacBook-Pro.local>
Co-authored-by: Or Ozeri <or@ozery.com>
Co-authored-by: Fangzhou Ai <31551580+Fangzhou-Ai@users.noreply.github.com>
Co-authored-by: vLLM Contributor <contributor@vllm.ai>
Co-authored-by: Ting SUN <suntcrick@gmail.com>
Co-authored-by: Nick Hill <nickhill123@gmail.com>
Co-authored-by: sasindharan <117493393+sasindharan@users.noreply.github.com>
Co-authored-by: Sasindharan Sankar <sasindharansankar@email.com>
Co-authored-by: Chauncey <chaunceyjiang@gmail.com>
Co-authored-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Co-authored-by: Varun Sundar Rabindranath <varunsundar08@gmail.com>
Co-authored-by: varun sundar rabindranath <vsundarr@redhat.com>
Co-authored-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Divakar Verma <137818590+divakar-amd@users.noreply.github.com>
Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com>
Co-authored-by: Martin Kukla <martin.kukla@cantab.net>
Co-authored-by: Matthew Bonanni <mbonanni@redhat.com>
Co-authored-by: Dipika Sikka <dsikka@redhat.com>
Co-authored-by: NickLucche <nlucches@redhat.com>
Co-authored-by: jiahanc <173873397+jiahanc@users.noreply.github.com>
Co-authored-by: Alec Kohlhoff <134344302+aleckohlhoff@users.noreply.github.com>
Co-authored-by: Porras Huang <20535584+porrashuang@users.noreply.github.com>
Co-authored-by: scoootscooob <167050519+scoootscooob@users.noreply.github.com>
Co-authored-by: allgather <all2allops@gmail.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: Rohan Potdar <66227218+Rohan138@users.noreply.github.com>
Co-authored-by: Yuwen Zhou <yuwen.zhou@intel.com>
Co-authored-by: Will Eaton <wseaton@users.noreply.github.com>
Co-authored-by: Ma Jian <jian1.ma@intel.com>
Co-authored-by: Michael Goin <mgoin64@gmail.com>
Co-authored-by: cjackal <44624812+cjackal@users.noreply.github.com>
Co-authored-by: JinYan Su <jinyansu792@gmail.com>
Co-authored-by: Fynn Schmitt-Ulms <fschmitt@redhat.com>
Co-authored-by: Juan Pérez de Algaba <124347725+jperezdealgaba@users.noreply.github.com>
Co-authored-by: Thomas Parnell <tpa@zurich.ibm.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
Co-authored-by: snadampal <87143774+snadampal@users.noreply.github.com>
Co-authored-by: liuzhenwei <zhenweiliu@habana.ai>
Co-authored-by: Kunshang Ji <kunshang.ji@intel.com>
Co-authored-by: Ethan Feng <ethan.fengch@gmail.com>
Co-authored-by: Thillai Chithambaram <79466435+thillai-c@users.noreply.github.com>
Co-authored-by: Mark McLoughlin <markmc@redhat.com>
Co-authored-by: Guan-Ming (Wesley) Chiu <105915352+guan404ming@users.noreply.github.com>
Co-authored-by: Srinivas Krovvidi <194645829+Srinivasoo7@users.noreply.github.com>
Co-authored-by: srinivas_oo7 <sklinkedin0120@gmail.com>
Co-authored-by: Sai Sridhar Tarra <117087864+sridhar-3009@users.noreply.github.com>
Co-authored-by: Tahsin Tunan <tahsintunan@gmail.com>
Co-authored-by: Yi Zhong <207368749+vincentzed@users.noreply.github.com>
Co-authored-by: Jonas I. Liechti <j-i-l@t4d.ch>
Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com>
Co-authored-by: Yongye Zhu <zyy1102000@gmail.com>
Co-authored-by: Ryan Rock <ryan.rock@amd.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Flora Feng <4florafeng@gmail.com>
Co-authored-by: Micah Williamson <micah.williamson@amd.com>
Co-authored-by: Kevin H. Luu <khluu000@gmail.com>
Co-authored-by: WEI CHENG CHIU <waynehacking8@gmail.com>
Co-authored-by: longguo <107740309+abinggo@users.noreply.github.com>
Co-authored-by: Roger Wang <hey@rogerw.io>
Co-authored-by: Andreas Karatzas <akaratza@amd.com>
Co-authored-by: Martin Hickey <martin.hickey@ie.ibm.com>
Co-authored-by: midas <the.anon.github@gmail.com>
Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: achyuthan.s <113010327+Achyuthan-S@users.noreply.github.com>
Co-authored-by: Xin Li <xinli-sw@users.noreply.github.com>
Co-authored-by: ShawRong <ShawRong@users.noreply.github.com>
Co-authored-by: Change72 <Change72@users.noreply.github.com>
Co-authored-by: Benjamin Chislett <chislett.ben@gmail.com>
Co-authored-by: Jeff (Junze) Ma <93145857+majunze2001@users.noreply.github.com>
Co-authored-by: Marceli Fylcek <marceli.fylcek@intel.com>
Co-authored-by: Amanzhol Salykov <asalykov@amd.com>
Co-authored-by: Michael Ma <97484148+mrn3088@users.noreply.github.com>
Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com>
Co-authored-by: Noa Neria <nneria@nvidia.com>
Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com>
Co-authored-by: maobaolong <baoloongmao@tencent.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Co-authored-by: Peter Pan <peter.pan@daocloud.io>
Co-authored-by: Sahil Singh <sahiilsiingh37@gmail.com>
Co-authored-by: Giancarlo Delfin <32987265+TheEpicDolphin@users.noreply.github.com>
Co-authored-by: FAUST <2319109590@qq.com>
Co-authored-by: Reid <61492567+reidliu41@users.noreply.github.com>
Co-authored-by: Yejing Lai <yejing.lai@intel.com>
Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com>
Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io>
Co-authored-by: Xin He <xin3.he@intel.com>
Co-authored-by: Mike G <180722391+mikekg@users.noreply.github.com>
Co-authored-by: Michael Gschwind <mgschwind@nvidia.com>
Co-authored-by: Saddss <108515797+Saddss@users.noreply.github.com>
Co-authored-by: RoyWang <Roy.Wang@amd.com>
Co-authored-by: youkaichao <youkaichao@gmail.com>
Co-authored-by: OpenAI Codex <codex@openai.com>
Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg>
Co-authored-by: Jee Jee Li <pandaleefree@gmail.com>
Co-authored-by: functionstackx <47992694+functionstackx@users.noreply.github.com>
Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai>
Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com>
Co-authored-by: Woosuk Kwon <woosuk@inferact.ai>
Co-authored-by: Zang Peiyu <166481866+factnn@users.noreply.github.com>
Co-authored-by: llx <54896441+llx-08@users.noreply.github.com>
Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com>
Co-authored-by: Itay Alroy <75032521+itayalroy@users.noreply.github.com>
Co-authored-by: xx-thomas <113865951+xx-thomas@users.noreply.github.com>
Co-authored-by: Luciano Martins <22145370+lucianommartins@users.noreply.github.com>
Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com>
Co-authored-by: gitbisector <gitbisector@gmail.com>
Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk>
Co-authored-by: Andrew Barnes <bortstheboat@gmail.com>
Co-authored-by: Jimmy Lee <58957694+thisisjimmyfb@users.noreply.github.com>
Co-authored-by: joshua abraham <132982099+JOSH1024@users.noreply.github.com>
Co-authored-by: joshua <joshua.abraham@multicorewareinc.com>
Co-authored-by: Or Ozeri <oro@il.ibm.com>
Co-authored-by: wenjun liu <wenjun.liu@intel.com>
Co-authored-by: zengxian <xiangdong.zeng@intel.com>
Co-authored-by: wangxiyuan <wangxiyuan1007@gmail.com>
Co-authored-by: Shanshan Shen <467638484@qq.com>
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: Lalithnarayan C <Lalithnarayan.C@amd.com>
Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com>
Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com>
Co-authored-by: Netanel Haber <58652339+netanel-haber@users.noreply.github.com>
Co-authored-by: Francesco Fusco <ffu@zurich.ibm.com>
Co-authored-by: Ajay Anubolu <124525760+AjAnubolu@users.noreply.github.com>
Co-authored-by: Carl Y <4531192+carlyou@users.noreply.github.com>
Co-authored-by: Tuukka Sarvi <tuukka.sarvi@amd.com>
Co-authored-by: yifei wu <50608184+abcd1927@users.noreply.github.com>
Co-authored-by: Sting Lin <sting.lin@cienet.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Co-authored-by: alexbi29 <32223381+alexbi29@users.noreply.github.com>
Co-authored-by: Alex Bilichenko <alexbi29@users.noreply.github.com>
Co-authored-by: Song Zhixin <szxfml@gmail.com>
Co-authored-by: Hongxia Yang <62075498+hongxiayang@users.noreply.github.com>
Co-authored-by: Federico <federico.iezzi@gmail.com>
Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com>
Co-authored-by: Stan Wozniak <77159600+s3woz@users.noreply.github.com>
Co-authored-by: sungsoo ha <sungsooh@nvidia.com>
Co-authored-by: Thomas Parnell <tom.parnell@gmail.com>
Co-authored-by: Dakai An <77474977+andakai@users.noreply.github.com>
Co-authored-by: Federico <fiezzi@google.com>
Co-authored-by: Cameron Quilici <cjquilici@gmail.com>
Co-authored-by: TJian <tunjian.tan@embeddedllm.com>
Co-authored-by: nehmathe2 <nehmathe@amd.com>
Co-authored-by: Divakar Verma <divakar.verma@amd.com>
Co-authored-by: liangel-02 <liangel@meta.com>
Co-authored-by: ovidiusm <ovidium@nvidia.com>
Co-authored-by: arghyadeep sarkar <args.sarkar@gmail.com>
Co-authored-by: Itay Etelis <92247226+Etelis@users.noreply.github.com>
Co-authored-by: Itay Etelis <itay.etelis@ibm.com>
Co-authored-by: Itay Etelis <Itay.etelis@gmail.com>
Co-authored-by: Joel Smith <j.smith9103@outlook.com>
Co-authored-by: Nikhilesh Chhetri <106703537+nikhilesh-csa@users.noreply.github.com>
Co-authored-by: wentian-byte <192079369+wentian-byte@users.noreply.github.com>
Co-authored-by: Angelo Ruocco <ang@zurich.ibm.com>
Co-authored-by: Oğuzhan KIR <86883236+oguzhankir@users.noreply.github.com>
Co-authored-by: Yi Liu <yi4.liu@intel.com>
Co-authored-by: n1ck-guo <heng.guo@intel.com>
Co-authored-by: Zhenzhong1 <zhenzhong.xu@intel.com>
Co-authored-by: xaguilar-amd <xavier.aguilarfruto@amd.com>
Co-authored-by: amirkl94 <203507526+amirkl94@users.noreply.github.com>
Co-authored-by: zhanqiuhu <49648934+ZhanqiuHu@users.noreply.github.com>
Co-authored-by: danisereb <daserebrenik@nvidia.com>
Co-authored-by: qli88 <qiang.li2@amd.com>
Co-authored-by: wangjiaxin99 <jiaxwang@amd.com>
Co-authored-by: Douglas Lehr <91553416+dllehr-amd@users.noreply.github.com>
Co-authored-by: shanjiaz <zsjwpianpian@gmail.com>
Co-authored-by: shanjiaz <hezhao@redhat.com>
Co-authored-by: Bryan Shan <58582368+Oseltamivir@users.noreply.github.com>
Co-authored-by: Ace Eldeib <aeldeib@coreweave.com>
Co-authored-by: Varun Sundar Rabindranath <varun-sundar-rabindranath@h100-01.nemg-001.lab.rdu2.dc.redhat.com>
Co-authored-by: Jonathan Chen <chenleejonathan@gmail.com>
Co-authored-by: AlexHuang <alex.tech.lab@outlook.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…mi300x (vllm-project#45854)

Signed-off-by: Hongxia Yang <hongxia.yang@amd.com>
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
Co-authored-by: tjtanaa <tunjian.tan@embeddedllm.com>
Labels

ready: ONLY add when PR is ready to merge/full CI is needed
rocm: Related to AMD ROCm

2 participants