[Bugfix] Use storage_block_size in KV cache reshape for compressed specs (DeepSeek V4) #43988
Merged
njhill merged 1 commit on May 29, 2026
njhill
approved these changes
May 29, 2026
njhill
left a comment
Member
Thanks @zixi-qi
cc @MengqingCao
…ecs (DeepSeek V4) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
b40940d to
9432b56
jasonboukheir
added a commit
to jasonboukheir/vllm
that referenced
this pull request
May 29, 2026
… pools Squashed cherry-pick of @lesj0610's open PR (8 commits at head a6929d8) that introduces a per-pool KV cache layout: TOKEN_PROPORTIONAL pool for attention K/V and a compact REQUEST_CONSTANT pool for Mamba/GDN state (O(1) per request, sized by max_num_seqs * blocks_per_request). Removes the page-size unification that forced attention block_size up to match the Mamba state page on hybrids — attention now keeps its native block_size and stops fighting GDN for slots. Validated upstream on Qwen3.5/3.6 dense GDN (1.3-1.4x GPU KV tokens). mamba_cache_mode="all" keeps the legacy shared-pool path for callers that want prefix caching of mamba state. Conflict resolved in vllm/v1/core/single_type_kv_cache_manager.py: extended the PR's _should_record_new_block_ids_for_zeroing helper to include MLAAttentionSpec (added upstream in vllm-project#43988 after the PR's base). Co-Authored-By: Sungbin Lee <lesj0610@users.noreply.github.com> Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
Liuweixiong0118
pushed a commit
to Liuweixiong0118/vllm
that referenced
this pull request
Jun 1, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Liuweixiong0118 <lwx34158427@gmail.com>
hynky1999
added a commit
to macrodata-labs/vllm
that referenced
this pull request
Jun 2, 2026
* [MM] Enable FlashInfer metadata support for Qwen2.5-VL vision attention (#42787) Signed-off-by: Hua Huang <huah@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [Docs] Fix stale version number in token_embed.md (#43488) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [Docs] Fix stale version number in token_classify.md (#43489) Signed-off-by: holegots <ikun3.1415927@gmail.com> * [MoE] Migrate W4A8 CT to oracle kernel setup (#42680) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [Mooncake] Add metrics for MooncakeStoreConnector operations (#43392) * [ROCm][Critical] Fix the GDN import bug (#43486) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Revert "[Misc] add humming to dependencies" (#43492) * [Bugfix] Fix reasoning dropped on streaming boundary deltas (#42691) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [Model Runner v2] Force v1 runner for tests (#43233) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [KV Connector] Keep MooncakeStore full hits block-aligned (#43494) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [kv_offload]: Add DSv4 support (#43142) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [ROCm][CI] Stabilize 400 error return code for invalid schema inputs (#43016) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] [DSv4] [Perf] Support DeepSeek v4 MTP (#43385) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * Tuning script and configs for Triton Mamba SSU kernel (#43083) Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> * File system secondary tier implemented in python (#41735) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> * [Kernel] Add mhc_pre_big_fuse_with_norm_tilelang 
(#43474) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * fix: MoE model using shared routed experts crashes on AMD GPUs (#42373) Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> * [Docs] Reorganize offline inference docs. (#43552) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Docker] Non-root support for vllm-openai; add opt-in vllm-openai-nonroot target (#40275) Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Co-authored-by: Claude <noreply@anthropic.com> * [Feat][KVConnector] Support DSV4 in SimpleCPUOffloadBackend (#42296) Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> * [Doc] Add section on escalating stalled contributions (#43568) Signed-off-by: esmeetu <jasonailu87@gmail.com> * Reduce memory usage for granite_speech. (#42933) Signed-off-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Connector] Handle Mooncake finish after preemption (#43281) Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> * [Misc] Print accuracy value for PD tests even on success (#43583) Signed-off-by: NickLucche <nlucches@redhat.com> * [Kernel] Remove NormGateLinear (#43554) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [XPU] Ensure RNG offset alignment with PyTorch requirements in XPU sampler (#43028) Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [LoRA] Add one shot triton kernel For MoE LoRA (#42290) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> * [DeepSeek V4] Move MegaMoE input prep kernel to nvidia/ops 
(#43632) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [KV Connector][Bugfix] MooncakeStore: don't double-apply Eagle prune in load_mask (#43516) Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Connector] Propagate MooncakeStore load failures (#42788) Signed-off-by: Dao Le <Dao007forever@gmail.com> * [Bugfix] fix device mismatch in MiniCPM-o-4_5 resampler (#43194) Signed-off-by: Yan Ma <yan.ma@intel.com> * [Frontend] Split the offline inference APIs and utils. (#43553) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [Bugfix][Model] Fix GPT2ForSequenceClassification sub-module prefix (#43579) Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [GDN] GDN Prefill kernel for SM100 (#43273) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * [CPU] Enable non-divisible GQA for decode workitems in mixed batches (#43032) Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> * Upgrade tpu-inference to v0.20.0 (#43394) * Add CuTe DSL sparse compressor support (#43584) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> * [chores][log] change registry log from `warning` to `debug` (#43045) Signed-off-by: Hank <hcc.mayday@gmail.com> * [Bugfix] Apply fc_norm in Eagle3DeepseekV2 combine_hidden_states (#43482) Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [KV Transfer] Enable HMA by default for connectors that support it (#41847) Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> * [Misc][Refactor][ROCm] Convert MoRI-related envvars to extra config args (#43303) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> 
Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [Misc] Support interleaved custom image benchmark datasets (#43636) Signed-off-by: ThibaultCastells <thib.castells@icloud.com> * [Reasoning] [Bugfix] Reject invalid thinking_token_budget values (#43402) Signed-off-by: linzm1007 <linzm1007@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model] Use AutoWeightsLoader for InternLM2 (#38278) Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> * [XPU] Fix fused MoE LoRA kernel crash on XPU by using platform-agnos num_compute_units (#43646) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> * Fix CuPy runtime deps and restore humming (#43530) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Docs][ROCm] MoRI-IO Connector Usage Guide (#43603) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ROCm][CI] Extend ROCm quick reduce coverage (#40990) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feat][DSV4] Fuse q pad into deepseek v4 fused kernel (#43162) * [MoE Refactor] Migrate ModelOptMxFp8FusedMoE to oracle (#42768) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [MoE Refactor] W4a8 int8 oracle (#42789) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [ROCm] Remove MegaMoE integration in deepseek v4 (#43629) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Add LM head quantization support for ModelOpt (#42124) Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> * [Doc] Add line limit to 
AGENTS.md (#43635) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> * [DSv4] Drop _get_compressed_kv_buffer in DeepseekCompressor (#43690) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CI] Soft-fail AMD entrypoints mirror tests (#43709) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Kernel] Porting fuse_minimax_qk_norm to manual fusion (#43410) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * [KV Connector] MooncakeStore: drop dead discard_partial_chunks parameter (#43627) Signed-off-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bugfix][V1] Fix TOCTOU race causing intermittent `EADDRINUSE` on multi-API-server DP startup (#42585) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * [ci] Add arm64 ci image (#41303) Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. 
Luu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Bugfix] Split attention groups by num_heads_q for spec-decode drafts (#43543) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Rust Frontend] Add reasoning/tool parser & renderer roundtrip tests (#43582) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [ROCm][CI] Fix ROCm multimodal Qwen2.5-VL activation compile and Phi4MM ragged image mask handling (#43647) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Perf] Optimize Fp8BlockScaledMMLinearKernel input_scale tensor using new_empty() (#43677) Signed-off-by: Xin Yang <xyangx@amazon.com> * [Attention] Make FlexAttention and FlashAttention use num-blocks first layouts (#42095) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> * [MLA][Attention] Add OOT MLA prefill backend registration mechanism (#43325) Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> * [Deprecation] Deprecate functions as scheduled for v0.21.0 (#43358) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [DSv4] Refactor compressor & Fix ROCm compatibility (#43710) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Fix test_aot_compile for torch 2.12 (#43695) Signed-off-by: Angela Yi <yiangela7@gmail.com> * [KVConnector][Mooncake] Wire reset_cache cascade end-to-end (#42694) Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [ROCm][Perf] Expose AITER MoE sorting dispatch policy via env var (#39177) Signed-off-by: nholmber <nholmber@users.noreply.github.com> 
* [MRV2][BugFix] Fix KV connector handling in spec decode case (#43719) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Frontend] Add MiniCPM5 XML tool call parser (#43175) Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][GPT-OSS] Avoid repeated compile-time `cos_sin_cache.to(bf16)` casts in rotary path (#42833) Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> * [Doc] Add Ascend NPU tab to the quickstart installation guide (#43550) Signed-off-by: Aditya Singh <adisin650@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Rust Frontend] Align tool parser fallback behavior between streaming & non-streaming paths (#43662) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Docs] Fix MLA prefill backend default docs (#43697) Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> * [Kernel] Enable TritonW4A16LinearKernel as CUDA fallback for non-Marlin-aligned W4A16 shapes (#43731) Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> * [Bugfix] Map reasoning_effort to enable_thinking in chat template kwargs (#43401) Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [misc] Bump cutedsl version to 4.5.2 (#43745) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [BugFix] HFValidationError with cloud storage URIs when HF_HUB_OFFLINE=1 (#39155) Signed-off-by: Injae Ryou <injaeryou@gmail.com> * [Docs] Fix the duplicate doc icon issue (#43546) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * Fix early CUDA init (#43791) 
Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [ROCm] mori: add InterNodeV1LL inter-node kernel selection via VLLM_MORI_INTERNODE_KERNEL (#41751) Signed-off-by: jatseng-ai <jatseng@amd.com> * [8/n] Migrate merge_attn_states, mamba, sampler to torch stable ABI (continued) (#43361) Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [Quantization] Fix Humming RoutedExperts import (#43540) Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * Remove Transformers forward/backward compatibility tests (#43785) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * Validate against some config fields being set to 0 (#43794) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix][DFlash]allocate the proper number of lookahead slots (#43733) Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> * Fix Qwen3-VL and Qwen3-omni-thinker accuracy degradation from deepstack inputs under torch.compile (#43617) Signed-off-by: Dakai An <dakaian108@gmail.com> * Add @AndreasKaratzas to CODEOWNERS (#43740) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Bugfix][Kernel] TRTLLM NVFP4 MoE chunking (#43599) Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> * [ModelRunnerV2][Hybrid model] Support kernel block size in hybrid model (#38831) Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Rust Frontend] Introduce mock engine for benchmark baseline (#43469) Signed-off-by: Bugen Zhao 
<i@bugenzhao.com> * Fix RunAI streamer tensor buffer reuse during weight loading (#43464) Signed-off-by: bbartels <benjamin@bartels.dev> * [MoE] Remove inplace fused experts mechanism (#43727) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [Misc][Rocm] Remove redundant `AiterUnifiedAttentionBackend` block size log (#43664) Signed-off-by: NickLucche <nlucches@redhat.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm][CI] Stabilize Cargo cache and pre-test image checks (#43815) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * fix: parse Qwen3 XML JSON arguments first (#43243) Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> * [Bugfix] Pass `routed_scaling_factor` to FlashInfer TRTLLM BF16 MoE (#43769) * [BugFix] Fix blocked reasoning parsing with MRV2 (#43808) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Bugfix][Frontend] streaming tool-call serializer drops first args chunk when name and args share a DeltaMessage (#42683) Signed-off-by: ignaciosica <mignacio.sica@gmail.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Co-authored-by: sfeng33 <4florafeng@gmail.com> * minor docs: fix incorrect example path (#43830) Signed-off-by: JINO-ROHIT <find.jinorohit@gmail.com> * [ROCm][DSV4] Enable Tilelang MHC replacing torch/triton mhc (#43679) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> * change name of fs_python secondary tier to fs. 
(#43600) Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> * [BugFix] Fix hard-coded timeout for multi-API-server startup (#43768) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Kernel] Marlin MoE: include SM 12.x in default arch list (#40923) Signed-off-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [DSV4] Remove AMD/XPU path in deepseek_v4/nvidia (#43829) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * Restore `Literal` for `WeightTransferConfig.backend` (#43183) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix] Stream DeepSeek DSML tool-call argument deltas incrementally (#42879) Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> * [ROCm][CI] Move workload from MI300 to MI325 (#43824) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [Feature] Add support for timed trace replay in `vllm bench serve` to replay Moonshot and Alibaba workload traces (#39795) Signed-off-by: Animesh Trivedi <Animesh.Trivedi@ibm.com> * [UX] Increase DP Coordinator startup timeout from 30s to 120s (#42343) Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> * [Model][Bugfix] Rename weight_mapper to hf_to_vllm_mapper in LlamaNemotronVL pooling models (#43581) Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: opencode <noreply@opencode.ai> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> * [Bugfix][ROCm] Fix Accuracy Drop in Sparse Indexer on gfx950 (#43781) Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Co-authored-by: tjtanaa 
<tunjian.tan@embeddedllm.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> * [Bugfix] Fix HyperCLOVAX CI failure after upstream removed remote code (#43860) Signed-off-by: Kevin Luu <kevin@inferact.ai> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [CI] Auto-apply `rust` label to relevant PRs (#43866) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Feature] Add structured output and effort support to Anthropic Messages API (#42396) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * Log dummy DP step in iteration details (#41406) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> * [EC Connector] Add shutdown API to EC Connector. (#42423) Signed-off-by: omerpaz95 <omerpaz95@gmail.com> * Fix `OlmoHybridForCausalLM` not initialising (#43846) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [BUGFIX] Multimodal benchmark with MistralTokenizer (#42965) Signed-off-by: juliendenize <julien.denize@mistral.ai> Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> * [Perf] Optimize moe permute by pre-allocate buffer, 9~14% kernel performance improvement (#43014) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Perf][KDA] Fuse gate softplus, chunk-local cumsum, and RCP_LN2 scaling (#43667) Signed-off-by: haojiangzheng <justineric096@gmail.com> Co-authored-by: haojiangzheng <justineric096@gmail.com> * Add token-offset based selective offload in OffloadConnector (#39983) Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com> Co-authored-by: Or Ozeri <or@ozery.com> * [Model Refactoring] Remove torch compile dependency in DSv4 (#43746) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Bugfix][ROCm] Resolve MoRI connector hangs at high concurrency (#40344) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [CPU] Migrate 
cpu_awq into awq_marlin (#43841) Signed-off-by: jiang1.li <jiang1.li@intel.com> * [Rust Frontend] Add `hy_v3` tool parser (#43872) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Rust Frontend] Reduce Gemma4 tool parser args scan complexity (#43850) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [rust] fix: aggregate `is_sleeping` and `reset_prefix_cache` across DP engines (#43429) Signed-off-by: Will.hou <1205157517@qq.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * [Bug] Fix `tests/distributed/test_elastic_ep.py - assert False` (#43813) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Perf] Add do_not_specialize to Mamba SSD chunk kernels (#43803) Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com> Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com> * [Bugfix] Exclude Ray DP from #42585's deferred port allocation (#43864) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> * [KV Offload] Rename `SecondaryTierManager.get_finished()` to `get_finished_jobs()` (#43870) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> * [ROCm][Perf] Support N=5 in wvSplitK skinny GEMM kernels for speculative decoding (#40687) Signed-off-by: Matthias Gehre <matthias.gehre@amd.com> * [XPU][MoE] Add WNA16 oracle backend for GPTQ sym-int4 (xpu_fused_moe) (#41426) Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [ROCm] Bump ROCm to 7.2.3 (#43136) Signed-off-by: Micah Williamson <micah.williamson@amd.com> * Add Cosmos3 Reasoner model (#43356) Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: MaciejBalaNV <mbala@nvidia.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Roger Wang <hey@rogerw.io> * [Rust Frontend] Optimize multimodal prompt expansion (#43670) Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com> * 
Allow native KV cache dtype in Triton cache update (#43330) Signed-off-by: Michael Gschwind <mgschwind@nvidia.com> Co-authored-by: Michael Gschwind <mgschwind@nvidia.com> * [Attention][AMD] Standardize kv layout to blocks first for AMD (#43660) Signed-off-by: NickLucche <nlucches@redhat.com> * [ROCm] Enable the aiter top-k/top-p sampler by default (#43331) Signed-off-by: John Qin <yanyuan.qin@amd.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> * [MM][CG] Avoid over-padding Qwen2.5-VL encoder cudagraph window metadata (#42796) Signed-off-by: Hua Huang <huah@nvidia.com> * Deprecate `JAISLMHeadModel` (#43784) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Feat] Add support for per GPU worker RDMA NIC selection (#42083) Signed-off-by: Raj Joshi <rajjoshi@redhat.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Core] Cleanup KVConnector handling with PP + fix MRV2 (#43732) Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [KV Offload] Add per-request offloading policy via `on_new_request` lifecycle hook (#43205) Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Model Refactoring] Remove unncessary torch op registration for DSv4 (#43891) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Spec Decode] Allow causal DFlash (#43445) * Refactor output filename handling in ci-fetch-log.sh (#43901) Signed-off-by: Michael Goin <mgoin64@gmail.com> * [AMD][CI][BugFix] Fix Distributed Compile Unit Tests (2xH100-2xMI300) group (#43120) Signed-off-by: Randall Smith <Randall.Smith@amd.com> * fix(frontend): Add multimodal placeholders to Gemma4 tool message template (#41459) Signed-off-by: Harshal Janjani <harshaljanjani@gmail.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> * [CI] Enable prefix caching in BFCL benchmark (#43925) 
Signed-off-by: Yifan Zong <yzong@redhat.com> * [Model]Support Step-3.7-Flash (#43859) Signed-off-by: luotingdan <luotingdan@stepfun.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Co-authored-by: luotingdan <luotingdan@stepfun.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Yu Huang <yuhuang@nvidia.com> Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai> * [Rust Frontend] Add `/version` endpoint using engine-reported value (#43854) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Misc][NUMA] Auto-bind to PCT priority cores on DGX B300 + widen EngineCore across shard NUMA nodes (#43270) Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Co-authored-by: Cursor <noreply@cursor.com> * [DSv4] Move mHC tilelang kernels & Don't use CustomOP in dsv4/nvidia (#43905) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [feat] add GlmgaProcessor specific logits in `glm4_1v.py` (#43575) Signed-off-by: JaredforReal <w13431838023@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Isotr0py <Isotr0py@outlook.com> * Adjust design around encoder_cudagraph_forward (#42288) Signed-off-by: Weida Hong <wdhongtw@google.com> * [XPU] add scale transpose to prepare_fp8_moe_layer_for_xpu and bump up kernels (#43277) Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [kv_offload] Skip decode-phase blocks in CPU offload (#43797) Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> * [Refactor] Remove dead code (#43234) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [9/n] Migrate attention and cache kernels to torch stable ABI (continued) (#43717) Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Mikayla 
Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [CI] Separate non-root smoke tests from image build step (#43712) Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [XPU] add gelu_tanh to xpu moe backend supported activations (#42822) Signed-off-by: yintong-lu <yintong.lu@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CPU Backend] CPU top-k and top-p sampling kernels using Triton (#43633) Signed-off-by: Li, Tianmu <tianmu.li@intel.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm][DSv4] Remove device pipeline stall in sparse attention (#43898) Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> * [Frontend]Responses API supports chat_template_kwargs (#43761) Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> * [ROCm][CI] Fix AITER unified attention for encoder-decoder cross-attention (#43945) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [XPU] fix xpu install document triton-xpu version (#43947) Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> * [CI][ROCm] Don't skip MoRI-IO Connector tests (#43703) Signed-off-by: simondanielsson <simon.danielsson99@hotmail.com> * [XPU] support MTP of gdn attention (#43565) Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [CI] Nixl+SimpleCPUOffloadingConnector unit tests (#43871) Signed-off-by: NickLucche <nlucches@redhat.com> * [Bugfix] Fix Step3 pipeline parallel KeyError for residual tensor (#37622) Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> * [Kernel][ROCm] Native W4A16 kernel for AMD RDNA3 (gfx1100) — fp16 + bf16 (#41394) Signed-off-by: JartX <sagformas@epdcenter.es> * [Bugfix] [ROCm] [DSV4] Fix AITER MXFP4 MoE weight loading and shuffle… (#42595) Co-authored-by: MHYangAMD <MHYangAMD@users.noreply.github.com> * [ROCm][Perf] DSv3.2 MI355X TP4 decode-step 
orchestration cleanup (3 micro-opts) (#42982) Signed-off-by: Frida Andersson <fanderss@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> * [Bugfix] Corrupted MLA + linear attention (#43961) Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> * Skip docs build if PR doesn't affect docs (#43972) Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * [Bugfix][CPU] Remove invalid extra deps (#43977) Signed-off-by: jiang1.li <jiang1.li@intel.com> * Add vLLM library info to Hugging Face Hub requests (#43857) Signed-off-by: Wauplin <lucainp@gmail.com> Signed-off-by: Lucain Pouget <lucain@huggingface.co> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * docs: clarify ITL acronym in optimization docs (#43922) Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> * [Misc] added unit tests for the core pooling methods (#43818) Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> * [Bugfix] Disable allreduce_rms_fusion when pipeline_parallel_size > 1 (#43616) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude <noreply@anthropic.com> * [MoE Refactor] WNA16 MoE backend selection into oracle module (#42553) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Claude <noreply@anthropic.com> * [EPLB] Make async EPLB default (#43219) Signed-off-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> * [Bugfix] Use storage_block_size in KV cache reshape for compressed specs (DeepSeek V4) (#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [Bugfix] Fix Ray placement group allocation with grouped nodes (#43998) Signed-off-by: <conway.zhu@cohere.com> Signed-off-by: root <conway.zhu@cohere.com> * [Bug] Fix torch device issue for MOE permute (#44005) Signed-off-by: yewentao256 
<zhyanwentao@126.com> * [CI] Make Model Executor test hangs fail fast with a traceback (#43971) Signed-off-by: khluu <khluu000@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [CI] Remove redundant test_chat_with_tool_reasoning.py (#44011) Signed-off-by: sfeng33 <4florafeng@gmail.com> * Add @khluu to CODEOWNERS (#44019) Signed-off-by: Kevin H. Luu <khluu000@gmail.com> * [Feature] SSL support for dp supervisor (#43688) Signed-off-by: yewentao256 <zhyanwentao@126.com> * [Metrics] Exclude KV transfer tokens from iteration_tokens_total (#43346) Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * [Fronten] Clean up stop_token_ids override for Harmony (#44009) Signed-off-by: Yifan Zong <yzong@redhat.com> * [MoE Refactor] Migrate MoeWNA16Method quantization to MK oracle (#42647) Signed-off-by: Bill Nell <bnell@redhat.com> Co-authored-by: Claude <noreply@anthropic.com> * [MoE Refactor] Remove supports_expert_map (#43108) Signed-off-by: Bill Nell <bnell@redhat.com> * [CI] Remove duplicate Harmony test coverage (#44023) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [CI] Fix smoke test step key to bypass block gate (#43974) Signed-off-by: khluu <khluu000@gmail.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Revert "[MoE Refactor] Migrate MoeWNA16Method quantization to MK orac… (#44033) Signed-off-by: Bill Nell <bnell@redhat.com> * [PERF]MiniMax-M2 gate kernel (#38445) Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com> Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com> * offload prompt_embeds decode in render_prompts_async to avoid blocking (#43792) Signed-off-by: Gagan Dhakrey <gagandhakrey@gmail.com> * [Refactor] Remove dead current_tool_name_sent assignments from tool parsers (#43997) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [ROCm][CI] Fix failure in 
the Phi3V pooling test (#44028) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [ROCm] cmake: support PYTORCH_FOUND_HIP for torch 2.13 native HIP language support (#43881) Signed-off-by: nemanjaudovic <nudovic@amd.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> * [BugFix][Platform] Fix import vllm.platforms.rocm error on non-CUDA test_gpt_oss.py (#43571) Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [Bugfix] Fix RMSNorm kernels to multiply in weight's native dtype (#42379) Signed-off-by: Lanze Liu <lanzetech@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [ROCm] Add attention sink support to AITer flash attention backend (#43817) Signed-off-by: Xiaoran Chen <xiaoran@fb.com> Co-authored-by: Xiaoran Chen <xiaoran@fb.com> * [Governance] Add @BugenZhao as Rust frontend code owner (#44047) Signed-off-by: Bugen Zhao <i@bugenzhao.com> * [Bug] Fix gemma4 MTP IMA issue when TP>1, `CUDA error: an illegal memory access was encountered` (#43909) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [MRV2] Support breakable CUDA graph (#44050) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [CPU][Zen] Route W8A8 and W4A16 linear inference through zentorch on AMD Zen CPUs (#41813) Signed-off-by: R <Ganesh.R@amd.com> Signed-off-by: Harshal Adhav <harshal.adhav@amd.com> Signed-off-by: Aakar Dwivedi <aadwived@amd.com> Co-authored-by: R <Ganesh.R@amd.com> Co-authored-by: Harshal Adhav <harshal.adhav@amd.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> * [CI/Build] Enable Step3p7ForConditionalGeneration testing (#43956) Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> * docs: fix MLA attention docstring examples (#44118) Co-authored-by: nightcityblade <nightcityblade@gmail.com> * [Misc] Use VLLMValidationError consistently in chat 
completion and completion protocol validators (#36254) Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com> * [MRV2] Remove Eagle's dedicated CUDA graph pool (#44078) Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> * [BugFix] Fix `_has_module` to verify native deps via trial import (#44035) Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: esmeetu <jasonailu87@gmail.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> * [Docs] Replace broken video url in examples (#44159) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> * [CPU][RISC-V] Add missing RVV cpu_types helpers for WNA16 (#42730) Signed-off-by: wcy <233313160abc@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> * fix: glm5.1 pp model loading (#42944) Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> * [Frontend] Resettle generative scoring entrypoint. (#44153) Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> * [Rust Frontend] Add InternLM2 tool parser (#43481) Signed-off-by: Will.hou <1205157517@qq.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Bugfix] fix wrong partial_rotary_factor calculation for bailing_moe model. 
(#43770) Signed-off-by: zzt <zengzetang.zzt@antgroup.com> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> * [XPU][CI] Fix test_audio_in_video flake by using module-scoped server fixture (#44146) Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> * [Perf] Optimize cutlass fp8 scaled mm bypassing padding, 20% kernel performance improvement (#43706) Signed-off-by: yewentao256 <zhyanwentao@126.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> * [Feature] Add support for JetBrains' Mellum v2 code generation model (#43992) Signed-off-by: Madeesh Kannan <madeeswaran.kannan@jetbrains.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> * [Kernel][DSv4] Optimize sparse FP8 compressor kernels (#44161) Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> * [ROCm][CI] Fix and stabilize EAGLE3 acceptance tests (#41294) Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> * [Rust Frontend] Support streaming `generate` endpoint (#43779) Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai> Co-authored-by: Bugen Zhao <i@bugenzhao.com> * [Frontend][Core] Add sparse NCCL weight transfer support for in-place updates (#40096) Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Co-authored-by: OpenAI Codex <codex@openai.com> * [BugFix][CI] Fix added `_has_module` tests (#44248) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [Test][BugFix] Fix double-BOS in PD+specdec acceptance test (#44234) Signed-off-by: Nick Hill <nickhill123@gmail.com> * [DSV4] Remove unncessary classes & functions (#44246) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [ROCm][CI] Skip unbacked dynamic shapes tests on PyTorch < 2.11 (#44256) Signed-off-by: JartX <sagformas@epdcenter.es> * [DSV4] Refactor RoPE initialization (#44262) Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> * [Bugfix][Mooncake] Release GPU 
pin on failed store in MooncakeStoreConnector (#43742) Signed-off-by: Dao Le <Dao007forever@gmail.com> Co-authored-by: Claude <noreply@anthropic.com> * [ROCm] Upgrade AITER to v0.1.13.post1 (#44265) Signed-off-by: Micah Williamson <micah.williamson@amd.com> * [Bugfix][CI] Normalize NIXL connector CUDA wheel installs (#44266) Signed-off-by: Alec Flowers <aflowers@nvidia.com> * [Refactor] Move unstreamed tool-arg flush from serving layer to parser (#44017) Signed-off-by: sfeng33 <4florafeng@gmail.com> * [CI] Stabilize OpenAI schema fuzzing for malformed structural tags (#44131) Signed-off-by: Andreas Karatzas <akaratza@amd.com> * [BugFix] Fix TypeError in MiniCPM-O audio feature unpadding (#38053) Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com> Signed-off-by: wjinxu <1299461899@qq.com> Signed-off-by: Kc Balusu <kcbalusu@users.noreply.github.com> Co-authored-by: wjinxu <1299461899@qq.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Kc Balusu <kcbalusu@users.noreply.github.com> * [BugFix][kv_offload]: Prevent offloading stale sliding window blocks (#42959) Signed-off-by: Or Ozeri <oro@il.ibm.com> * [XPU][Bugfix] Fix per_token_group_fp8_quant missing dummy args on XPU (#43930) Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> * [MM][CG] Profile encoder CUDA graph pool memory (#41714) Signed-off-by: JooHo Lee <jooho414@gmail.com> * [Bugfix] Convert Gemma4-MM ViT linear layers to vllm native impl (#43798) Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com> Co-authored-by: B-201 <Joy25810@foxmail.com> * [Model Runner V2] Support zeroing freshly allocated KV blocks for hybrid + fp8 KVCache (#43990) Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> * [Model Runner V2] Use actual batch max_seq_len for attn metadata (#43991) Signed-off-by: zhuhaoran 
<zhuhaoran.zhr@alibaba-inc.com> * [Refactor] Unify reasoning + tool-call parsing behind Parser.parse() (#44267) Signed-off-by: sfeng33 <4florafeng@gmail.com> --------- Signed-off-by: Hua Huang <huah@nvidia.com> Signed-off-by: holegots <ikun3.1415927@gmail.com> Signed-off-by: Siddharth Bedekar <bedeksid@gmail.com> Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com> Signed-off-by: sfeng33 <4florafeng@gmail.com> Signed-off-by: yewentao256 <zhyanwentao@126.com> Signed-off-by: Dao Le <daole@inferact.ai> Signed-off-by: Dao Le <Dao007forever@gmail.com> Signed-off-by: Or Ozeri <oro@il.ibm.com> Signed-off-by: Andreas Karatzas <akaratza@amd.com> Signed-off-by: Banani Ghosh <bg2502@nyu.edu> Signed-off-by: Daniel Serebrenik <daserebrenik@nvidia.com> Signed-off-by: Rotem Shavitt <rshavitt@gmail.com> Signed-off-by: Jee Jee Li <jeejeelee@inferact.ai> Signed-off-by: weizhou.lan@daocloud.io <weizhou.lan@daocloud.io> Signed-off-by: wang.yuqi <yuqi.wang@daocloud.io> Signed-off-by: wang.yuqi <noooop@126.com> Signed-off-by: TheDuyIT <nduy250299@gmail.com> Signed-off-by: dtnguyen <dtnguyen@nvidia.com> Signed-off-by: Yifan Qiao <yifanqiao@inferact.ai> Signed-off-by: esmeetu <jasonailu87@gmail.com> Signed-off-by: Yihuki <wangbovbvb@gmail.com> Signed-off-by: Zhewen Li <zhewenli@inferact.ai> Signed-off-by: NickLucche <nlucches@redhat.com> Signed-off-by: chaojun-zhang <chaojun.zhang@intel.com> Signed-off-by: Chaojun Zhang <chaojun.zhang@intel.com> Signed-off-by: Jee Jee Li <pandaleefree@gmail.com> Signed-off-by: Woosuk Kwon <woosuk@inferact.ai> Signed-off-by: Yan Ma <yan.ma@intel.com> Signed-off-by: QingZhou-YangHY <3868850350@qq.com> Signed-off-by: Thien Tran <gau.nernst@yahoo.com.sg> Signed-off-by: zhejiangxiaomai <zhenhui.zhao@intel.com> Signed-off-by: Yongye Zhu <zyy1102000@gmail.com> Signed-off-by: Hank <hcc.mayday@gmail.com> Signed-off-by: Yubo Wang <yubowang2019@gmail.com> Signed-off-by: Ethan Feng <ethan.fengch@gmail.com> Signed-off-by: simondanielsson 
<simon.danielsson99@hotmail.com> Signed-off-by: ThibaultCastells <thib.castells@icloud.com> Signed-off-by: linzm1007 <linzm1007@126.com> Signed-off-by: Jesus De Jesus <dejesus.9297@gmail.com> Signed-off-by: javierdejesusda <javier.dejesusj9@gmail.com> Signed-off-by: Chaojun,Zhang <chaojun.zhang@intel.com> Signed-off-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Signed-off-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Signed-off-by: Bill Nell <bnell@redhat.com> Signed-off-by: weimingc <17592131+meenchen@users.noreply.github.com> Signed-off-by: Mark McLoughlin <markmc@redhat.com> Signed-off-by: Kevin Luu <kevin@inferact.ai> Signed-off-by: Zhewen Li <zhewen@inferact.ai> Signed-off-by: Vadim Gimpelson <vadim.gimpelson@gmail.com> Signed-off-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Signed-off-by: khluu <khluu000@gmail.com> Signed-off-by: Kevin H. Luu <khluu000@gmail.com> Signed-off-by: Luciano Martins <lucianommartins@users.noreply.github.com> Signed-off-by: Bugen Zhao <i@bugenzhao.com> Signed-off-by: Xin Yang <xyangx@amazon.com> Signed-off-by: Lucas Wilkinson <lwilkins@redhat.com> Signed-off-by: Matthew Bonanni <mbonanni@redhat.com> Signed-off-by: Angela Yi <yiangela7@gmail.com> Signed-off-by: aoshen524 <aoshen524@gmail.com> Signed-off-by: Ao Shen <aoshen@inferact.ai> Signed-off-by: nholmber <nholmber@users.noreply.github.com> Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: zhangtao <zhangtao2@modelbest.cn> Signed-off-by: zhangtao2 <zhangtao2@modelbest.cn> Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com> Signed-off-by: Aditya Singh <adisin650@gmail.com> Signed-off-by: Ashwin Giridharan <girida@amazon.com> Signed-off-by: Chauncey <chaunceyjiang@gmail.com> Signed-off-by: Injae Ryou <injaeryou@gmail.com> Signed-off-by: chunyang.wen <chunyang.wen@gmail.com> Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Signed-off-by: jatseng-ai <jatseng@amd.com> 
Signed-off-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Signed-off-by: Chris Leonard <chleonar@redhat.com> Signed-off-by: Minh Vu <vuhoangminh97@gmail.com> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Signed-off-by: Benjamin Chislett <chislett.ben@gmail.com> Signed-off-by: Dakai An <dakaian108@gmail.com> Signed-off-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> Signed-off-by: MengqingCao <cmq0113@163.com> Signed-off-by: Mengqing Cao <cmq0113@163.com> Signed-off-by: bbartels <benjamin@bartels.dev> Signed-off-by: Yufeng He <40085740+he-yufeng@users.noreply.github.com> Signed-off-by: ignaciosica <mignacio.sica@gmail.com> Signed-off-by: JINO-ROHIT <find.jinorohit@gmail.com> Signed-off-by: Tony Liu <tonyliu0512@gmail.com> Signed-off-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Signed-off-by: Animesh Trivedi <Animesh.Trivedi@ibm.com> Signed-off-by: wzhao18 <wzhao18.sz@gmail.com> Signed-off-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Signed-off-by: vllmellm <vllm.ellm@embeddedllm.com> Signed-off-by: kliuae <kuanfu.liu@embeddedllm.com> Signed-off-by: chaunceyjiang <chaunceyjiang@gmail.com> Signed-off-by: omerpaz95 <omerpaz95@gmail.com> Signed-off-by: juliendenize <julien.denize@mistral.ai> Signed-off-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> Signed-off-by: haojiangzheng <justineric096@gmail.com> Signed-off-by: Angelo Ruocco <ang@zurich.ibm.com> Signed-off-by: jiang1.li <jiang1.li@intel.com> Signed-off-by: Will.hou <1205157517@qq.com> Signed-off-by: Majid Taheri Andani <tahemaji@amazon.com> Signed-off-by: Ronen Schaffer <ronen.schaffer@ibm.com> Signed-off-by: Matthias Gehre <matthias.gehre@amd.com> Signed-off-by: Micah Williamson <micah.williamson@amd.com> Signed-off-by: Maciej Bala <mbala@nvidia.com> Signed-off-by: MaciejBalaNV <mbala@nvidia.com> Signed-off-by: RickyChen / 陳昭儒 <ricky.chen@infinirc.com> Signed-off-by: Michael Gschwind <mgschwind@nvidia.com> Signed-off-by: John Qin <yanyuan.qin@amd.com> 
Signed-off-by: Raj Joshi <rajjoshi@redhat.com> Signed-off-by: Michael Goin <mgoin64@gmail.com> Signed-off-by: Randall Smith <Randall.Smith@amd.com> Signed-off-by: Harshal Janjani <harshaljanjani@gmail.com> Signed-off-by: Yifan Zong <yzong@redhat.com> Signed-off-by: luotingdan <luotingdan@stepfun.com> Signed-off-by: Isotr0py <Isotr0py@outlook.com> Signed-off-by: JaredforReal <w13431838023@gmail.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Signed-off-by: Weida Hong <wdhongtw@google.com> Signed-off-by: mayuyuace <qiming1.zhang@intel.com> Signed-off-by: Itay Etelis <itay.etelis@ibm.com> Signed-off-by: yintong-lu <yintong.lu@intel.com> Signed-off-by: Li, Tianmu <tianmu.li@intel.com> Signed-off-by: Kunshang Ji <kunshang.ji@intel.com> Signed-off-by: JartX <sagformas@epdcenter.es> Signed-off-by: Frida Andersson <fanderss@amd.com> Signed-off-by: Wauplin <lucainp@gmail.com> Signed-off-by: Lucain Pouget <lucain@huggingface.co> Signed-off-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Signed-off-by: zixi-qi <zixi@inferact.ai> Signed-off-by: Markov Ilya <markovilya19@gmail.com> Signed-off-by: <conway.zhu@cohere.com> Signed-off-by: root <conway.zhu@cohere.com> Signed-off-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Signed-off-by: qianlihuang <91178480+qianlihuang@users.noreply.github.com> Signed-off-by: Gagan Dhakrey <gagandhakrey@gmail.com> Signed-off-by: nemanjaudovic <nudovic@amd.com> Signed-off-by: Ma, Liangliang <liangliang.ma@intel.com> Signed-off-by: Lanze Liu <lanzetech@gmail.com> Signed-off-by: Xiaoran Chen <xiaoran@fb.com> Signed-off-by: R <Ganesh.R@amd.com> Signed-off-by: Harshal Adhav <harshal.adhav@amd.com> Signed-off-by: Aakar Dwivedi <aadwived@amd.com> Signed-off-by: umut-polat <52835619+umut-polat@users.noreply.github.com> Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: wcy <233313160abc@gmail.com> Signed-off-by: UranusSeven <109661872+UranusSeven@users.noreply.github.com> Signed-off-by: zzt <zengzetang.zzt@antgroup.com> 
Signed-off-by: Madeesh Kannan <madeeswaran.kannan@jetbrains.com> Signed-off-by: xunzhuo <xunzhuo@vllm-semantic-router.ai> Signed-off-by: Alec Flowers <aflowers@nvidia.com> Signed-off-by: Krishna Chaitanya Balusu <krishnabkc15@gmail.com> Signed-off-by: wjinxu <1299461899@qq.com> Signed-off-by: Kc Balusu <kcbalusu@users.noreply.github.com> Signed-off-by: JooHo Lee <jooho414@gmail.com> Signed-off-by: zhuhaoran <zhuhaoran.zhr@alibaba-inc.com> Signed-off-by: Hynek Kydlicek <kydlicek.hynek@gmail.com> Co-authored-by: Hua Huang <huangh1994@outlook.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: Holegots <fuergaosi@gmail.com> Co-authored-by: Siddharth Bedekar <104613085+bedeks@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com> Co-authored-by: Dao007forever <dao007forever@gmail.com> Co-authored-by: TJian <tunjian.tan@embeddedllm.com> Co-authored-by: Michael Goin <mgoin64@gmail.com> Co-authored-by: Flora Feng <4florafeng@gmail.com> Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: Or Ozeri <oro@il.ibm.com> Co-authored-by: Andreas Karatzas <akaratza@amd.com> Co-authored-by: danisereb <daserebrenik@nvidia.com> Co-authored-by: Banani Ghosh <bg2502@nyu.edu> Co-authored-by: Rotem Shavitt <rshavitt@gmail.com> Co-authored-by: Jee Jee Li <pandaleefree@gmail.com> Co-authored-by: weizhoublue <45163302+weizhoublue@users.noreply.github.com> Co-authored-by: wang.yuqi <yuqi.wang@daocloud.io> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Nguyễn Thế Duy <dtnguyen@nvidia.com> Co-authored-by: Yifan Qiao <yifanqiao@inferact.ai> Co-authored-by: Roy Wang <jasonailu87@gmail.com> Co-authored-by: Yihuki <wangbovbvb@gmail.com> Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Zhewen Li <zhewenli@meta.com> Co-authored-by: Zhewen Li <zhewenli@inferact.ai> 
Co-authored-by: Nicolò Lucchesi <nlucches@redhat.com> Co-authored-by: Chaojun Zhang <chaojun.zhang@intel.com> Co-authored-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Co-authored-by: Yan Ma <yan.ma@intel.com> Co-authored-by: Huanyu Yang <20242081160@mail.dlut.edu.cn> Co-authored-by: Cyrus Leung <tlleungac@connect.ust.hk> Co-authored-by: Thien Tran <gau.nernst@yahoo.com.sg> Co-authored-by: zhao, zhenhui <zhenhui.zhao@intel.com> Co-authored-by: Sting Lin <sting.lin@cienet.com> Co-authored-by: Jie Fang <jief@nvidia.com> Co-authored-by: Yongye Zhu <zyy1102000@gmail.com> Co-authored-by: Hank_ <37239608+ILikeIneine@users.noreply.github.com> Co-authored-by: Yubo Wang <yubowang2019@gmail.com> Co-authored-by: Ethan Feng <ethan.fengch@gmail.com> Co-authored-by: Simon Danielsson <70206058+simondanielsson@users.noreply.github.com> Co-authored-by: Thibault Castells <38716394+ThibaultCastells@users.noreply.github.com> Co-authored-by: linzm1007 <96732179+linzm1007@users.noreply.github.com> Co-authored-by: Javier De Jesus <javier.dejesusj9@gmail.com> Co-authored-by: Mohammad Miadh Angkad <176301910+mmangkad@users.noreply.github.com> Co-authored-by: bnellnm <49004751+bnellnm@users.noreply.github.com> Co-authored-by: Robert Shaw <114415538+robertgshaw2-redhat@users.noreply.github.com> Co-authored-by: Wei-Ming Chen <17592131+meenchen@users.noreply.github.com> Co-authored-by: Mark McLoughlin <markmc@redhat.com> Co-authored-by: Kevin H. 
Luu <khluu000@gmail.com> Co-authored-by: Zhewen Li <zhewen@inferact.ai> Co-authored-by: Vadim Gimpelson <156319763+vadiklyutiy@users.noreply.github.com> Co-authored-by: Luciano Martins <22145370+lucianommartins@users.noreply.github.com> Co-authored-by: Luciano Martins <lucianommartins@users.noreply.github.com> Co-authored-by: Bugen Zhao <i@bugenzhao.com> Co-authored-by: Xin Yang <105740670+xyang16@users.noreply.github.com> Co-authored-by: Lucas Wilkinson <LucasWilkinson@users.noreply.github.com> Co-authored-by: Matthew Bonanni <mbonanni@redhat.com> Co-authored-by: Angela Yi <yiangela7@gmail.com> Co-authored-by: aoshen02 <aoshen@inferact.ai> Co-authored-by: aoshen524 <aoshen524@gmail.com> Co-authored-by: Nico Holmberg <nico.holmberg@amd.com> Co-authored-by: Nick Hill <nickhill123@gmail.com> Co-authored-by: zhangtao2-1 <478679312@qq.com> Co-authored-by: zhangtao <zhangtao2@modelbest.cn> Co-authored-by: Chauncey <chaunceyjiang@gmail.com> Co-authored-by: akii96 <aakif.nawaz@amd.com> Co-authored-by: Aditya Singh <60082699+adityasingh2400@users.noreply.github.com> Co-authored-by: Ashwin Giridharan <ashwing@users.noreply.github.com> Co-authored-by: Injae Ryou <injaeryou@gmail.com> Co-authored-by: Chunyang Wen <chunyang.wen@gmail.com> Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: jatseng-ai <jatseng@amd.com> Co-authored-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Co-authored-by: Minh Vu <vuhoangminh97@gmail.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: Nicolò Lucchesi <nicolo.lucchesi@gmail.com> Co-authored-by: Dakai An <77474977+andakai@users.noreply.github.com> Co-authored-by: amitz-nv <203509407+amitz-nv@users.noreply.github.com> Co-authored-by: Mengqing Cao <cmq0113@163.com> Co-authored-by: Benjamin Bartels <benjamin@bartels.dev> Co-authored-by: Yufeng He 
<40085740+he-yufeng@users.noreply.github.com> Co-authored-by: Ignacio Sica <mignacio.sica@gmail.com> Co-authored-by: JINO ROHIT <find.jinorohit@gmail.com> Co-authored-by: tonyliu312 <56969792@qq.com> Co-authored-by: Tony Liu <tonyliu0512@gmail.com> Co-authored-by: jack <QwertyJack@users.noreply.github.com> Co-authored-by: QwertyJack <7554089+QwertyJack@users.noreply.github.com> Co-authored-by: Animesh Trivedi <animesh.trivedi@gmail.com> Co-authored-by: Wei Zhao <51183510+wzhao18@users.noreply.github.com> Co-authored-by: Jakub Zakrzewski <jzakrzewski@nvidia.com> Co-authored-by: opencode <noreply@opencode.ai> Co-authored-by: tomeras91 <57313761+tomeras91@users.noreply.github.com> Co-authored-by: kliuae <17350011+kliuae@users.noreply.github.com> Co-authored-by: vllmellm <vllm.ellm@embeddedllm.com> Co-authored-by: omerpaz95 <73347585+omerpaz95@users.noreply.github.com> Co-authored-by: Julien Denize <40604584+juliendenize@users.noreply.github.com> Co-authored-by: zexplorerhj <zhjoneson@163.com> Co-authored-by: haojiangzheng <justineric096@gmail.com> Co-authored-by: Angelo Ruocco <angeloruocco90@gmail.com> Co-authored-by: Or Ozeri <or@ozery.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Co-authored-by: Will.hou <1205157517@qq.com> Co-authored-by: Majid <mjtaheri68@gmail.com> Co-authored-by: Majid Taheri Andani <tahemaji@amazon.com> Co-authored-by: Ronen Schaffer <ronen.schaffer@ibm.com> Co-authored-by: Matthias Gehre <matthias.gehre@amd.com> Co-authored-by: Jason Elie Bou Kheir <5115126+jasonboukheir@users.noreply.github.com> Co-authored-by: Kunshang Ji <kunshang.ji@intel.com> Co-authored-by: Micah Williamson <micah.williamson@amd.com> Co-authored-by: MaciejBalaNV <mbala@nvidia.com> Co-authored-by: Isotr0py <2037008807@qq.com> Co-authored-by: Roger Wang <hey@rogerw.io> Co-authored-by: Chao-Ju Chen <ricky.chen@infinirc.com> Co-authored-by: Mike G <180722391+mikekg@users.noreply.github.com> Co-authored-by: Michael Gschwind <mgschwind@nvidia.com> Co-authored-by: 
JohnQinAMD <yanyuan.qin@amd.com> Co-authored-by: Hua Huang <huah@nvidia.com> Co-authored-by: Raj Joshi <rajjoshi@g.harvard.edu> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: rasmith <Randall.Smith@amd.com> Co-authored-by: Harshal Janjani <harshaljanjani@gmail.com> Co-authored-by: Ben Browning <bbrownin@redhat.com> Co-authored-by: yzong-rh <yzong@redhat.com> Co-authored-by: ltd0924 <32387785+ltd0924@users.noreply.github.com> Co-authored-by: luotingdan <luotingdan@stepfun.com> Co-authored-by: Isotr0py <Isotr0py@outlook.com> Co-authored-by: Yu Huang <yuhuang@nvidia.com> Co-authored-by: Jee Jee Li <jeejeelee@inferact.ai> Co-authored-by: Cursor <noreply@cursor.com> Co-authored-by: Jared Wen <w13431838023@gmail.com> Co-authored-by: Weida Hong <wdhongtw@google.com> Co-authored-by: Qiming Zhang <qiming1.zhang@intel.com> Co-authored-by: Itay Etelis <92247226+Etelis@users.noreply.github.com> Co-authored-by: Itay Etelis <itay.etelis@ibm.com> Co-authored-by: Yintong Lu <yintong.lu@intel.com> Co-authored-by: Tianmu Li <tianmu.li@intel.com> Co-authored-by: Joaquín Mondéjar <111321569+JMonde@users.noreply.github.com> Co-authored-by: JartX <sagformas@epdcenter.es> Co-authored-by: MHYangAMD <meng-hsuan.yang@amd.com> Co-authored-by: MHYangAMD <MHYangAMD@users.noreply.github.com> Co-authored-by: frida-andersson <fanderss@amd.com> Co-authored-by: Lucain <lucainp@gmail.com> Co-authored-by: Taneem Ibrahim <taneem.ibrahim@gmail.com> Co-authored-by: qizixi <22851944+zixi-qi@users.noreply.github.com> Co-authored-by: Ilya Markov <markovilya197@gmail.com> Co-authored-by: Markov Ilya <markovilya19@gmail.com> Co-authored-by: Tyler Michael Smith <tyler@neuralmagic.com> Co-authored-by: czhu-cohere <conway.zhu@cohere.com> Co-authored-by: Tyler Michael Smith <tlrmchlsmth@gmail.com> Co-authored-by: Yiliu Dong <91178480+qianlihuang@users.noreply.github.com> Co-authored-by: Gagan Dhakrey <59848316+gagandhakrey@users.noreply.github.com> Co-authored-by: nemanjaudovic 
<152565955+nemanjaudovic@users.noreply.github.com> Co-authored-by: Liangliang Ma <liangliang.ma@intel.com> Co-authored-by: Lanze Liu <86434077+liulanze@users.noreply.github.com> Co-authored-by: Xiaoran <claire.rrchen@hotmail.com> Co-authored-by: Xiaoran Chen <xiaoran@fb.com> Co-authored-by: Aakar Dwivedi <82587125+aadwived@users.noreply.github.com> Co-authored-by: R <Ganesh.R@amd.com> Co-authored-by: Harshal Adhav <harshal.adhav@amd.com> Co-authored-by: nightcityblade <jackchen@haloailabs.com> Co-authored-by: nightcityblade <nightcityblade@gmail.com> Co-authored-by: Umut Polat <52835619+umut-polat@users.noreply.github.com> Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Co-authored-by: wcy <86111164+wcynb1023@users.noreply.github.com> Co-authored-by: Uranus <109661872+UranusSeven@users.noreply.github.com> Co-authored-by: zzt <mf1732009@smail.nju.edu.cn> Co-authored-by: Jiangyun Zhu <riverclouds.zhu@qq.com> Co-authored-by: Madeesh Kannan <shadeMe@users.noreply.github.com> Co-authored-by: Xunzhuo <xunzhuo@vllm-semantic-router.ai> Co-authored-by: Alec <35311602+alec-flowers@users.noreply.github.com> Co-authored-by: Krishna Chaitanya <krishnabkc15@gmail.com> Co-authored-by: wjinxu <1299461899@qq.com> Co-authored-by: Kc Balusu <kcbalusu@users.noreply.github.com> Co-authored-by: JooHo Lee <96564470+BWAAEEEK@users.noreply.github.com> Co-authored-by: ZiTian Zhao <zitian.zhao@tencentmusic.com> Co-authored-by: B-201 <Joy25810@foxmail.com> Co-authored-by: zhrrr <43847754+izhuhaoran@users.noreply.github.com>
mvanhorn
pushed a commit
to mvanhorn/vllm
that referenced
this pull request
Jun 4, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
andakai
pushed a commit
to andakai/vllm
that referenced
this pull request
Jun 4, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
JisoLya
pushed a commit
to JisoLya/vllm
that referenced
this pull request
Jun 5, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: JisoLya <523420504@qq.com>
knight0528
pushed a commit
to knight0528/vllm
that referenced
this pull request
Jun 8, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
waqahmed-amd-fi
pushed a commit
to waqahmed-amd-fi/vllm
that referenced
this pull request
Jun 10, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
divineearthly
pushed a commit
to divineearthly/vllm
that referenced
this pull request
Jun 19, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
choiceoh
added a commit
to choiceoh/vllm-dsv4
that referenced
this pull request
Jun 20, 2026
… prod image
Adds dsv4-tiera2-build/: a Dockerfile + nested unified patch that rebuilds the production DeepSeek-V4 (b12x) image (dsv4-tiera2:local) from a reviewable patch stack instead of opaque in-image string substitutions.
Layout reconciliation (the core problem): the dsv4-tiera2 branch carries its DeepSeek-V4 source in the FLAT upstream layout (cherry-picked onto DSV4@eb99b8b, pre-refactor), while the running image was built on the Aiden b12x snapshot at vLLM 1967a56 (3 days later, post-refactor) which uses a NESTED per-platform package (vllm/models/deepseek_v4/{nvidia,amd}/mtp.py) plus an AMD variant the git branch lacks. eb99b8b and 1967a56 are git-diverged (318/127); of the 18 patched files, 6 differ between snapshots for non-patch reasons and 2 are the model package itself. So the flat git files are not a 1:1 drop-in.
Decision: path-map overlay. FROM the pristine 1967a56 base (correct nested layout + b12x binaries) and apply patches/dsv4-tiera2.patch, the complete base->tiera2 delta (18 files, 54 hunks) expressed against the nested paths, encoding the same transform chain the image was built with (5 base-tier PRs + deneb#2289 leakfix + 4 new-tier PRs; vllm-project#43988 skipped).
Verified: git apply --check clean on the pristine base; AST gate over all 18 files; import smoke (vllm 0.21.1rc1.dev339+g1967a5627bc3 + every patched module); and a full vllm/ source-tree sha256 diff vs dsv4-tiera2:local = ZERO differing source files (3282 files; only regenerated .pyc bytecode differs).
Build output tag dsv4-tiera2-src:local leaves the prod dsv4-tiera2:local untouched. The b12x CUDA kernels are prebuilt binaries (base image only) so this is an overlay, not a from-scratch source build. README documents regeneration and the vLLM 0.23 port path.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
tunglinwood
pushed a commit
to tunglinwood/vllm
that referenced
this pull request
Jun 22, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nkzhenhua
pushed a commit
to nkzhenhua/vllm
that referenced
this pull request
Jun 24, 2026
…ecs (DeepSeek V4) (vllm-project#43988) Signed-off-by: zixi-qi <zixi@inferact.ai> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
MingqiWang-coder
added a commit
to vLLM-HUST/vllm-hust
that referenced
this pull request
Jun 30, 2026
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes. Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252 Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 vllm-project#40727 vllm-project#40737 vllm-project#40749 vllm-project#40961 vllm-project#41119 vllm-project#41133 vllm-project#41233 vllm-project#41237 vllm-project#41411 vllm-project#41496 vllm-project#41549 vllm-project#41674 vllm-project#41873 vllm-project#41895 vllm-project#42040 vllm-project#42112 vllm-project#42289 vllm-project#42479 vllm-project#42585 vllm-project#42692 vllm-project#42706 vllm-project#42709 vllm-project#42739 vllm-project#42967 vllm-project#43001 vllm-project#43079 vllm-project#43125 vllm-project#43160 vllm-project#43616 vllm-project#43669 vllm-project#43719 vllm-project#43768 vllm-project#43808 vllm-project#43961 vllm-project#43982 vllm-project#43988 vllm-project#43998 vllm-project#44057 vllm-project#44560 vllm-project#44574 vllm-project#44568 vllm-project#44603 vllm-project#44744 vllm-project#45195 vllm-project#45345 vllm-project#45383 vllm-project#45487 vllm-project#45564 vllm-project#45673 Runner fix (2): vllm-project#44568 vllm-project#44603 Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU) Conflict resolutions: - Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560 - Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195 - Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982 Co-authored-by: GitHub 
Copilot Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
MingqiWang-coder
added a commit
to vLLM-HUST/vllm-hust
that referenced
this pull request
Jul 2, 2026
Cherry-pick 62 bugfix/security PRs from upstream vllm-project/vllm main (2026-05-03 to 2026-06-17), covering scheduler, engine core, model runner, worker, attention, KV cache, compilation, and structured output fixes. Security (4): vllm-project#43286 vllm-project#44744 vllm-project#45118 vllm-project#45252 Bugfix (56): vllm-project#35536 vllm-project#36616 vllm-project#38895 vllm-project#39155 vllm-project#39324 vllm-project#39562 vllm-project#39805 vllm-project#40398 vllm-project#40726 Runner fix (2): vllm-project#44568 vllm-project#44603 Skipped: vllm-project#43781 (ROCm-specific, not applicable to Ascend NPU) Conflict resolutions: - Manual merge: vllm-project#43286 vllm-project#45118 vllm-project#42112 vllm-project#43160 vllm-project#43719 vllm-project#44560 - Upstream-preferred (-X theirs): vllm-project#43808 vllm-project#43988 vllm-project#42967 vllm-project#35536 vllm-project#45195 - Test files (--theirs): vllm-project#44744 vllm-project#41895 vllm-project#42040 vllm-project#41233 vllm-project#45345 vllm-project#43982 Co-authored-by: GitHub Copilot Signed-off-by: MingqiWang-coder <mingqiwang@hust.edu.cn>
Summary
_reshape_kv_cache (V2 model runner) overruns the KV-cache buffer during initialization for compressed attention specs, specifically the DeepSeek-V4 fp8 Lightning-Indexer cache. All workers crash at startup.
Reproducible with VLLM_USE_V2_MODEL_RUNNER=1, a DeepSeek-V4 model, --block-size 256, --kv-cache-dtype fp8. The overrun factor is exactly compress_ratio (128 = block_size 256 / kernel_block_size 2).
Root cause
For a compressed MLAAttentionSpec (DeepSeek-V4: compress_ratio = 128), a 256-token logical block is stored as storage_block_size = block_size // compress_ratio = 2 slots, and kernel_block_size is set to storage_block_size. But the kernel-block count per logical block was computed from block_size: block_size // kernel_block_size equals compress_ratio for a compressed spec, so kernel_num_blocks (hence kv_cache_shape[0]) is compress_ratio× too large. The page_size_padded branch then builds a strided view whose block dimension steps one full padded page (stride[0] = page_size_bytes // dtype_size) while iterating compress_ratio× too many blocks, walking past the allocation by exactly that factor. The branch's own comment notes it "assumes kv_cache_shape[0] == num_blocks", which this violates.
The buffer holds num_blocks padded pages (one per logical block); a compressed logical block is exactly one kernel block (storage_block_size slots) plus padding, not compress_ratio of them.
Introduced by #38831, which changed kv_cache_shape[0] from num_blocks to kernel_num_blocks to support hybrid kernel block sizes. Uncompressed specs were unaffected because there storage_block_size == block_size.
Fix
Compute the count from storage_block_size (what the buffer is actually laid out in) rather than block_size:
- For compressed specs, kernel_block_size == storage_block_size ⇒ the count is 1, so kv_cache_shape[0] == num_blocks, restoring the padded-stride invariant and the pre-#38831 shape.
- For non-compressed specs, KVCacheSpec.storage_block_size returns block_size, so the expression is byte-for-byte unchanged: no behavior change for any non-compressed model.
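The arithmetic behind the fix can be sketched in a few lines of plain Python. This is an illustrative model rather than the vLLM source: the helper name kernel_blocks_per_logical_block and its arguments are hypothetical, and only the DeepSeek-V4 values quoted above (block_size 256, compress_ratio 128) come from this PR.

```python
def kernel_blocks_per_logical_block(layout_block_size: int,
                                    kernel_block_size: int) -> int:
    """Hypothetical helper: kernel blocks spanned by one logical block."""
    return layout_block_size // kernel_block_size


# Compressed spec, using the DeepSeek-V4 values quoted in the description.
block_size = 256
compress_ratio = 128
storage_block_size = block_size // compress_ratio  # 2 slots per logical block
kernel_block_size = storage_block_size             # set to the storage size

# Before the fix: the count is derived from block_size.
buggy_count = kernel_blocks_per_logical_block(block_size, kernel_block_size)
# After the fix: the count is derived from storage_block_size.
fixed_count = kernel_blocks_per_logical_block(storage_block_size,
                                              kernel_block_size)

# Uncompressed spec: storage_block_size equals block_size, so both
# expressions agree (the kernel block size of 16 is an assumed value).
plain_block_size = 64
plain_storage_block_size = plain_block_size
plain_kernel_block_size = 16
plain_before = kernel_blocks_per_logical_block(plain_block_size,
                                               plain_kernel_block_size)
plain_after = kernel_blocks_per_logical_block(plain_storage_block_size,
                                              plain_kernel_block_size)

print(buggy_count, fixed_count, plain_before, plain_after)
```

With these inputs the pre-fix count equals compress_ratio and the post-fix count is 1, while the uncompressed case gives the same value either way.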
Test plan / results
- The buggy view requires 6,816,669,136 B (matches the reported error) against a 53,255,232 B allocation; the fixed view requires 53,254,672 B, which fits.
- With VLLM_USE_V2_MODEL_RUNNER=1, --block-size 256, --kv-cache-dtype fp8: without the fix the server crashes at KV-cache init (the trace above); with the fix it initializes the fp8 indexer cache and reaches Application startup complete.
- lm_eval: flexible-extract = 0.9567 against the running DeepSeek-V4-Flash server, confirming correct end-to-end execution.
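The byte counts above are consistent with a simple model of the strided view, sketched here in plain Python. The formula (offset of the last block plus one unpadded block) and the helper name view_bytes_required are assumptions made for illustration; the page size, block payload, and block count are made-up small numbers, and only the three byte counts in the final check are taken from the results above.

```python
def view_bytes_required(shape0: int, stride0_bytes: int, block_bytes: int) -> int:
    """Assumed model: bytes touched = offset of last block + one block payload."""
    return (shape0 - 1) * stride0_bytes + block_bytes


# Toy layout (hypothetical numbers, chosen only to keep the arithmetic small).
page_size_bytes = 1024   # padded page: stride of the block dimension
block_bytes = 1000       # unpadded payload of one block
num_blocks = 4
compress_ratio = 128
allocation = num_blocks * page_size_bytes

# Fixed shape: one kernel block per logical block.
fixed = view_bytes_required(num_blocks, page_size_bytes, block_bytes)
# Buggy shape: compress_ratio times too many blocks at the same stride.
buggy = view_bytes_required(num_blocks * compress_ratio,
                            page_size_bytes, block_bytes)

# The fixed view fits in the allocation; the buggy view does not.
assert fixed <= allocation < buggy

# Under this model the two requirements differ by exactly
# (compress_ratio - 1) * allocation.
assert buggy - fixed == (compress_ratio - 1) * allocation

# The byte counts reported in the test results satisfy the same identity.
reported_allocation = 53_255_232
reported_fixed = 53_254_672
reported_buggy = 6_816_669_136
assert reported_buggy - reported_fixed == (compress_ratio - 1) * reported_allocation
```

The last assertion is only a consistency check on the reported numbers; it does not recover the actual page size or block count, which the description does not state.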
Not a duplicate
Searched open PRs for _reshape_kv_cache, storage_block_size, num_blocks_per_kv_block, page_size_padded, and DeepSeek-V4 KV cache. The nearest, #43607, only optimizes block_table.py block mapping and does not touch _reshape_kv_cache; no open PR addresses this overrun.
Note
AI assistance was used for the root-cause analysis and the patch; the change has
been reviewed and is being submitted by the human author. Opened as a draft for
maintainer review.