[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check by shantipriya-amd · Pull Request #39498 · vllm-project/vllm

shantipriya-amd · 2026-04-10T13:49:24Z

Summary

Add deepseek_v32 to _DEEPSEEK_V3_FAMILY_MODEL_TYPES so that Quark correctly enables dynamic_mxfp4_quant for amd/DeepSeek-V3.2-mxfp4.

Without this fix, excluded layers (e.g. self_attn) are not dynamically re-quantized when serving MXFP4-quantized DeepSeek-V3.2 checkpoints, causing silent correctness degradation (all-zero or garbage output tokens).

Root cause

_DEEPSEEK_V3_FAMILY_MODEL_TYPES is defined as:

_DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3"})

The maybe_update_config() method uses this set to gate dynamic_mxfp4_quant. DeepSeek-V3.2 uses model_type = "deepseek_v32" in its HuggingFace config, which is not in the set. As a result, dynamic_mxfp4_quant stays False, and get_quant_method() returns UnquantizedLinearMethod() for attention projections instead of applying the correct quantization scheme.

Fix

- _DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3"})
+ _DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3", "deepseek_v32"})

One-line change. Adding "deepseek_v32" to the frozen set enables the dynamic MXFP4 path for the V3.2 model family while preserving existing V3 behavior.

Validation

Validated on 8× AMD MI355X (gfx950) with amd/DeepSeek-V3.2-mxfp4, TP=8, attention_backend=ROCM_AITER_MLA_SPARSE, --enforce-eager.

Configuration

--tensor-parallel-size 8
--max-model-len 8192
--block-size 1
--kv-cache-dtype bfloat16
--gpu-memory-utilization 0.9
--no-enable-prefix-caching
--enforce-eager

Functional Tests

Test	Result
`/v1/models` endpoint	✅ Model listed successfully
Basic math (`What is 2+2?`)	✅ Correct: `2 + 2 = 4`
Short summarization (JWST paragraph → 2 sentences)	✅ Coherent 62-token summary, `finish_reason=stop`
Long summarization (AI history → 3-4 paragraphs)	✅ Coherent 934-token / 4-paragraph summary, `finish_reason=stop`

GSM8K (lm-eval, 5-shot)

Metric	Value
Questions	1319
Few-shot	5
Strict-match accuracy	0.9515 ± 0.0059

Throughput Benchmark

512 random prompts (~4503 input / ~140 output tokens), concurrency 128:

Metric	Value
Successful requests	479 / 512
Request throughput	2.12 req/s
Output token throughput	297 tok/s
Peak output token throughput	861 tok/s
Total token throughput	9689 tok/s
TTFT p50 / p95	1583 / 29246 ms
TPOT p50 / p95	342 / 430 ms

Notes

This fix was validated alongside [Bugfix] Zero-init ROCm MLA attention output buffers for graph padding #37682 (MLA zero-init). Both are required for correct DeepSeek-V3.2-mxfp4 serving on ROCm.
The deepseek_v32 model type is used by the official HuggingFace checkpoint at amd/DeepSeek-V3.2-mxfp4.
No changes to test files needed — existing Quark tests pass as-is since this only widens a model-type gate.

gemini-code-assist

Code Review

This pull request updates the _DEEPSEEK_V3_FAMILY_MODEL_TYPES constant in the Quark quantization configuration to include deepseek_v32, extending support for dynamic MXFP4 re-quantization to this model variant. I have no feedback to provide.

BowenBao

LGTM, thanks

Do you have the before and after throughput comparison?

tjtanaa

@shantipriya-amd please rebase. It looks good to me.

mergify · 2026-05-03T23:53:32Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shantipriya-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

shantipriya-amd · 2026-05-04T07:14:04Z

Rebased onto current main — conflict resolved (new HEAD: f25db19). The merge conflict in quark.py was caused by maybe_update_config() being restructured upstream; the rebased branch restores it with _DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3", "deepseek_v32"}).

Verification: py_compile passed, 5 behavioral test cases all pass — deepseek_v3/deepseek_v32 fp4 → enabled ✅, non-fp4 / other model types → no-op ✅.

shantipriya-amd · 2026-05-04T07:19:00Z

Rebased onto current main — conflict resolved (new HEAD: f25db194).

The merge conflict in quark.py was caused by maybe_update_config() being restructured upstream; the rebased branch restores it with:

_DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3", "deepseek_v32"})

Verification: py_compile passed ✅, 5 behavioral test cases all pass:

deepseek_v3 + fp4 weight → dynamic_mxfp4_quant = True ✅
deepseek_v32 + fp4 weight → dynamic_mxfp4_quant = True ✅
deepseek_v3 + non-fp4 weight → no-op ✅
deepseek_v32 + no quantization config → no-op ✅
other model type → no-op ✅

shantipriya-amd · 2026-05-04T07:19:16Z

@mergify requeue

mergify · 2026-05-04T07:19:36Z

requeue

☑️ Command disallowed due to command restrictions in the Mergify configuration.

Details

sender-permission >= write

shantipriya-amd · 2026-05-04T07:26:43Z

Validation Results

Validated on 8× AMD MI355X (gfx950) with amd/DeepSeek-V3.2-mxfp4, TP=8, attention_backend=ROCM_AITER_MLA_SPARSE, --enforce-eager.

Configuration

--tensor-parallel-size 8
--max-model-len 8192
--block-size 1
--kv-cache-dtype bfloat16
--gpu-memory-utilization 0.9
--no-enable-prefix-caching
--enforce-eager

Functional Tests

Test	Result
`/v1/models` endpoint	✅ Model listed successfully
Basic math (`What is 2+2?`)	✅ Correct: `2 + 2 = 4`
Short summarization (JWST paragraph → 2 sentences)	✅ Coherent 62-token summary, `finish_reason=stop`
Long summarization (AI history → 3-4 paragraphs)	✅ Coherent 934-token / 4-paragraph summary, `finish_reason=stop`

GSM8K (lm-eval, 5-shot)

Metric	Value
Questions	1319
Few-shot	5
Strict-match accuracy	0.9515 ± 0.0059

Throughput Benchmark

512 random prompts (~4503 input / ~140 output tokens), concurrency 128:

Metric	Value
Successful requests	479 / 512
Request throughput	2.12 req/s
Output token throughput	297 tok/s
Peak output token throughput	861 tok/s
Total token throughput	9689 tok/s
TTFT p50 / p95	1583 / 29246 ms
TPOT p50 / p95	342 / 430 ms

dllehr-amd · 2026-05-04T23:40:45Z

@tjtanaa or @gshtras can you put the ready label on this PR?

mergify · 2026-05-08T07:33:02Z

Hi @shantipriya-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

tjtanaa · 2026-05-26T07:48:11Z

        input_config = cast(dict[str, Any], config.get("input_tensors"))

-        if self._is_nvfp4(weight_config, input_config):
-            return QuarkNVFP4()


@shantipriya-amd was there a time where quark store fp4 weights using nvfp4 format? and does this mean right now quark models store mxfp4 weight directly?

@BowenBao could you take a look as well?

@tjtanaa — yes to both questions:

Was NVFP4 ever used?
Yes — Quark has supported NVFP4 since#35859 and it remains a live code path today.

Do current Quark MXFP4 models store weights directly in MXFP4?
Yes — amd/DeepSeek-V3.2-mxfp4 stores MoE and FFN weights on disk in MXFP4 (OCP-MX format: group_size=32, scale_format=e8m0, dtype=fp4). This hits the _is_w_ocp_mx_a_x() path → QuarkOCP_MX.

cc @BowenBao could you please confirm?

AndreasKaratzas · 2026-05-28T17:18:01Z

Could you please rebase?

Add null check for hf_config and type check for quant_config to ensure robustness when handling optional configurations. Fixes pre-commit checks and addresses code review feedback for safer attribute access. Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>

shantipriya-amd · 2026-06-03T10:07:15Z

@AndreasKaratzas : Thanks for your suggestion, tried to rebase and update the branch but merging still pending.

Rohan138 · 2026-06-03T18:41:44Z

@shantipriya-amd looks like #42754 is identical now and no longer needed, can you close the duplicate PR?

shantipriya-amd · 2026-06-04T07:19:05Z

@Rohan138 : Closed the duplicate PR (#42754). Any suggestion about merging this PR.

mergify · 2026-06-08T15:03:27Z

Hi @shantipriya-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

shantipriya-amd · 2026-06-09T18:10:12Z

@tjtanaa — thank you for enabling auto-merge! Two CI checks are currently failing but both appear to be infrastructure issues unrelated to this one-line frozenset change:

amd-docker-build-test-image-and-artifacts (exit 1) — docker build failure, likely an upstream issue affecting multiple PRs
cpu-language-generation-and-pooling-model-tests (exit 1) — CPU test, unrelated to the Quark MXFP4 ROCm code path

Could you confirm if these are pre-existing failures on main and re-trigger CI when the infrastructure is stable? Happy to rebase again if needed.

…lm-project#39498) Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>

…lm-project#39498) Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

…lm-project#39498) Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>

shantipriya-amd requested a review from tjtanaa as a code owner April 10, 2026 13:49

gemini-code-assist Bot reviewed Apr 10, 2026

View reviewed changes

shantipriya-amd mentioned this pull request Apr 10, 2026

[Bugfix] Zero-init ROCm MLA attention output buffers for graph padding #37682

Open

mergify Bot added deepseek Related to DeepSeek models bug Something isn't working labels Apr 10, 2026

BowenBao approved these changes Apr 24, 2026

View reviewed changes

tjtanaa added the rocm Related to AMD ROCm label May 3, 2026

github-project-automation Bot added this to AMD May 3, 2026

github-project-automation Bot moved this to Todo in AMD May 3, 2026

tjtanaa approved these changes May 3, 2026

View reviewed changes

mergify Bot added the needs-rebase label May 3, 2026

shantipriya-amd force-pushed the fix-quark-deepseek-v32 branch from f6a7eba to f25db19 Compare May 4, 2026 07:01

mergify Bot removed the needs-rebase label May 4, 2026

shantipriya-amd force-pushed the fix-quark-deepseek-v32 branch 2 times, most recently from 2da0029 to 59e5331 Compare May 4, 2026 09:34

gshtras enabled auto-merge (squash) May 5, 2026 21:40

github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label May 5, 2026

auto-merge was automatically disabled May 8, 2026 08:14
Head branch was pushed to by a user without write access

shantipriya-amd force-pushed the fix-quark-deepseek-v32 branch 2 times, most recently from fb661f7 to 79fa7c7 Compare May 15, 2026 07:27

shantipriya-amd mentioned this pull request May 15, 2026

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check #42754

Closed

tjtanaa reviewed May 26, 2026

View reviewed changes

shantipriya-amd force-pushed the fix-quark-deepseek-v32 branch from d46cb9e to ec037ad Compare May 29, 2026 12:57

shantipriya-amd and others added 5 commits May 29, 2026 17:57

Merge branch 'main' into fix-quark-deepseek-v32

259dc1e

Merge branch 'main' into fix-quark-deepseek-v32

d788304

Merge branch 'main' into fix-quark-deepseek-v32

807b493

Merge branch 'main' into fix-quark-deepseek-v32

f6ac230

Merge branch 'main' into fix-quark-deepseek-v32

6c2d9ba

shantipriya-amd added 2 commits June 5, 2026 14:58

Merge branch 'main' into fix-quark-deepseek-v32

0842789

Merge branch 'main' into fix-quark-deepseek-v32

92bbec9

shantipriya-amd and others added 2 commits June 9, 2026 19:06

Merge branch 'main' into fix-quark-deepseek-v32

cf9f58d

Merge branch 'main' into fix-quark-deepseek-v32

d22d436

tjtanaa enabled auto-merge (squash) June 9, 2026 17:10

shantipriya-amd added 2 commits June 9, 2026 20:42

Merge branch 'main' into fix-quark-deepseek-v32

07f0870

Merge branch 'main' into fix-quark-deepseek-v32

bbdf83b

shantipriya-amd added 3 commits June 9, 2026 21:34

Merge branch 'main' into fix-quark-deepseek-v32

e0a519f

Merge branch 'main' into fix-quark-deepseek-v32

4b6eddb

Merge branch 'main' into fix-quark-deepseek-v32

ea83761

tjtanaa merged commit a1ec011 into vllm-project:main Jun 10, 2026
83 checks passed

github-project-automation Bot moved this from Todo to Done in AMD Jun 10, 2026

Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check (vl…

6299b2c

…lm-project#39498) Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>

vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check (vl…

f704939

…lm-project#39498) Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>

tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check (vl…

7471835

…lm-project#39498) Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check (vl…

e91470d

…lm-project#39498) Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check#39498

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check#39498
tjtanaa merged 15 commits into
vllm-project:mainfrom
shantipriya-amd:fix-quark-deepseek-v32

shantipriya-amd commented Apr 10, 2026

gemini-code-assist Bot left a comment

BowenBao left a comment

tjtanaa left a comment

mergify Bot commented May 3, 2026

shantipriya-amd commented May 4, 2026

shantipriya-amd commented May 4, 2026

shantipriya-amd commented May 4, 2026

mergify Bot commented May 4, 2026

shantipriya-amd commented May 4, 2026

dllehr-amd commented May 4, 2026

mergify Bot commented May 8, 2026

tjtanaa May 26, 2026

shantipriya-amd May 26, 2026

AndreasKaratzas May 29, 2026

AndreasKaratzas commented May 28, 2026

shantipriya-amd commented Jun 3, 2026

Rohan138 commented Jun 3, 2026

shantipriya-amd commented Jun 4, 2026

mergify Bot commented Jun 8, 2026

shantipriya-amd commented Jun 9, 2026

Uh oh!

Labels

6 participants

Uh oh!

Uh oh!

Conversation

shantipriya-amd commented Apr 10, 2026

Summary

Root cause

Fix

Validation

Configuration

Functional Tests

GSM8K (lm-eval, 5-shot)

Throughput Benchmark

Notes

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

BowenBao left a comment

Choose a reason for hiding this comment

tjtanaa left a comment

Choose a reason for hiding this comment

mergify Bot commented May 3, 2026

shantipriya-amd commented May 4, 2026

shantipriya-amd commented May 4, 2026

shantipriya-amd commented May 4, 2026

mergify Bot commented May 4, 2026

☑️ Command disallowed due to command restrictions in the Mergify configuration.

shantipriya-amd commented May 4, 2026

Validation Results

Configuration

Functional Tests

GSM8K (lm-eval, 5-shot)

Throughput Benchmark

dllehr-amd commented May 4, 2026

mergify Bot commented May 8, 2026

tjtanaa May 26, 2026

Choose a reason for hiding this comment

shantipriya-amd May 26, 2026

Choose a reason for hiding this comment

AndreasKaratzas May 29, 2026

Choose a reason for hiding this comment

AndreasKaratzas commented May 28, 2026

shantipriya-amd commented Jun 3, 2026

Rohan138 commented Jun 3, 2026

shantipriya-amd commented Jun 4, 2026

mergify Bot commented Jun 8, 2026

shantipriya-amd commented Jun 9, 2026

Uh oh!

Labels

6 participants