Skip to content

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check#39498

Merged
tjtanaa merged 15 commits into
vllm-project:mainfrom
shantipriya-amd:fix-quark-deepseek-v32
Jun 10, 2026
Merged

[Bugfix] Add deepseek_v32 to Quark dynamic MXFP4 model type check#39498
tjtanaa merged 15 commits into
vllm-project:mainfrom
shantipriya-amd:fix-quark-deepseek-v32

Conversation

@shantipriya-amd

Copy link
Copy Markdown
Contributor

Summary

Add deepseek_v32 to _DEEPSEEK_V3_FAMILY_MODEL_TYPES so that Quark correctly enables dynamic_mxfp4_quant for amd/DeepSeek-V3.2-mxfp4.

Without this fix, excluded layers (e.g. self_attn) are not dynamically re-quantized when serving MXFP4-quantized DeepSeek-V3.2 checkpoints, causing silent correctness degradation (all-zero or garbage output tokens).

Root cause

_DEEPSEEK_V3_FAMILY_MODEL_TYPES is defined as:

_DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3"})

The maybe_update_config() method uses this set to gate dynamic_mxfp4_quant. DeepSeek-V3.2 uses model_type = "deepseek_v32" in its HuggingFace config, which is not in the set. As a result, dynamic_mxfp4_quant stays False, and get_quant_method() returns UnquantizedLinearMethod() for attention projections instead of applying the correct quantization scheme.

Fix

- _DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3"})
+ _DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3", "deepseek_v32"})

One-line change. Adding "deepseek_v32" to the frozen set enables the dynamic MXFP4 path for the V3.2 model family while preserving existing V3 behavior.

Validation

Validated on 8× AMD MI355X (gfx950) with amd/DeepSeek-V3.2-mxfp4, TP=8, attention_backend=ROCM_AITER_MLA_SPARSE, --enforce-eager.

Configuration

--tensor-parallel-size 8
--max-model-len 8192
--block-size 1
--kv-cache-dtype bfloat16
--gpu-memory-utilization 0.9
--no-enable-prefix-caching
--enforce-eager

Functional Tests

Test Result
/v1/models endpoint ✅ Model listed successfully
Basic math (What is 2+2?) ✅ Correct: 2 + 2 = 4
Short summarization (JWST paragraph → 2 sentences) ✅ Coherent 62-token summary, finish_reason=stop
Long summarization (AI history → 3-4 paragraphs) ✅ Coherent 934-token / 4-paragraph summary, finish_reason=stop

GSM8K (lm-eval, 5-shot)

Metric Value
Questions 1319
Few-shot 5
Strict-match accuracy 0.9515 ± 0.0059

Throughput Benchmark

512 random prompts (~4503 input / ~140 output tokens), concurrency 128:

Metric Value
Successful requests 479 / 512
Request throughput 2.12 req/s
Output token throughput 297 tok/s
Peak output token throughput 861 tok/s
Total token throughput 9689 tok/s
TTFT p50 / p95 1583 / 29246 ms
TPOT p50 / p95 342 / 430 ms

Notes

  • This fix was validated alongside [Bugfix] Zero-init ROCm MLA attention output buffers for graph padding #37682 (MLA zero-init). Both are required for correct DeepSeek-V3.2-mxfp4 serving on ROCm.
  • The deepseek_v32 model type is used by the official HuggingFace checkpoint at amd/DeepSeek-V3.2-mxfp4.
  • No changes to test files needed — existing Quark tests pass as-is since this only widens a model-type gate.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request updates the _DEEPSEEK_V3_FAMILY_MODEL_TYPES constant in the Quark quantization configuration to include deepseek_v32, extending support for dynamic MXFP4 re-quantization to this model variant. I have no feedback to provide.

@mergify mergify Bot added deepseek Related to DeepSeek models bug Something isn't working labels Apr 10, 2026

@BowenBao BowenBao left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks

Do you have the before and after throughput comparison?

@tjtanaa tjtanaa added the rocm Related to AMD ROCm label May 3, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD May 3, 2026

@tjtanaa tjtanaa left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shantipriya-amd please rebase. It looks good to me.

@mergify

mergify Bot commented May 3, 2026

Copy link
Copy Markdown
Contributor

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @shantipriya-amd.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 3, 2026
@shantipriya-amd shantipriya-amd force-pushed the fix-quark-deepseek-v32 branch from f6a7eba to f25db19 Compare May 4, 2026 07:01
@mergify mergify Bot removed the needs-rebase label May 4, 2026
@shantipriya-amd

Copy link
Copy Markdown
Contributor Author

Rebased onto current main — conflict resolved (new HEAD: f25db19). The merge conflict in quark.py was caused by maybe_update_config() being restructured upstream; the rebased branch restores it with _DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3", "deepseek_v32"}).

Verification: py_compile passed, 5 behavioral test cases all pass — deepseek_v3/deepseek_v32 fp4 → enabled ✅, non-fp4 / other model types → no-op ✅.

@shantipriya-amd

Copy link
Copy Markdown
Contributor Author

Rebased onto current main — conflict resolved (new HEAD: f25db194).

The merge conflict in quark.py was caused by maybe_update_config() being restructured upstream; the rebased branch restores it with:

_DEEPSEEK_V3_FAMILY_MODEL_TYPES = frozenset({"deepseek_v3", "deepseek_v32"})

Verification: py_compile passed ✅, 5 behavioral test cases all pass:

  • deepseek_v3 + fp4 weight → dynamic_mxfp4_quant = True
  • deepseek_v32 + fp4 weight → dynamic_mxfp4_quant = True
  • deepseek_v3 + non-fp4 weight → no-op ✅
  • deepseek_v32 + no quantization config → no-op ✅
  • other model type → no-op ✅
@shantipriya-amd

Copy link
Copy Markdown
Contributor Author

@mergify requeue

@mergify

mergify Bot commented May 4, 2026

Copy link
Copy Markdown
Contributor

requeue

☑️ Command disallowed due to command restrictions in the Mergify configuration.

Details
  • sender-permission >= write
@shantipriya-amd

Copy link
Copy Markdown
Contributor Author

Validation Results

Validated on 8× AMD MI355X (gfx950) with amd/DeepSeek-V3.2-mxfp4, TP=8, attention_backend=ROCM_AITER_MLA_SPARSE, --enforce-eager.

Configuration

--tensor-parallel-size 8
--max-model-len 8192
--block-size 1
--kv-cache-dtype bfloat16
--gpu-memory-utilization 0.9
--no-enable-prefix-caching
--enforce-eager

Functional Tests

Test Result
/v1/models endpoint ✅ Model listed successfully
Basic math (What is 2+2?) ✅ Correct: 2 + 2 = 4
Short summarization (JWST paragraph → 2 sentences) ✅ Coherent 62-token summary, finish_reason=stop
Long summarization (AI history → 3-4 paragraphs) ✅ Coherent 934-token / 4-paragraph summary, finish_reason=stop

GSM8K (lm-eval, 5-shot)

Metric Value
Questions 1319
Few-shot 5
Strict-match accuracy 0.9515 ± 0.0059

Throughput Benchmark

512 random prompts (~4503 input / ~140 output tokens), concurrency 128:

Metric Value
Successful requests 479 / 512
Request throughput 2.12 req/s
Output token throughput 297 tok/s
Peak output token throughput 861 tok/s
Total token throughput 9689 tok/s
TTFT p50 / p95 1583 / 29246 ms
TPOT p50 / p95 342 / 430 ms
@shantipriya-amd shantipriya-amd force-pushed the fix-quark-deepseek-v32 branch 2 times, most recently from 2da0029 to 59e5331 Compare May 4, 2026 09:34
@dllehr-amd

Copy link
Copy Markdown
Collaborator

@tjtanaa or @gshtras can you put the ready label on this PR?

@gshtras gshtras enabled auto-merge (squash) May 5, 2026 21:40
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label May 5, 2026
@mergify

mergify Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Hi @shantipriya-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
auto-merge was automatically disabled May 8, 2026 08:14

Head branch was pushed to by a user without write access

@shantipriya-amd shantipriya-amd force-pushed the fix-quark-deepseek-v32 branch 2 times, most recently from fb661f7 to 79fa7c7 Compare May 15, 2026 07:27
input_config = cast(dict[str, Any], config.get("input_tensors"))

if self._is_nvfp4(weight_config, input_config):
return QuarkNVFP4()

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shantipriya-amd was there a time where quark store fp4 weights using nvfp4 format? and does this mean right now quark models store mxfp4 weight directly?

@BowenBao could you take a look as well?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tjtanaa — yes to both questions:

Was NVFP4 ever used?
Yes — Quark has supported NVFP4 since#35859 and it remains a live code path today.

Do current Quark MXFP4 models store weights directly in MXFP4?
Yes — amd/DeepSeek-V3.2-mxfp4 stores MoE and FFN weights on disk in MXFP4 (OCP-MX format: group_size=32, scale_format=e8m0, dtype=fp4). This hits the _is_w_ocp_mx_a_x() path → QuarkOCP_MX.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @BowenBao could you please confirm?

@AndreasKaratzas

Copy link
Copy Markdown
Member

Could you please rebase?

Add null check for hf_config and type check for quant_config to ensure
robustness when handling optional configurations. Fixes pre-commit checks and
addresses code review feedback for safer attribute access.

Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>
@shantipriya-amd shantipriya-amd force-pushed the fix-quark-deepseek-v32 branch from d46cb9e to ec037ad Compare May 29, 2026 12:57
@shantipriya-amd

Copy link
Copy Markdown
Contributor Author

@AndreasKaratzas : Thanks for your suggestion, tried to rebase and update the branch but merging still pending.

@Rohan138

Rohan138 commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

@shantipriya-amd looks like #42754 is identical now and no longer needed, can you close the duplicate PR?

@shantipriya-amd

Copy link
Copy Markdown
Contributor Author

@Rohan138 : Closed the duplicate PR (#42754). Any suggestion about merging this PR.

@mergify

mergify Bot commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Hi @shantipriya-amd, the pre-commit checks have failed. Please run:

uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

@tjtanaa tjtanaa enabled auto-merge (squash) June 9, 2026 17:10
@shantipriya-amd

Copy link
Copy Markdown
Contributor Author

@tjtanaa — thank you for enabling auto-merge! Two CI checks are currently failing but both appear to be infrastructure issues unrelated to this one-line frozenset change:

amd-docker-build-test-image-and-artifacts (exit 1) — docker build failure, likely an upstream issue affecting multiple PRs
cpu-language-generation-and-pooling-model-tests (exit 1) — CPU test, unrelated to the Quark MXFP4 ROCm code path

Could you confirm if these are pre-existing failures on main and re-trigger CI when the infrastructure is stable? Happy to rebase again if needed.

@tjtanaa tjtanaa merged commit a1ec011 into vllm-project:main Jun 10, 2026
83 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Jun 10, 2026
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
…lm-project#39498)

Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
…lm-project#39498)

Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
…lm-project#39498)

Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
…lm-project#39498)

Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…lm-project#39498)

Signed-off-by: Shantipriya Parida <shantipriya.parida@amd.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working deepseek Related to DeepSeek models ready ONLY add when PR is ready to merge/full CI is needed rocm Related to AMD ROCm

6 participants