[Bugfix] Re-enable FP8 MoE on NVIDIA Thor #46339
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
    bool per_out_ch) {
  int32_t version_num = get_sm_version_num();
#if defined ENABLE_CUTLASS_MOE_SM100 && ENABLE_CUTLASS_MOE_SM100
  if (version_num >= 100 && version_num < 110) {
I think cutlass_moe_mm_sm100 from grouped_mm_c3x_sm100.cu (on which you changed CUDA targets) is actually guarded here?
Yes, but the code from your PR essentially disables the ENABLE_CUTLASS_MOE_SM100 flag for SM110, which in turn makes cutlass_group_gemm_supported resolve to false and prevents the model from running.
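The interaction described in this comment can be sketched as follows. This is a minimal model under stated assumptions, not the vLLM source: `group_gemm_supported_sketch`, `moe_sm100_compiled`, and `max_sm_exclusive` are hypothetical names. The boolean stands in for the `ENABLE_CUTLASS_MOE_SM100` macro, and the upper bound is a parameter because the exact range used after this PR is not shown in the thread.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the support check discussed above.
// moe_sm100_compiled: stands in for the ENABLE_CUTLASS_MOE_SM100 macro
//                     (true when the SM100 grouped-GEMM kernels were built).
// max_sm_exclusive:   stands in for the upper bound of the runtime range check.
constexpr bool group_gemm_supported_sketch(int32_t version_num,
                                           bool moe_sm100_compiled,
                                           int32_t max_sm_exclusive) {
  if (moe_sm100_compiled) {
    // Mirrors the shape of the guarded range check in the diff.
    if (version_num >= 100 && version_num < max_sm_exclusive) {
      return true;
    }
  }
  // No compiled kernel path matched this architecture.
  return false;
}

// Compile-time checks of the cases discussed in the thread.
// Flag not set for the build: unsupported regardless of the range.
static_assert(!group_gemm_supported_sketch(110, false, 120));
// Flag set, but the range stops before 110: SM110 is still rejected.
static_assert(!group_gemm_supported_sketch(110, true, 110));
// Flag set and a (hypothetical) range that includes 110: accepted.
static_assert(group_gemm_supported_sketch(110, true, 120));

// Runtime version of the same checks, for use from a test harness.
inline void run_examples() {
  assert(group_gemm_supported_sketch(100, true, 110));
  assert(!group_gemm_supported_sketch(90, true, 120));
}
```

Both conditions have to hold for an architecture to be accepted: the kernels must be compiled in, and the runtime range must cover the reported SM version. Dropping an architecture from the build targets therefore disables it even if the range check would have let it through.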
Oh, I see... Sorry for the breakage. Can we add something like ENABLE_CUTLASS_MOE_SM110 to make its support clearer?
Unless you decide to create a new set of kernels for SM110 as well, I think breaking the 1-to-1 mapping between files and kernels would introduce some confusion. But adding a new set of kernels would also introduce a bunch of duplicate code. So I prefer to just keep it as-is.
Another way would be to simply rename all relevant flags/kernels to use sm100_to_110 instead of sm100.
Yeah I think something like sm100_or_110 makes more sense to me.
Is it ok to merge this first so we can cherry-pick this into v0.24?
Sure I think so.
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> (cherry picked from commit 24d5186)
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk> Signed-off-by: Qiang Li <qiang.li2@amd.com>
@DarkLight1337 Were you testing this PR on CUDA 12.9?
No, I was on CUDA 13
Purpose
Partially revert a change in #45277 which broke Qwen/Qwen3.5-35B-A3B-FP8 inference on NVIDIA Thor (SM101 for CUDA 12 and SM110 for CUDA 13). This parallels how cutlass_3x_gemm_sm100_fp8 is also enabled for this architecture.
Test Plan
Test Result