[10c/n] Migrate MoE kernels to torch stable ABI by cleonard530 · Pull Request #44565 · vllm-project/vllm

cleonard530 · 2026-06-04T17:51:37Z

Purpose

This PR continues the libtorch stable ABI migration (see #26946) for vLLM MoE CUDA kernels by introducing _moe_C_stable_libtorch and moving all of the MoE ops (topk, align, permute/unpermute, grouped topk, and related headers) into csrc/libtorch_stable/moe/.

Note: I started using the [10x/n] label to indicate that these PRs can be merged in any order (in theory, merge conflicts are still possible in the CMakeLists.txt, ops.h, and/or torch_binding.cpp files).

cc @janeyx99 @Harry-Chen

Test Plan

pytest tests/kernels/moe/test_moe_permute_unpermute.py 
pytest tests/kernels/moe/test_fused_topk.py 
pytest tests/kernels/moe/test_topk_softplus_sqrt.py 
pytest tests/kernels/moe/test_moe_align_block_size.py 
pytest tests/kernels/moe/test_grouped_topk.py 
pytest tests/kernels/moe/test_moe.py

Test Result

Migration progress using the Audit Python extension torch-abi-audit:

main branch

  -- extensions --
    [UNSTABLE] [abi3-ok               ] _C.abi3.so  (stable_shim=0, unstable=77)
    [STABLE  ] [abi3-ok               ] _C_stable_libtorch.abi3.so  (stable_shim=81, unstable=0)
    [UNSTABLE] [abi3-ok               ] _flashmla_C.abi3.so  (stable_shim=0, unstable=72)
    [UNSTABLE] [abi3-ok               ] _flashmla_extension_C.abi3.so  (stable_shim=0, unstable=68) 
->[UNSTABLE] [abi3-ok               ] _moe_C.abi3.so  (stable_shim=0, unstable=82)
    [NO-TORCH] [abi3-ok               ] cumem_allocator.abi3.so
    [NO-TORCH] [abi3-ok               ] spinloop.abi3.so
    [UNSTABLE] [uses-private-api      ] third_party/deep_gemm/_C.cpython-312-x86_64-linux-gnu.so  (stable_shim=0, unstable=57)
    [UNSTABLE] [abi3-ok               ] vllm_flash_attn/_vllm_fa2_C.abi3.so  (stable_shim=0, unstable=84)
    [UNSTABLE] [abi3-ok               ] vllm_flash_attn/_vllm_fa3_C.abi3.so  (stable_shim=0, unstable=80)

This branch

  -- extensions --
    [UNSTABLE] [abi3-ok               ] _C.abi3.so  (stable_shim=0, unstable=73)
    [STABLE  ] [abi3-ok               ] _C_stable_libtorch.abi3.so  (stable_shim=86, unstable=0)
    [UNSTABLE] [abi3-ok               ] _flashmla_C.abi3.so  (stable_shim=0, unstable=72)
    [UNSTABLE] [abi3-ok               ] _flashmla_extension_C.abi3.so  (stable_shim=0, unstable=68)
->[STABLE  ] [abi3-ok               ] _moe_C_stable_libtorch.abi3.so  (stable_shim=70, unstable=0)
    [NO-TORCH] [abi3-ok               ] cumem_allocator.abi3.so
    [NO-TORCH] [abi3-ok               ] spinloop.abi3.so
    [UNSTABLE] [uses-private-api      ] third_party/deep_gemm/_C.cpython-312-x86_64-linux-gnu.so  (stable_shim=0, unstable=57)
    [UNSTABLE] [abi3-ok               ] vllm_flash_attn/_vllm_fa2_C.abi3.so  (stable_shim=0, unstable=84)
    [UNSTABLE] [abi3-ok               ] vllm_flash_attn/_vllm_fa3_C.abi3.so  (stable_shim=0, unstable=80)

All of _moe_C has been moved to the stable ABI.

janeyx99 · 2026-06-04T21:07:46Z


 #include <cuda_bf16.h>
 #include <cuda_runtime.h>

-#include "core/registration.h"
-#include "dsv3_router_gemm_utils.h"


delete dsv3_router_gemm_utils.h as it now appears unused by anyone in the repo.

if i am wrong about that, then it'd be better to move it to stable and make the getSMVersion change in that file if it's used more widely.

janeyx99 · 2026-06-04T21:15:38Z

 }  // namespace vllm

-std::tuple<torch::Tensor, torch::Tensor> grouped_topk(
-    torch::Tensor const& scores, int64_t n_group, int64_t topk_group,


deleting const seems unintentional here?

The migration tool uses the convention const <type>& instead of <type> const& for any rewrites like this. I checked some other files in vLLM and there doesn't seem to be a standard convention for this code base on which way to write (both appear often).

I can change it back (and in other places) if you think that helps ease the migration though.
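For reference, the two spellings name the same type, so switching between them is a style change and does not alter the signature. A minimal standalone check, using a hypothetical `Tensor` stand-in rather than the real `torch::stable::Tensor`:

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-in for a tensor type; not the real torch::stable::Tensor.
struct Tensor {
  int value = 0;
};

// `const T&` (the migration tool's convention) and `T const&` (the spelling
// in the original grouped_topk signature) denote the same type.
static_assert(std::is_same_v<const Tensor&, Tensor const&>,
              "both spellings denote a reference to const Tensor");

// Two declarations that differ only in const placement declare one function,
// so the second line below defines the function declared on the first.
int read_value(const Tensor& t);
int read_value(Tensor const& t) { return t.value; }
```

Because the check is a `static_assert`, a compiler that accepts this file has already confirmed the equivalence.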

janeyx99 · 2026-06-04T21:17:14Z

+  auto topk_values = torch::stable::new_empty(
+      scores, {num_tokens, topk}, torch::headeronly::ScalarType::Float);
+  auto topk_indices = torch::stable::new_empty(
+      scores, {num_tokens, topk}, torch::headeronly::ScalarType::Int);


these tensors should be on cuda, no?

torch::stable::new_empty uses the device of scores which should be on cuda. From vllm/vllm/_custom_ops.py,

if not current_platform.is_cuda():
    raise NotImplementedError(
        "The fused grouped_topk kernel is only available on CUDA platforms"
    )

Should I update this to be more explicit?
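A rough sketch of the behavior under discussion, and of what a more explicit check could look like. Everything here is a mock: `MockTensor`, `mock_new_empty`, and `check_on_cuda` are hypothetical names for illustration and are not part of the libtorch stable API.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical mock types; they only model device, dtype, and shape.
enum class Device { CPU, CUDA };
enum class DType { Float, Int };

struct MockTensor {
  Device device;
  DType dtype;
  std::vector<int64_t> shape;
};

// Models the behavior described in the reply: shape and dtype come from the
// arguments, while the device is copied from the `like` tensor.
MockTensor mock_new_empty(const MockTensor& like, std::vector<int64_t> shape,
                          DType dtype) {
  return MockTensor{like.device, dtype, std::move(shape)};
}

// One possible way to make the expectation explicit: fail fast when the
// input is not on CUDA instead of relying on the Python-side platform check.
void check_on_cuda(const MockTensor& t) {
  if (t.device != Device::CUDA) {
    throw std::invalid_argument("expected a CUDA tensor");
  }
}
```

In this model, an output created from a CUDA `scores` tensor is itself on CUDA, which is the point made in the reply; the explicit check only adds an earlier, clearer failure.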

janeyx99 · 2026-06-04T21:45:36Z

+    torch::stable::Tensor sorted_token_ids, torch::stable::Tensor experts_ids,
+    torch::stable::Tensor num_tokens_post_pad,
+    std::optional<torch::stable::Tensor> maybe_expert_map) {
+  const torch::stable::accelerator::DeviceGuard device_guard(


janeyx99 · 2026-06-04T21:46:22Z

-              expert_map.data_ptr<int32_t>(), num_experts, block_size,
-              topk_ids.numel(), sorted_token_ids.size(0), topk_ids.size(1),
-              has_expert_map);
+              reinterpret_cast<const scalar_t*>(topk_ids.data_ptr()),


const or mutable?
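The question comes down to whether the kernel writes through the pointer. A minimal sketch of the two casts, assuming a hypothetical `FakeTensor` whose `data_ptr()` returns an untyped `void*` (a stand-in for illustration, not the real `torch::stable::Tensor`):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in that exposes its buffer as an untyped pointer.
struct FakeTensor {
  std::vector<int32_t> storage;
  void* data_ptr() { return storage.data(); }
};

// Read-only input: casting to pointer-to-const makes accidental writes
// through `ids` a compile error.
int64_t sum_ids(FakeTensor& topk_ids) {
  const auto* ids = reinterpret_cast<const int32_t*>(topk_ids.data_ptr());
  int64_t total = 0;
  for (std::size_t i = 0; i < topk_ids.storage.size(); ++i) total += ids[i];
  return total;
}

// Output buffer: the function writes through the pointer, so the cast
// must produce a mutable pointer.
void fill_ids(FakeTensor& sorted_token_ids, int32_t value) {
  auto* out = reinterpret_cast<int32_t*>(sorted_token_ids.data_ptr());
  for (std::size_t i = 0; i < sorted_token_ids.storage.size(); ++i) {
    out[i] = value;
  }
}
```

Under this reading, inputs such as `topk_ids` would take the `const` cast and outputs such as `sorted_token_ids` the mutable one.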

janeyx99 · 2026-06-04T21:48:24Z

-#include "moeTopKFuncs.cuh"
-#include <c10/cuda/CUDAStream.h>
-#include <torch/all.h>
+#include "moe/moeTopKFuncs.cuh"


this file is all stable too, right?

Yes, moved it to the libtorch_stable directory.

janeyx99 · 2026-06-04T21:49:31Z

this file should have changes to become stable

janeyx99 · 2026-06-04T21:50:16Z

-              experts_per_warp, block_size, topk_ids.numel(),
-              cumsum_buffer.data_ptr<int32_t>(), sorted_token_ids.size(0),
-              topk_ids.size(1), has_expert_map);
+              reinterpret_cast<const scalar_t*>(topk_ids.data_ptr()),


same here, const or mutable

cleonard530 · 2026-06-09T19:08:38Z

+namespace {
+
+inline int getSMVersion() {
+  auto* props = get_device_prop();


This and only this used to be defined in "dsv3_router_gemm_utils.h"

Signed-off-by: Chris Leonard <chleonar@redhat.com>

cleonard530 · 2026-06-09T20:05:22Z

@Harry-Chen, all of moe has been moved over to the stable ABI. I renamed the library to _moe_C_stable_libtorch to emphasize that it is stable, but I can change the name back if I need to.

cleonard530 · 2026-06-09T20:08:38Z

@Harry-Chen and @janeyx99, to see the diff for the files that GitHub has marked as deleted/created when really they were just moved, check out the commit a9a466d#diff-ae0d2f513cdf90dbeae0b924311373baacb889d29a201155b02b51b6d023ee51 (ignore CMakeLists.txt on that commit, though: it was reverted to main to make the final conversion easier, so just look at the total diff for that file).

janeyx99

lgtm, pls check headers comments tho

janeyx99 · 2026-06-09T22:24:57Z

these are the headers in this now stable file:

#include "quantization/marlin/marlin.cuh"
#include "quantization/marlin/marlin_dtypes.cuh"
#include "core/scalar_type.hpp"

i think the latter 2 can be moved at least. and the first should as well?

These are still used by the kernel in csrc/quantization/marlin. That kernel should be moved in the next PR, so the headers should be moved over then.

janeyx99 · 2026-06-09T22:26:26Z

same q for migrating these

#include "quantization/marlin/marlin.cuh"
#include "quantization/marlin/marlin_dtypes.cuh"
#include "quantization/marlin/dequant.h"
#include "quantization/marlin/marlin_mma.h"
#include "core/scalar_type.hpp"

janeyx99 · 2026-06-09T22:32:44Z

-}
+STABLE_TORCH_LIBRARY_IMPL(_moe_C, CUDA, m) {
+  m.impl("moe_wna16_marlin_gemm", TORCH_BOX(&moe_wna16_marlin_gemm));
+}


add new line back

janeyx99 · 2026-06-09T22:37:26Z

+
+#include <torch/csrc/stable/tensor.h>
+
+#include "core/scalar_type.hpp"


might not be needed

…aderonly::ScalarType::Float8_e8m0fnu Signed-off-by: Chris Leonard <chleonar@redhat.com>

cleonard530 · 2026-06-10T13:41:31Z

@Harry-Chen, the failures were because torch::headeronly::ScalarType::Float8_e8m0fnu is not supported on torch 2.10, so I moved the _moe_C_stable_libtorch library up to torch 2.11. Let me know if there are any issues with that.

janeyx99 · 2026-06-10T14:26:33Z

It appears ROCm doesn't support 2.11 in CI yet, which may affect whether we need to handle this migration differently. @Harry-Chen Do you know if there's an ongoing effort to migrate ROCm CI and the timeline for it?

cleonard530 · 2026-06-10T15:00:31Z

I think the errors were caused by moe_wna16_marlin_gemm which is inside:

#ifndef USE_ROCM
  m.def(
      "moe_wna16_gemm(...

in csrc/libtorch_stable/moe/torch_bindings.cpp, so I don't think we will have a problem with ROCm here. I searched the code to see whether ROCm uses Float8_e8m0fnu or float8_e8m0fnu anywhere on the stable path and couldn't find anything.
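The argument rests on the preprocessor guard: anything registered inside `#ifndef USE_ROCM` is not compiled into a ROCm build. A minimal sketch of that pattern, where `registered_ops` and `has_op` are hypothetical helpers standing in for the real registration macros, and the list of ops outside the guard is illustrative only:

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for op registration: returns the names of the ops
// that were compiled into this build.
std::vector<std::string> registered_ops() {
  // Illustrative op assumed to sit outside the guard.
  std::vector<std::string> ops = {"moe_align_block_size"};
#ifndef USE_ROCM
  // Compiled only when USE_ROCM is not defined, i.e. in non-ROCm builds.
  ops.push_back("moe_wna16_marlin_gemm");
#endif
  return ops;
}

// Returns true if an op with the given name was compiled in.
bool has_op(const std::string& name) {
  const auto ops = registered_ops();
  return std::find(ops.begin(), ops.end(), name) != ops.end();
}
```

Compiling this with `-DUSE_ROCM` drops the guarded op from the list, which mirrors why a dtype used only by the guarded kernel would not be reached in a ROCm build.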

Harry-Chen · 2026-06-10T15:00:34Z

I do not have information on this either. CC @AndreasKaratzas @tjtanaa who may know something on this

micah-wil · 2026-06-10T18:26:54Z

Hi @Harry-Chen, we are currently giving 2.11 another round of testing after getting some fixes in. We'll bump as soon as we can- hopefully this week as long as things check out. Here's our latest torch 2.11 build in amd-ci: https://buildkite.com/vllm/amd-ci/builds/9408

cleonard530 · 2026-06-10T20:20:37Z

@Harry-Chen, none of these failures seems to be coming from this PR.

Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com>

Signed-off-by: Chris Leonard <chleonar@redhat.com> Co-authored-by: Shengqi Chen <harry-chen@outlook.com> Signed-off-by: divineearthly <divineearthly@gmail.com>

cleonard530 requested review from Harry-Chen, LucasWilkinson, dllehr-amd, hmellor, khluu, tjtanaa and tlrmchlsmth as code owners June 4, 2026 17:51
