[10c/n] Migrate MoE kernels to torch stable ABI #44565

Merged
ywang96 merged 19 commits into vllm-project:main from cleonard530:new-stable-abi-phase10c
Jun 11, 2026

@cleonard530 cleonard530 commented Jun 4, 2026

Contributor

Purpose

This PR continues the libtorch stable ABI migration (see #26946) for vLLM MoE CUDA kernels by introducing _moe_C_stable_libtorch and moving all of the MoE ops (topk, align, permute/unpermute, grouped topk, and related headers) into csrc/libtorch_stable/moe/.

Note: I started using the [10x/n] label to indicate that these PRs can be merged in any order (in practice, merge conflicts are still possible in the shared CMakeLists.txt, ops.h, and/or torch_bindings.cpp files).

cc @janeyx99 @Harry-Chen

Test Plan

pytest tests/kernels/moe/test_moe_permute_unpermute.py 
pytest tests/kernels/moe/test_fused_topk.py 
pytest tests/kernels/moe/test_topk_softplus_sqrt.py 
pytest tests/kernels/moe/test_moe_align_block_size.py 
pytest tests/kernels/moe/test_grouped_topk.py 
pytest tests/kernels/moe/test_moe.py

Test Result

[Six screenshots of test output; images not preserved]

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Migration progress using the Audit Python extension torch-abi-audit:

main branch

  -- extensions --
    [UNSTABLE] [abi3-ok               ] _C.abi3.so  (stable_shim=0, unstable=77)
    [STABLE  ] [abi3-ok               ] _C_stable_libtorch.abi3.so  (stable_shim=81, unstable=0)
    [UNSTABLE] [abi3-ok               ] _flashmla_C.abi3.so  (stable_shim=0, unstable=72)
    [UNSTABLE] [abi3-ok               ] _flashmla_extension_C.abi3.so  (stable_shim=0, unstable=68) 
->[UNSTABLE] [abi3-ok               ] _moe_C.abi3.so  (stable_shim=0, unstable=82)
    [NO-TORCH] [abi3-ok               ] cumem_allocator.abi3.so
    [NO-TORCH] [abi3-ok               ] spinloop.abi3.so
    [UNSTABLE] [uses-private-api      ] third_party/deep_gemm/_C.cpython-312-x86_64-linux-gnu.so  (stable_shim=0, unstable=57)
    [UNSTABLE] [abi3-ok               ] vllm_flash_attn/_vllm_fa2_C.abi3.so  (stable_shim=0, unstable=84)
    [UNSTABLE] [abi3-ok               ] vllm_flash_attn/_vllm_fa3_C.abi3.so  (stable_shim=0, unstable=80)

This branch

  -- extensions --
    [UNSTABLE] [abi3-ok               ] _C.abi3.so  (stable_shim=0, unstable=73)
    [STABLE  ] [abi3-ok               ] _C_stable_libtorch.abi3.so  (stable_shim=86, unstable=0)
    [UNSTABLE] [abi3-ok               ] _flashmla_C.abi3.so  (stable_shim=0, unstable=72)
    [UNSTABLE] [abi3-ok               ] _flashmla_extension_C.abi3.so  (stable_shim=0, unstable=68)
->[STABLE  ] [abi3-ok               ] _moe_C_stable_libtorch.abi3.so  (stable_shim=70, unstable=0)
    [NO-TORCH] [abi3-ok               ] cumem_allocator.abi3.so
    [NO-TORCH] [abi3-ok               ] spinloop.abi3.so
    [UNSTABLE] [uses-private-api      ] third_party/deep_gemm/_C.cpython-312-x86_64-linux-gnu.so  (stable_shim=0, unstable=57)
    [UNSTABLE] [abi3-ok               ] vllm_flash_attn/_vllm_fa2_C.abi3.so  (stable_shim=0, unstable=84)
    [UNSTABLE] [abi3-ok               ] vllm_flash_attn/_vllm_fa3_C.abi3.so  (stable_shim=0, unstable=80)

moved all of _moe_C to STABLE ABI
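
The two listings above can also be compared mechanically. The following is a minimal sketch, not part of torch-abi-audit: the regular expression, the `Extension` tuple, and the `parse_audit` and `status_changes` helpers are hypothetical names, and the line format is assumed from the output pasted in this description. The sample data is a subset of those lines.

```python
import re
from typing import Dict, NamedTuple, Optional, Tuple


class Extension(NamedTuple):
    status: str       # "STABLE", "UNSTABLE", or "NO-TORCH"
    stable_shim: int  # value printed as stable_shim=...
    unstable: int     # value printed as unstable=...


# Assumed line shape, taken from the listings above:
#   [UNSTABLE] [abi3-ok   ] _moe_C.abi3.so  (stable_shim=0, unstable=82)
# The counts are optional because NO-TORCH entries do not print them.
LINE_RE = re.compile(
    r"\[(?P<status>[A-Z-]+)\s*\]\s*"
    r"\[(?P<abi>[^\]]*)\]\s*"
    r"(?P<name>\S+)"
    r"(?:\s*\(stable_shim=(?P<shim>\d+),\s*unstable=(?P<unstable>\d+)\))?"
)


def parse_audit(text: str) -> Dict[str, Extension]:
    """Map each extension file name to its reported status and counts."""
    result = {}
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if m is None:
            continue  # headers such as "-- extensions --"
        result[m["name"]] = Extension(
            m["status"], int(m["shim"] or 0), int(m["unstable"] or 0)
        )
    return result


def status_changes(
    before: Dict[str, Extension], after: Dict[str, Extension]
) -> Dict[str, Tuple[Optional[Extension], Optional[Extension]]]:
    """Return entries that were added, removed, or changed (None = absent)."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        if before.get(name) != after.get(name):
            changes[name] = (before.get(name), after.get(name))
    return changes


MAIN = """
    [UNSTABLE] [abi3-ok               ] _C.abi3.so  (stable_shim=0, unstable=77)
    [STABLE  ] [abi3-ok               ] _C_stable_libtorch.abi3.so  (stable_shim=81, unstable=0)
->[UNSTABLE] [abi3-ok               ] _moe_C.abi3.so  (stable_shim=0, unstable=82)
    [NO-TORCH] [abi3-ok               ] spinloop.abi3.so
"""

BRANCH = """
    [UNSTABLE] [abi3-ok               ] _C.abi3.so  (stable_shim=0, unstable=73)
    [STABLE  ] [abi3-ok               ] _C_stable_libtorch.abi3.so  (stable_shim=86, unstable=0)
->[STABLE  ] [abi3-ok               ] _moe_C_stable_libtorch.abi3.so  (stable_shim=70, unstable=0)
    [NO-TORCH] [abi3-ok               ] spinloop.abi3.so
"""
```

Calling `status_changes(parse_audit(MAIN), parse_audit(BRANCH))` reports `_moe_C.abi3.so` as removed and `_moe_C_stable_libtorch.abi3.so` as added, along with the count changes in `_C` and `_C_stable_libtorch`; unchanged entries such as `spinloop.abi3.so` are omitted.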

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added the ci/build, nvidia, and rocm (Related to AMD ROCm) labels Jun 4, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Jun 4, 2026
Comment threads on csrc/libtorch_stable/moe/moe_permute_unpermute_op.cu (8 threads, 7 marked Outdated)

#include <cuda_bf16.h>
#include <cuda_runtime.h>

#include "core/registration.h"
#include "dsv3_router_gemm_utils.h"

Contributor

delete dsv3_router_gemm_utils.h as it now appears unused by anyone in the repo.

if i am wrong about that, then it'd be better to move it to stable and make the getSMVersion change in that file if it's used more widely.

} // namespace vllm

std::tuple<torch::Tensor, torch::Tensor> grouped_topk(
torch::Tensor const& scores, int64_t n_group, int64_t topk_group,

Contributor

deleting const seems unintentional here?

Contributor Author

The migration tool uses the convention const <type>& instead of <type> const& for any rewrites like this. I checked some other files in vLLM and there doesn't seem to be a standard convention for this code base on which way to write (both appear often).

I can change it back (and in other places) if you think that helps ease the migration though.

Comment on lines +1036 to +1039
auto topk_values = torch::stable::new_empty(
scores, {num_tokens, topk}, torch::headeronly::ScalarType::Float);
auto topk_indices = torch::stable::new_empty(
scores, {num_tokens, topk}, torch::headeronly::ScalarType::Int);

Contributor

these tensors should be on cuda, no?

Contributor Author

torch::stable::new_empty uses the device of scores, which should be on CUDA. From vllm/vllm/_custom_ops.py:

if not current_platform.is_cuda():
    raise NotImplementedError(
        "The fused grouped_topk kernel is only available on CUDA platforms"
    )

Contributor Author

Should I update this to be more explicit?

torch::stable::Tensor sorted_token_ids, torch::stable::Tensor experts_ids,
torch::stable::Tensor num_tokens_post_pad,
std::optional<torch::stable::Tensor> maybe_expert_map) {
const torch::stable::accelerator::DeviceGuard device_guard(

Contributor

new

expert_map.data_ptr<int32_t>(), num_experts, block_size,
topk_ids.numel(), sorted_token_ids.size(0), topk_ids.size(1),
has_expert_map);
reinterpret_cast<const scalar_t*>(topk_ids.data_ptr()),

Contributor

const or mutable?

#include "moeTopKFuncs.cuh"
#include <c10/cuda/CUDAStream.h>
#include <torch/all.h>
#include "moe/moeTopKFuncs.cuh"

Contributor

this file is all stable too, right?

Contributor Author

Yes, moved it to the libtorch_stable directory.

@janeyx99 janeyx99 Jun 4, 2026

Contributor

this file should have changes to become stable

Contributor Author

Done!

experts_per_warp, block_size, topk_ids.numel(),
cumsum_buffer.data_ptr<int32_t>(), sorted_token_ids.size(0),
topk_ids.size(1), has_expert_map);
reinterpret_cast<const scalar_t*>(topk_ids.data_ptr()),

Contributor

same here, const or mutable

namespace {

inline int getSMVersion() {
auto* props = get_device_prop();

Contributor Author

This and only this used to be defined in "dsv3_router_gemm_utils.h"

Signed-off-by: Chris Leonard <chleonar@redhat.com>
@cleonard530 cleonard530 marked this pull request as ready for review June 9, 2026 20:03
@cleonard530

Contributor Author

@Harry-Chen, all of MoE has been moved over to the stable ABI. I renamed the library to _moe_C_stable_libtorch.so to emphasize that it is stable, but I can change the name back if needed.

@cleonard530

Contributor Author

@Harry-Chen and @janeyx99, to see the diff for the files that GitHub has marked as deleted/created when they were really just moved, check out the commit a9a466d#diff-ae0d2f513cdf90dbeae0b924311373baacb889d29a201155b02b51b6d023ee51 (ignore CMakeLists.txt on that commit, though: it was reverted to main to make the final conversion easier, so you can just look at the total diff for that file).

@janeyx99 janeyx99 left a comment

Contributor

lgtm, pls check headers comments tho

Contributor

these are the headers in this now stable file:

#include "quantization/marlin/marlin.cuh"
#include "quantization/marlin/marlin_dtypes.cuh"
#include "core/scalar_type.hpp"

i think the latter 2 can be moved at least. and the first should as well?

@cleonard530 cleonard530 Jun 10, 2026

Contributor Author

These are still used by the kernel in csrc/quantization/marlin. That kernel should be moved in the next PR, so the headers can be moved over then.

Contributor

same q for migrating these

#include "quantization/marlin/marlin.cuh"
#include "quantization/marlin/marlin_dtypes.cuh"
#include "quantization/marlin/dequant.h"
#include "quantization/marlin/marlin_mma.h"
#include "core/scalar_type.hpp"

Contributor Author

Same

}
STABLE_TORCH_LIBRARY_IMPL(_moe_C, CUDA, m) {
m.impl("moe_wna16_marlin_gemm", TORCH_BOX(&moe_wna16_marlin_gemm));
} No newline at end of file

Contributor

add new line back

Comment thread csrc/libtorch_stable/moe/moe_ops.h Outdated

#include <torch/csrc/stable/tensor.h>

#include "core/scalar_type.hpp"

Contributor

might not be needed

…aderonly::ScalarType::Float8_e8m0fnu

Signed-off-by: Chris Leonard <chleonar@redhat.com>
@cleonard530

cleonard530 commented Jun 10, 2026

Contributor Author

@Harry-Chen, the failures were because torch::headeronly::ScalarType::Float8_e8m0fnu is not supported on torch 2.10, so I moved the _moe_C_stable_libtorch library up to torch 2.11. Let me know if there are any issues with that.
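
The version constraint described here can be expressed as a small check. This is only an illustrative sketch and not vLLM's build logic: the `(2, 11)` threshold is taken from the comment above, and `torch_major_minor` and `can_build_moe_stable` are hypothetical helper names. Versions are compared as integer tuples because a plain string comparison would order "2.9" after "2.11".

```python
import re
from typing import Tuple

# Assumption from the comment above: referencing
# torch::headeronly::ScalarType::Float8_e8m0fnu requires torch 2.11 or newer.
MIN_TORCH_FOR_E8M0FNU = (2, 11)


def torch_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings like '2.11.0+cu128'."""
    m = re.match(r"(\d+)\.(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognized torch version: {version!r}")
    return int(m.group(1)), int(m.group(2))


def can_build_moe_stable(version: str) -> bool:
    """True if the given torch version meets the assumed 2.11 minimum."""
    return torch_major_minor(version) >= MIN_TORCH_FOR_E8M0FNU
```

Note that this sketch treats pre-release strings such as "2.11.0a0+git123" as 2.11; a real build check may want to be stricter about nightlies.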

@janeyx99

Contributor

> @Harry-Chen, the failures were because torch::headeronly::ScalarType::Float8_e8m0fnu is not supported on torch 2.10, so I moved the _moe_C_stable_libtorch library up to torch 2.11. Let me know if there are any issues with that.

It appears ROCm doesn't support 2.11 in CI yet, which may affect whether we need to handle this migration differently. @Harry-Chen Do you know if there's an ongoing effort to migrate ROCm CI and the timeline for it?

@cleonard530

Contributor Author

> > @Harry-Chen, the failures were because torch::headeronly::ScalarType::Float8_e8m0fnu is not supported on torch 2.10, so I moved the _moe_C_stable_libtorch library up to torch 2.11. Let me know if there are any issues with that.
>
> It appears ROCm doesn't support 2.11 in CI yet, which may affect whether we need to handle this migration differently. @Harry-Chen Do you know if there's an ongoing effort to migrate ROCm CI and the timeline for it?

I think the errors were caused by moe_wna16_marlin_gemm which is inside:

#ifndef USE_ROCM
  m.def(
      "moe_wna16_gemm(...

in csrc/libtorch_stable/moe/torch_bindings.cpp, so I don't think we will have a problem with ROCm here. I searched the code to see whether ROCm uses float8_e8m0fnu anywhere on the stable path and couldn't find anything.

@Harry-Chen

Member

> > @Harry-Chen, the failures were because torch::headeronly::ScalarType::Float8_e8m0fnu is not supported on torch 2.10, so I moved the _moe_C_stable_libtorch library up to torch 2.11. Let me know if there are any issues with that.
>
> It appears ROCm doesn't support 2.11 in CI yet, which may affect whether we need to handle this migration differently. @Harry-Chen Do you know if there's an ongoing effort to migrate ROCm CI and the timeline for it?

I do not have information on this either. CC @AndreasKaratzas @tjtanaa who may know something on this

@github-project-automation github-project-automation Bot moved this from In review to Ready in NVIDIA Jun 10, 2026
@Harry-Chen Harry-Chen enabled auto-merge (squash) June 10, 2026 15:03
@cleonard530 cleonard530 changed the title from "[10c/n] Start Migrate MoE kernels to torch stable ABI" to "[10c/n] Migrate MoE kernels to torch stable ABI" Jun 10, 2026
@micah-wil

Contributor

> > > @Harry-Chen, the failures were because torch::headeronly::ScalarType::Float8_e8m0fnu is not supported on torch 2.10, so I moved the _moe_C_stable_libtorch library up to torch 2.11. Let me know if there are any issues with that.
> >
> > It appears ROCm doesn't support 2.11 in CI yet, which may affect whether we need to handle this migration differently. @Harry-Chen Do you know if there's an ongoing effort to migrate ROCm CI and the timeline for it?
>
> I do not have information on this either. CC @AndreasKaratzas @tjtanaa who may know something on this

Hi @Harry-Chen, we are currently giving 2.11 another round of testing after getting some fixes in. We'll bump as soon as we can, hopefully this week, as long as things check out. Here's our latest torch 2.11 build in amd-ci: https://buildkite.com/vllm/amd-ci/builds/9408

@cleonard530

Contributor Author

@Harry-Chen, none of these failures seems to be coming from this PR.

@ywang96 ywang96 disabled auto-merge June 11, 2026 06:02
@ywang96 ywang96 merged commit 6e64c1b into vllm-project:main Jun 11, 2026
10 of 19 checks passed
@github-project-automation github-project-automation Bot moved this from Todo to Done in AMD Jun 11, 2026
@github-project-automation github-project-automation Bot moved this from Ready to Done in NVIDIA Jun 11, 2026
Saddss pushed a commit to Saddss/vllm that referenced this pull request Jun 14, 2026
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
vivek8123 pushed a commit to odh-on-pz/vllm-upstream that referenced this pull request Jun 18, 2026
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
divineearthly pushed a commit to divineearthly/vllm that referenced this pull request Jun 19, 2026
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
Signed-off-by: divineearthly <divineearthly@gmail.com>
tunglinwood pushed a commit to tunglinwood/vllm that referenced this pull request Jun 22, 2026
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: Chris Leonard <chleonar@redhat.com>
Co-authored-by: Shengqi Chen <harry-chen@outlook.com>

Labels

ci/build · nvidia · ready (ONLY add when PR is ready to merge/full CI is needed) · rocm (Related to AMD ROCm)

5 participants