[Perf] fuse qk rmsnorm rope gate for qwen3.5#44176
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 410c9f80da
This looks like one of the first PRs to tackle the new RFC for putting fusions directly in model files. Given the size of the kernel and how much it adds to the line count of qwen_next.py, would it be worth having a separate fusion or kernel file to hold kernels like this? I'm looking at this from a ROCm/Aiter perspective, where we use "aiter_ops" to track many of these kernels and keep the model file focused on the overall flow. What are people's thoughts here, @SageMoore @robertgshaw2-redhat @WoosukKwon?
vadiklyutiy left a comment
LGTM.
Just one suggestion: I'd propose moving the Triton kernel and its Python interface to a separate file.
OK, I'll refactor it. cc @dllehr-amd
Hi @ZJY0516, the pre-commit checks have failed. Please run:
uv pip install pre-commit>=4.5.1
pre-commit install
pre-commit run --all-files
Then, commit the changes and push to your branch. For future commits,
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: Ekagra Ranjan <3116519+ekagra-ranjan@users.noreply.github.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: Waqar Ahmed <waqar.ahmed@amd.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com>
Signed-off-by: zjy0516 <riverclouds.zhu@qq.com> Signed-off-by: divineearthly <divineearthly@gmail.com>
Purpose
Fuse the split, QK-RMSNorm, partial RoPE, and gate copy into a single kernel launch.
Ref: lightseekorg/tokenspeed#228
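For readers unfamiliar with the four steps being fused, here is a minimal unfused reference in plain Python for a single head of a single token. It is a sketch of the semantics a fused kernel would have to reproduce, not the Triton kernel from this PR. The `[q | gate | k]` layout, the NeoX-style rotation pairing, the plain `weight` scaling in the norm, and the names `rms_norm`, `partial_rope`, and `reference_qk_norm_rope_gate` are all assumptions made for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMSNorm over one head: x / sqrt(mean(x^2) + eps) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def partial_rope(x, pos, rot_dim, base=10000.0):
    """Rotate only the first rot_dim elements; the rest pass through.

    Assumes NeoX-style pairing: element i is paired with i + rot_dim // 2.
    """
    half = rot_dim // 2
    out = list(x)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rot_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]
        out[i] = a * c - b * s
        out[i + half] = b * c + a * s
    return out


def reference_qk_norm_rope_gate(fused, pos, head_dim, rot_dim, q_weight, k_weight):
    """Unfused reference: split, QK-RMSNorm, partial RoPE, gate copy.

    Hypothetical layout for one head: [q | gate | k], each head_dim wide.
    """
    # 1. Split the fused projection output.
    q = fused[0:head_dim]
    gate = fused[head_dim:2 * head_dim]
    k = fused[2 * head_dim:3 * head_dim]
    # 2. Normalize q and k, 3. rotate the first rot_dim elements of each.
    q = partial_rope(rms_norm(q, q_weight), pos, rot_dim)
    k = partial_rope(rms_norm(k, k_weight), pos, rot_dim)
    # 4. Copy the gate out unchanged.
    return q, k, list(gate)
```

Run as separate ops, each of these steps is its own pass over the data. Doing them in one kernel launch lets each element be read once and written once, which is the motivation stated above.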
Test Plan
Test Result
main
PR
Accuracy
main
PR
GPQA
main: 0.8485
PR: 0.8485