[CPU][RISC-V] Add RVV micro GEMM for WNA16#44324

Merged
bigPYJ1151 merged 6 commits into
vllm-project:mainfrom
wcynb1023:riscv-cpu_wna16_rvv
Jun 22, 2026
@wcynb1023 wcynb1023 commented Jun 2, 2026

Contributor

This PR adds an RVV-specific micro GEMM implementation for CPU WNA16 on
RISC-V and wires the W4A16 GPTQ CPU path to use it.

Purpose

  1. Add an RVV-specific micro GEMM kernel for WNA16

    The existing VEC path already uses the generic vector abstraction on
    RISC-V, but it still follows the generic FP32Vec16 tile shape. For the
    WNA16 micro-kernel this creates higher register pressure on current RVV
    targets.

    This PR adds MicroGemm<ISA::RVV, scalar_t> with an RVV-specific inner
    kernel. The new kernel uses an internal Mx8 tile, keeps the external N=32
    packed weight layout compatible, uses scalar-vector FMA for the activation
    broadcast pattern, and unrolls K by 4.

  2. Wire W4A16 GPTQ to the RVV GEMM backend

    The W4A16 GPTQ CPU path now dispatches to
    MicroGemm<ISA::RVV, scalar_t> when isa_hint == "rvv". The dequantization
    path remains shared with the existing WNA16 implementation; only the micro
    GEMM backend changes after the packed B buffer is prepared.

    The Python W4A16 kernel dispatch was also verified to pass isa_hint="rvv"
    into ops.cpu_gemm_wna16 on RISC-V.
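
The tile structure described in item 1 can be sketched in pure Python. This is only an illustration of the loop nest (Mx8 internal tile inside an N=32 block, scalar-times-vector FMA, K stepped in groups of 4), not the PR's C++ kernel. The function names are hypothetical, and B is assumed to be a plain row-major K x 32 block, which may not match the real packed layout.

```python
# Pure-Python sketch of an Mx8 register tile with K unrolled by 4.
# Assumption: B is a plain row-major K x 32 block; the real packed layout may differ.

N_BLOCK = 32   # external packed-weight width kept by the PR
N_TILE = 8     # internal tile width used by the RVV kernel
K_UNROLL = 4   # K unroll factor stated in the PR


def micro_gemm_mx8(a, b, m, k):
    """Compute C[m][32] = A[m][k] @ B[k][32] one Mx8 tile at a time."""
    assert k % K_UNROLL == 0, "sketch omits the K remainder loop"
    c = [[0.0] * N_BLOCK for _ in range(m)]
    for n0 in range(0, N_BLOCK, N_TILE):
        # One 8-lane accumulator "vector" per row of the tile.
        acc = [[0.0] * N_TILE for _ in range(m)]
        for k0 in range(0, k, K_UNROLL):
            for kk in range(k0, k0 + K_UNROLL):  # written out 4x in a real kernel
                b_vec = b[kk][n0:n0 + N_TILE]    # one vector load of B
                for i in range(m):
                    a_scalar = a[i][kk]          # activation used as a scalar operand
                    row = acc[i]
                    for j in range(N_TILE):      # scalar-vector FMA over 8 lanes
                        row[j] += a_scalar * b_vec[j]
        for i in range(m):
            c[i][n0:n0 + N_TILE] = acc[i]
    return c


def reference_gemm(a, b, m, k):
    """Naive triple loop used only to check the tiled sketch."""
    return [[sum(a[i][kk] * b[kk][j] for kk in range(k)) for j in range(N_BLOCK)]
            for i in range(m)]
```

Because each activation enters the inner loop as a scalar, no broadcast vector has to be materialized per row, which is the register-pressure argument made above.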

Follow-up

The current RVV micro GEMM tile shape is tuned for the tested VLEN=128 target.
A future optimization can select different tile sizes according to the target
VLEN.
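
As a rough illustration of why tile width depends on VLEN, the RVV rule VLMAX = VLEN * LMUL / SEW gives the number of elements one register group holds. The sketch below assumes FP32 accumulators (SEW=32); the PR does not state how the 8-wide tile maps onto registers or which LMUL it uses, so the helper names and the mapping are hypothetical.

```python
def lanes_per_group(vlen_bits, sew_bits=32, lmul=1):
    """VLMAX for an RVV register group: VLEN * LMUL / SEW."""
    return vlen_bits * lmul // sew_bits


def registers_for_tile(n_tile, vlen_bits, sew_bits=32):
    """LMUL=1 registers needed to hold n_tile accumulators (assumed FP32)."""
    lanes = lanes_per_group(vlen_bits, sew_bits)
    return -(-n_tile // lanes)  # ceiling division


# VLEN=128 holds 4 FP32 lanes per register, so an 8-wide row needs 2 registers.
print(registers_for_tile(8, 128))
# With VLEN=256 the same row would fit in a single register.
print(registers_for_tile(8, 256))
```

Under these assumptions, a wider VLEN leaves registers free for a wider or taller tile, which is the kind of selection the follow-up has in mind.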

Test Plan

1. RVV GEMM test

The synthetic benchmark directly calls the public CPU WNA16 op twice with the
same GPTQ WNA16 inputs and uses torch.profiler to measure the
cpu_wna16::gemm event:

  • ops.cpu_gemm_wna16(..., isa_hint="vec")
  • ops.cpu_gemm_wna16(..., isa_hint="rvv")

The outputs are compared before timing the GEMM profiler event. All tested
GPTQ WNA16 shapes have max|rvv-vec| = 0.
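
A minimal sketch of the output comparison step, in pure Python. Here `run_gemm` is a hypothetical wrapper that closes over the shared inputs and forwards `isa_hint` to `ops.cpu_gemm_wna16`; the benchmark script itself is not part of the PR description, so this only illustrates the max|rvv-vec| check on nested lists.

```python
def max_abs_diff(x, y):
    """Largest element-wise |x - y| over two equally shaped 2-D lists."""
    assert len(x) == len(y)
    return max(
        abs(xv - yv)
        for xr, yr in zip(x, y)
        for xv, yv in zip(xr, yr)
    )


def check_backends(run_gemm, hints=("vec", "rvv")):
    """Run the same inputs through each backend and compare to the first one."""
    outputs = {hint: run_gemm(isa_hint=hint) for hint in hints}
    baseline = outputs[hints[0]]
    return {hint: max_abs_diff(out, baseline) for hint, out in outputs.items()}
```

A result of 0 for the "rvv" entry corresponds to the bit-exact agreement reported for all tested shapes.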

2. WNA16 dispatch test

The WNA16 model smoke tests were also run on RISC-V. The following
WNA16-related model cases passed:

OMP_NUM_THREADS=60 python -m pytest \
  tests/quantization/test_cpu_wna16.py \
  -vv -s --tb=short \
  -k "AWQ or GPTQ or w4a16 or int4"

Test Result:

1. RVV GEMM test

Measured with GPTQ/AWQ WNA16, BF16 activation/output, no zero-points. The table
reports the cpu_wna16::gemm profiler event only. The benchmark uses
torch.profiler around the public ops.cpu_gemm_wna16 call and extracts the
GEMM event to compare the VEC and RVV micro GEMM backends while keeping the
same dequantization path and packed B-buffer layout.

K     N     M     VEC GEMM time  RVV GEMM time  RVV/VEC GEMM speedup  max|rvv-vec|
512   512   1     0.069 ms       0.026 ms       2.62x                 0
512   512   2     0.084 ms       0.032 ms       2.60x                 0
512   512   4     0.121 ms       0.050 ms       2.42x                 0
512   512   8     0.240 ms       0.086 ms       2.78x                 0
512   512   16    0.449 ms       0.169 ms       2.65x                 0
512   512   32    0.879 ms       0.324 ms       2.71x                 0
512   512   64    1.739 ms       0.644 ms       2.70x                 0
512   512   96    2.595 ms       0.987 ms       2.63x                 0
512   512   128   3.490 ms       1.297 ms       2.69x                 0
512   512   192   5.290 ms       2.300 ms       2.30x                 0
512   512   256   6.910 ms       2.725 ms       2.54x                 0
512   512   512   14.141 ms      5.849 ms       2.42x                 0
2048  2048  1     1.344 ms       0.418 ms       3.22x                 0
2048  2048  2     1.657 ms       0.515 ms       3.22x                 0
2048  2048  4     2.411 ms       0.870 ms       2.77x                 0
2048  2048  8     4.432 ms       1.613 ms       2.75x                 0
2048  2048  16    8.999 ms       3.170 ms       2.84x                 0
2048  2048  32    17.997 ms      6.423 ms       2.80x                 0
2048  2048  64    35.880 ms      12.552 ms      2.86x                 0
2048  2048  96    53.947 ms      18.789 ms      2.87x                 0
2048  2048  128   71.864 ms      25.235 ms      2.85x                 0
2048  2048  192   108.073 ms     37.706 ms      2.87x                 0
2048  2048  256   144.714 ms     52.212 ms      2.77x                 0
2048  2048  512   290.626 ms     109.825 ms     2.65x                 0
2048  2048  1024  580.830 ms     220.166 ms     2.64x                 0
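
The speedup column is the VEC time divided by the RVV time. A small sketch that recomputes it for three rows copied from the table is given below. Because the published times are rounded to three decimals, a recomputed ratio can differ slightly from the reported value, most visibly on the sub-0.1 ms rows.

```python
# (K, N, M, vec_ms, rvv_ms, reported_speedup) copied from three rows of the table.
ROWS = [
    (512, 512, 1, 0.069, 0.026, 2.62),
    (2048, 2048, 1, 1.344, 0.418, 3.22),
    (2048, 2048, 1024, 580.830, 220.166, 2.64),
]


def speedup(vec_ms, rvv_ms):
    """RVV/VEC GEMM speedup as used in the table: VEC time over RVV time."""
    return vec_ms / rvv_ms


for k, n, m, vec_ms, rvv_ms, reported in ROWS:
    ratio = speedup(vec_ms, rvv_ms)
    print(f"K={k} N={n} M={m}: recomputed {ratio:.2f}x, reported {reported:.2f}x")
```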

2. tests/quantization/test_cpu_wna16.py

  • TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ
  • TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ
  • Qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4
  • RedHatAI/Qwen3-1.7B-quantized.w4a16
  • OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc
5 passed

The remaining full-file failures were from FP8/MXFP4 model cases, which are
outside the W4A16 GPTQ RVV scope of this PR.

wcynb1023 added 4 commits June 1, 2026 16:08
Add RVV as a new ISA type and wire it through the WNA16 GEMM dispatch. RVV reuses the VEC micro-kernel implementation and follows the same dequantize path (has_zp / use_desc_act) as x86 AMX/VEC. On the Python side, detect RISC-V architecture and return "rvv" as the ISA hint, skipping AMX detection.



Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>

Co-authored-by: wcy2003 <233313160abc@gmail.com>
Signed-off-by: wcy <233313160abc@gmail.com>
Add a dedicated RVV MicroGemm implementation for WNA16. The RVV path uses an Mx8 micro-kernel with K unrolled by 4 and scalar-vector RVV FMA, while keeping the existing packed B layout and MicroGemm interface.



Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>

Co-authored-by: wcy2003 <233313160abc@gmail.com>
Signed-off-by: wcy <233313160abc@gmail.com>
Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>

Co-authored-by: wcy2003 <233313160abc@gmail.com>
Signed-off-by: wcy <233313160abc@gmail.com>
Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>

Co-authored-by: wcy2003 <233313160abc@gmail.com>
Signed-off-by: wcy <233313160abc@gmail.com>
@wcynb1023 wcynb1023 requested a review from bigPYJ1151 as a code owner June 2, 2026 13:50
@mergify mergify Bot added the cpu Related to CPU backends label Jun 2, 2026
@wcynb1023

Contributor Author

Hi @bigPYJ1151, gentle ping when you have a chance to review this PR.

This adds an RVV-specific MicroGemm<ISA::RVV, scalar_t> backend for CPU WNA16 on RISC-V and wires the W4A16 GPTQ CPU path to dispatch to it via isa_hint="rvv". The PR keeps the existing dequantization path and packed B-buffer layout unchanged, so the scope is limited to the micro GEMM backend after packing.

I also added test/benchmark results in the PR description: the RVV path matches the existing VEC path with max|rvv-vec| = 0, shows roughly 2.4x-3.2x GEMM speedup on the tested VLEN=128 target, and the WNA16 model smoke tests passed for the listed AWQ/GPTQ/int4 cases.

Thanks!

@wcynb1023

Contributor Author

Hi, @bigPYJ1151
Could you please take a look at this PR when you have a moment? I’d really appreciate your help.

@bigPYJ1151 bigPYJ1151 enabled auto-merge (squash) June 22, 2026 10:44
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 22, 2026
@bigPYJ1151 bigPYJ1151 merged commit d2c671c into vllm-project:main Jun 22, 2026
77 of 78 checks passed
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
Signed-off-by: wcy <233313160abc@gmail.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
Signed-off-by: wcy <233313160abc@gmail.com>
Co-authored-by: Li, Jiang <jiang1.li@intel.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>