[CPU][RISC-V] Add RVV micro GEMM for WNA16 by wcynb1023 · Pull Request #44324 · vllm-project/vllm

wcynb1023 · 2026-06-02T13:50:28Z

This PR adds an RVV-specific micro GEMM implementation for CPU WNA16 on
RISC-V and wires the W4A16 GPTQ CPU path to use it.

Purpose

Add an RVV-specific micro GEMM kernel for WNA16

The existing VEC path already uses the generic vector abstraction on
RISC-V, but it still follows the generic FP32Vec16 tile shape. For the
WNA16 micro-kernel this creates higher register pressure on current RVV
targets.
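A rough way to see the register pressure is to count the vector registers that one accumulator row occupies. The sketch below is a back-of-the-envelope estimate, not code from this PR: it assumes FP32 accumulators, and `regs_per_row`, `max_rows`, and the `reserved` count are hypothetical names and values chosen for illustration.

```python
# Back-of-the-envelope register budget for an accumulator tile on RVV.
# Assumptions (not taken from the PR): FP32 accumulators, and a handful of
# registers reserved for loading B and for temporaries.

RVV_VECTOR_REGS = 32  # RVV defines 32 architectural vector registers (v0-v31)
FP32_BITS = 32


def regs_per_row(tile_n: int, vlen_bits: int) -> int:
    """Vector registers needed to hold one row of `tile_n` FP32 accumulators."""
    lanes = vlen_bits // FP32_BITS
    return -(-tile_n // lanes)  # ceiling division


def max_rows(tile_n: int, vlen_bits: int, reserved: int = 4) -> int:
    """Accumulator rows that fit once `reserved` registers are set aside."""
    return (RVV_VECTOR_REGS - reserved) // regs_per_row(tile_n, vlen_bits)


for tile_n in (16, 8):
    print(tile_n, regs_per_row(tile_n, 128), max_rows(tile_n, 128))
```

Under these assumptions a 16-wide row occupies four registers at VLEN=128 while an 8-wide row occupies two, so the narrower tile leaves room for more accumulator rows.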

This PR adds MicroGemm<ISA::RVV, scalar_t> with an RVV-specific inner
kernel. The new kernel uses an internal Mx8 tile, keeps the external N=32
packed weight layout compatible, uses scalar-vector FMA for the activation
broadcast pattern, and unrolls K by 4.
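The loop structure described above can be modeled in plain Python. This is a scalar reference sketch, not the RVV intrinsics kernel from the PR: each 8-element list stands in for one row of vector accumulators, and `fma_scalar_vector` and `micro_gemm_mx8` are hypothetical names.

```python
# Scalar reference model of an Mx8 micro-kernel: C[M, 8] += A[M, K] @ B[K, 8].
# Each inner list of 8 floats stands in for one row of vector accumulators.

TILE_N = 8
K_UNROLL = 4


def fma_scalar_vector(acc, a_scalar, b_row):
    """Model of a scalar-vector FMA: acc[j] += a_scalar * b_row[j]."""
    return [c + a_scalar * b for c, b in zip(acc, b_row)]


def micro_gemm_mx8(a, b):
    """a: M rows of K floats; b: K rows of TILE_N floats. Returns M x TILE_N."""
    m, k = len(a), len(b)
    acc = [[0.0] * TILE_N for _ in range(m)]
    k_main = k - k % K_UNROLL
    for k0 in range(0, k_main, K_UNROLL):
        # Unrolled body: four B rows are consumed per iteration, and each
        # activation a[i][k] is broadcast across the 8 columns.
        for i in range(m):
            row = acc[i]
            row = fma_scalar_vector(row, a[i][k0], b[k0])
            row = fma_scalar_vector(row, a[i][k0 + 1], b[k0 + 1])
            row = fma_scalar_vector(row, a[i][k0 + 2], b[k0 + 2])
            row = fma_scalar_vector(row, a[i][k0 + 3], b[k0 + 3])
            acc[i] = row
    for kk in range(k_main, k):  # remainder when K is not a multiple of 4
        for i in range(m):
            acc[i] = fma_scalar_vector(acc[i], a[i][kk], b[kk])
    return acc
```

In this model a 32-column packed block would be covered by four such 8-wide tiles; how the actual kernel walks the packed B buffer is not shown here.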

Wire W4A16 GPTQ to the RVV GEMM backend

The W4A16 GPTQ CPU path now dispatches to
MicroGemm<ISA::RVV, scalar_t> when isa_hint == "rvv". The dequantization
path remains shared with the existing WNA16 implementation; only the micro
GEMM backend changes after the packed B buffer is prepared.

The Python W4A16 kernel dispatch was also verified to pass isa_hint="rvv"
into ops.cpu_gemm_wna16 on RISC-V.
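The dispatch described above can be sketched as a two-step lookup: pick a hint from the host architecture, then map the hint to a backend. This is an illustrative model only. `pick_isa_hint`, `select_backend`, the `"amx"` hint string, and the `ISA::AMX` / `ISA::VEC` names are assumptions; only `"vec"`, `"rvv"`, and `MicroGemm<ISA::RVV, scalar_t>` appear in this PR.

```python
import platform


def pick_isa_hint(machine: str, has_amx: bool = False) -> str:
    """Hypothetical hint selection; the real vLLM logic may differ."""
    if machine.lower().startswith("riscv"):
        return "rvv"  # RISC-V: skip AMX detection entirely
    return "amx" if has_amx else "vec"


# Hypothetical stand-ins for the C++ MicroGemm<ISA::..., scalar_t> backends.
BACKENDS = {
    "amx": "MicroGemm<ISA::AMX, scalar_t>",
    "vec": "MicroGemm<ISA::VEC, scalar_t>",
    "rvv": "MicroGemm<ISA::RVV, scalar_t>",
}


def select_backend(isa_hint: str) -> str:
    """Map an ISA hint to a backend name, rejecting unknown hints."""
    if isa_hint not in BACKENDS:
        raise ValueError(f"unsupported isa_hint: {isa_hint!r}")
    return BACKENDS[isa_hint]


# platform.machine() reports e.g. "riscv64" on 64-bit RISC-V Linux.
print(select_backend(pick_isa_hint(platform.machine())))
```

Keeping the hint as a plain string mirrors how the tests in this PR switch between `isa_hint="vec"` and `isa_hint="rvv"` on the same inputs.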

Follow-up

The current RVV micro GEMM tile shape is tuned for the tested VLEN=128 target.
A future optimization can select different tile sizes according to the target
VLEN.

Test Plan

1. RVV GEMM test

The synthetic benchmark directly calls the public CPU WNA16 op twice with the
same GPTQ WNA16 inputs and uses torch.profiler to measure the
cpu_wna16::gemm event:

ops.cpu_gemm_wna16(..., isa_hint="vec")
ops.cpu_gemm_wna16(..., isa_hint="rvv")

The outputs are compared before timing the GEMM profiler event. All tested
GPTQ WNA16 shapes have max|rvv-vec| = 0.
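A minimal sketch of the two measurements, assuming PyTorch is installed. `max_abs_diff` and `profile_event_ms` are hypothetical helpers, not the benchmark script used for this PR; the op call is passed in as a callable so that no argument list has to be guessed.

```python
def max_abs_diff(out_a, out_b):
    """max|a - b| over two equally sized flat sequences of numbers."""
    return max((abs(x - y) for x, y in zip(out_a, out_b)), default=0.0)


def profile_event_ms(fn, event_name="cpu_wna16::gemm"):
    """Run fn() under torch.profiler and return one event's total CPU time in ms.

    PyTorch is imported lazily so that max_abs_diff works without it.
    """
    from torch.profiler import ProfilerActivity, profile

    with profile(activities=[ProfilerActivity.CPU]) as prof:
        fn()
    for evt in prof.key_averages():
        if evt.key == event_name:
            return evt.cpu_time_total / 1000.0  # profiler reports microseconds
    raise LookupError(f"event {event_name!r} not found in profile")
```

A call would look like `profile_event_ms(lambda: ops.cpu_gemm_wna16(..., isa_hint="rvv"))`, with the arguments elided as in the PR. For tensors, `(out_rvv - out_vec).abs().max()` gives the same quantity as `max_abs_diff`.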

2. WNA16 dispatch test

The WNA16 model smoke tests were also run on RISC-V. The following
WNA16-related model cases passed:

OMP_NUM_THREADS=60 python -m pytest \
  tests/quantization/test_cpu_wna16.py \
  -vv -s --tb=short \
  -k "AWQ or GPTQ or w4a16 or int4"

Test Result

1. RVV GEMM test

Measured with GPTQ/AWQ WNA16, BF16 activation/output, no zero-points. The table
reports the cpu_wna16::gemm profiler event only. The benchmark uses
torch.profiler around the public ops.cpu_gemm_wna16 call and extracts the
GEMM event to compare the VEC and RVV micro GEMM backends while keeping the
same dequantization path and packed B-buffer layout.

K	N	M	VEC GEMM time	RVV GEMM time	RVV/VEC GEMM speedup
512	512	1	0.069 ms	0.026 ms	2.62x
512	512	2	0.084 ms	0.032 ms	2.60x
512	512	4	0.121 ms	0.050 ms	2.42x
512	512	8	0.240 ms	0.086 ms	2.78x
512	512	16	0.449 ms	0.169 ms	2.65x
512	512	32	0.879 ms	0.324 ms	2.71x
512	512	64	1.739 ms	0.644 ms	2.70x
512	512	96	2.595 ms	0.987 ms	2.63x
512	512	128	3.490 ms	1.297 ms	2.69x
512	512	192	5.290 ms	2.300 ms	2.30x
512	512	256	6.910 ms	2.725 ms	2.54x
512	512	512	14.141 ms	5.849 ms	2.42x
2048	2048	1	1.344 ms	0.418 ms	3.22x
2048	2048	2	1.657 ms	0.515 ms	3.22x
2048	2048	4	2.411 ms	0.870 ms	2.77x
2048	2048	8	4.432 ms	1.613 ms	2.75x
2048	2048	16	8.999 ms	3.170 ms	2.84x
2048	2048	32	17.997 ms	6.423 ms	2.80x
2048	2048	64	35.880 ms	12.552 ms	2.86x
2048	2048	96	53.947 ms	18.789 ms	2.87x
2048	2048	128	71.864 ms	25.235 ms	2.85x
2048	2048	192	108.073 ms	37.706 ms	2.87x
2048	2048	256	144.714 ms	52.212 ms	2.77x
2048	2048	512	290.626 ms	109.825 ms	2.65x
2048	2048	1024	580.830 ms	220.166 ms	2.64x
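The speedup column is the VEC time divided by the RVV time. The sketch below recomputes it for four rows copied from the table; because the table times are rounded to three decimals, a recomputed ratio can differ from the listed value in the last digit. `ROWS` and `speedup` are illustrative names.

```python
# (K, N, M, VEC ms, RVV ms) for four rows copied from the table above.
ROWS = [
    (512, 512, 1, 0.069, 0.026),
    (512, 512, 192, 5.290, 2.300),
    (2048, 2048, 1, 1.344, 0.418),
    (2048, 2048, 1024, 580.830, 220.166),
]


def speedup(vec_ms: float, rvv_ms: float) -> float:
    """RVV/VEC GEMM speedup: how many times faster the RVV backend is."""
    return vec_ms / rvv_ms


speedups = [speedup(vec_ms, rvv_ms) for _, _, _, vec_ms, rvv_ms in ROWS]
for (k, n, m, _, _), s in zip(ROWS, speedups):
    print(f"K={k} N={n} M={m}: {s:.2f}x")
print(f"min {min(speedups):.2f}x, max {max(speedups):.2f}x")
```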

2. tests/quantization/test_cpu_wna16.py

TheBloke/TinyLlama-1.1B-Chat-v1.0-AWQ
TheBloke/TinyLlama-1.1B-Chat-v1.0-GPTQ
Qwen/Qwen1.5-0.5B-Chat-GPTQ-Int4
RedHatAI/Qwen3-1.7B-quantized.w4a16
OPEA/Qwen2.5-0.5B-Instruct-int4-sym-inc

5 passed

The remaining full-file failures were from FP8/MXFP4 model cases, which are
outside the W4A16 GPTQ RVV scope of this PR.

Add RVV as a new ISA type and wire it through the WNA16 GEMM dispatch. RVV reuses the VEC micro-kernel implementation and follows the same dequantize path (has_zp / use_desc_act) as x86 AMX/VEC. On the Python side, detect the RISC-V architecture and return "rvv" as the ISA hint, skipping AMX detection.

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
Co-authored-by: wcy2003 <233313160abc@gmail.com>
Signed-off-by: wcy <233313160abc@gmail.com>

Add a dedicated RVV MicroGemm implementation for WNA16. The RVV path uses an Mx8 micro-kernel with K unrolled by 4 and scalar-vector RVV FMA, while keeping the existing packed B layout and MicroGemm interface.

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn>
Co-authored-by: wcy2003 <233313160abc@gmail.com>
Signed-off-by: wcy <233313160abc@gmail.com>

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn> Co-authored-by: wcy2003 <233313160abc@gmail.com> Signed-off-by: wcy <233313160abc@gmail.com>

wcynb1023 · 2026-06-12T12:19:31Z

Hi @bigPYJ1151, gentle ping when you have a chance to review this PR.

This adds an RVV-specific MicroGemm<ISA::RVV, scalar_t> backend for CPU WNA16 on RISC-V and wires the W4A16 GPTQ CPU path to dispatch to it via isa_hint="rvv". The PR keeps the existing dequantization path and packed B-buffer layout unchanged, so the scope is limited to the micro GEMM backend after packing.

I also added test/benchmark results in the PR description: the RVV path matches the existing VEC path with max|rvv-vec| = 0 and shows roughly 2.3x-3.2x GEMM speedup on the tested VLEN=128 target, and the WNA16 model smoke tests passed for the listed AWQ/GPTQ/int4 cases.

Thanks!

wcynb1023 · 2026-06-15T12:05:25Z

Hi, @bigPYJ1151
Could you please take a look at this PR when you have a moment? I’d really appreciate your help.

Signed-off-by: wcy <233313160abc@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com>

Signed-off-by: wcy <233313160abc@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

wcynb1023 added 4 commits June 1, 2026 16:08

Move RVV micro-GEMM include to CPU WNA16 dispatch

6e56c88

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn> Co-authored-by: wcy2003 <233313160abc@gmail.com> Signed-off-by: wcy <233313160abc@gmail.com>

fix format

05c71d7

Co-authored-by: lyd1992 <liuyudong@iscas.ac.cn> Co-authored-by: wcy2003 <233313160abc@gmail.com> Signed-off-by: wcy <233313160abc@gmail.com>

wcynb1023 requested a review from bigPYJ1151 as a code owner June 2, 2026 13:50

mergify Bot added the cpu label ("Related to CPU backends") Jun 2, 2026

Merge branch 'main' into riscv-cpu_wna16_rvv

96f85a2

bigPYJ1151 approved these changes Jun 22, 2026

Merge branch 'main' into riscv-cpu_wna16_rvv

2e137b2

bigPYJ1151 enabled auto-merge (squash) June 22, 2026 10:44

github-actions Bot added the ready label ("ONLY add when PR is ready to merge/full CI is needed") Jun 22, 2026

bigPYJ1151 merged commit d2c671c into vllm-project:main Jun 22, 2026
77 of 78 checks passed

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[CPU][RISC-V] Add RVV micro GEMM for WNA16 (vllm-project#44324)

4d9c238

Signed-off-by: wcy <233313160abc@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com>

qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026

[CPU][RISC-V] Add RVV micro GEMM for WNA16 (vllm-project#44324)

a66c20a

Signed-off-by: wcy <233313160abc@gmail.com> Co-authored-by: Li, Jiang <jiang1.li@intel.com> Signed-off-by: Qiang Li <qiang.li2@amd.com>

[CPU][RISC-V] Add RVV micro GEMM for WNA16 #44324
bigPYJ1151 merged 6 commits into vllm-project:main from wcynb1023:riscv-cpu_wna16_rvv
