[Perf][DSv4/DSv3.2] Add cluster-cooperative topK kernel for low-latency scenarios by LopezCastroRoberto · Pull Request #43008 · vllm-project/vllm

LopezCastroRoberto · 2026-05-18T18:56:24Z

Summary

Adds a cluster-cooperative topK kernel for low-latency cases. It uses TMA and DSMEM, so an SM90+ architecture is required. The kernel has been tuned to cover the whole low-latency regime (bs ≤ 32) using cluster-level cooperation via cluster.sync() and a distributed SMEM histogram reduction, which removes the complexity of the persistent scheduler and the multi-CTA spin-barrier coordination in persistent_topK v1 (PR #37421).

This is also expected to reduce GPU contention when other streams run in parallel: for some configs, the persistent scheduler in topK v1 occupied all the GPU resources, starving concurrent work on other streams. The new version also avoids the headroom pre-allocation that was needed to prevent the persistent kernel from deadlocking under occupancy pressure.

Approach

Features added by this PR for bs ≤ 32:

Twopass streaming fallback — double-buffered TMA streaming. This enables correct low-latency topK for arbitrarily long sequences (verified up to 1M).
Cluster size tuning for short and medium batch sizes - CS=8 for bs≤8 (maximize per-row parallelism), CS=4 for bs 9–32 (balance row parallelism with SM utilization across rows)
histogram_4096_topk<12> - evolved from v1's histogram_2048_topk, widened to 4096-bin coarse histogram with warp-ballot tie-breaking for ≤64 ties, eliminating most radix refinement rounds.
redux.sync.add hardware warp reduce, replacing the __shfl_xor_sync butterfly tree with a single PTX instruction for warp-wide reduction.
Non-persistent scheduling — 1 cluster per row, no persistent loop or SM over-reservation. SMs are released as each row completes.
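
As a rough illustration of the coarse-histogram idea in the list above, here is a host-side, single-threaded sketch. It is not the PR's kernel: `ordered_key` and `find_threshold_bin` are hypothetical names, the `MatchBin` field names are borrowed from the struct quoted in the review further down, and NaN inputs are not handled. It assumes the `<12>` template argument means the top 12 bits of an order-preserving key select one of 4096 bins.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Field names follow the MatchBin struct quoted in the review; the logic is illustrative.
struct MatchBin { uint32_t bin, above_count, equal_count; };

// Map a float to a uint32_t whose unsigned order matches the float order.
inline uint32_t ordered_key(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));
  // Negative floats: flip all bits. Non-negative floats: set the sign bit.
  return (bits & 0x80000000u) ? ~bits : (bits | 0x80000000u);
}

// Find the coarse bin that contains the k-th largest score (1 <= k <= scores.size()).
inline MatchBin find_threshold_bin(const std::vector<float>& scores, uint32_t k) {
  assert(k >= 1 && k <= scores.size());
  constexpr uint32_t kBits = 12;          // assumption: 12 key bits -> 4096 bins
  constexpr uint32_t kBins = 1u << kBits;
  std::vector<uint32_t> hist(kBins, 0);
  for (float s : scores) ++hist[ordered_key(s) >> (32 - kBits)];

  // Walk from the highest bin down until the running count reaches k.
  uint32_t above = 0;
  for (uint32_t b = kBins; b-- > 0;) {
    if (above + hist[b] >= k) return {b, above, hist[b]};
    above += hist[b];
  }
  return {0, above, 0};  // unreachable when the precondition holds
}
```

Everything in bins above the returned `bin` belongs to the top K outright. Only the `equal_count` elements inside the threshold bin compete for the remaining `k - above_count` slots, which is where the tie-breaking step described above would apply.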

For bs>32, we inherited FilteredTopK from topK v1 (PR #37421):

Added histogram_4096_topk<12, 8> fast path for sl ≤ 32K (32 floats per thread, 4096-bin histogram)
The large path is unchanged relative to PR #37421 ([Perf][Kernel] Persistent TopK scheduler: unified CUDAGraph-safe kernel with dynamic per-row dispatch - DeepSeek-V3.2 DSA decode).

Architecture of topK v2

  ┌─────────────────┬──────────────────────────┬────────────────────────────────────────────────────────┐
  │      Path       │        Condition         │                       Mechanism                        │
  ├─────────────────┼──────────────────────────┼────────────────────────────────────────────────────────┤
  │ Histogram 4096  │ sl ≤ 16K                 │ Warp-register histogram, no TMA                        │
  ├─────────────────┼──────────────────────────┼────────────────────────────────────────────────────────┤
  │ Fused (CS=4/8)  │ sl/CS ≤ TMA stages × 16K │ All TMA stages resident, single-pass histogram+scatter │
  ├─────────────────┼──────────────────────────┼────────────────────────────────────────────────────────┤
  │ Twopass (CS=4)  │ otherwise                │ TMA double-buffer streaming, two passes                │
  ├─────────────────┼──────────────────────────┼────────────────────────────────────────────────────────┤
  │ FilteredTopK    │ bs > 32                  │ 1 CTA per row, inherited from topK v1                  │
  └─────────────────┴──────────────────────────┴────────────────────────────────────────────────────────┘
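
The table can be read as a small decision function. The sketch below is one possible reading, not the PR's dispatcher: `TopKPath`, `Dispatch`, `select_path`, `tma_stages` and `kStageElems` are hypothetical names, "16K" is assumed to mean 16384 elements, the number of TMA stages is left as a parameter because the table does not state it, and the bs > 32 check is assumed to take priority over the sequence-length checks.

```cpp
#include <cassert>
#include <cstdint>

enum class TopKPath { Histogram4096, Fused, Twopass, FilteredTopK };

// cluster_size == 0 means "not stated in the table" for that path.
struct Dispatch { TopKPath path; uint32_t cluster_size; };

inline Dispatch select_path(uint32_t bs, uint32_t sl, uint32_t tma_stages) {
  assert(bs >= 1 && tma_stages >= 1);
  constexpr uint32_t kStageElems = 16u * 1024u;  // assumption: "16K" = 16384 elements

  if (bs > 32) return {TopKPath::FilteredTopK, 1};            // 1 CTA per row
  if (sl <= kStageElems) return {TopKPath::Histogram4096, 0};  // no TMA

  // CS=8 for bs <= 8, CS=4 for bs 9..32, as listed in the Approach section.
  const uint32_t cs = (bs <= 8) ? 8u : 4u;

  // "sl/CS <= TMA stages x 16K", written without division to avoid rounding.
  const uint64_t resident = static_cast<uint64_t>(cs) * tma_stages * kStageElems;
  if (static_cast<uint64_t>(sl) <= resident) return {TopKPath::Fused, cs};

  return {TopKPath::Twopass, 4};  // the table lists Twopass with CS=4 only
}
```

For example, with a hypothetical `tma_stages = 2`, a bs=4 row of 128K elements stays on the fused path with CS=8, while a bs=16 row of 256K elements falls through to the two-pass path.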

Microbenchmarking - vLLM topK v2 vs. v1 (B300), speedup of v2 over v1 (higher is better)

topK=512:

┌──────┬──────┬──────┬──────┬───────┬───────┬───────┬────────┐
│  sl  │ bs=1 │ bs=4 │ bs=8 │ bs=16 │ bs=32 │ bs=64 │ bs=128 │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 2K   │ 2.07 │ 2.41 │ 2.26 │ 2.33  │ 2.33  │ 2.03  │ 1.97   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 4K   │ 2.10 │ 2.24 │ 2.27 │ 2.24  │ 2.24  │ 2.00  │ 2.26   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 8K   │ 2.12 │ 2.05 │ 1.98 │ 1.93  │ 2.13  │ 2.11  │ 2.09   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 16K  │ 1.61 │ 1.92 │ 1.75 │ 1.71  │ 1.73  │ 1.81  │ 1.81   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 32K  │ 1.71 │ 1.65 │ 1.77 │ 1.75  │ 1.74  │ 1.69  │ 1.62   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 65K  │ 1.82 │ 2.08 │ 2.27 │ 2.66  │ 2.57  │ 1.00  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 128K │ 1.79 │ 1.88 │ 2.03 │ 2.41  │ 2.32  │ 1.00  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 262K │ 1.44 │ 1.44 │ 2.96 │ 1.62  │ 3.04  │ 1.00  │ 1.00   │
└──────┴──────┴──────┴──────┴───────┴───────┴───────┴────────┘

topK=1024:

┌──────┬──────┬──────┬──────┬───────┬───────┬───────┬────────┐
│  sl  │ bs=1 │ bs=4 │ bs=8 │ bs=16 │ bs=32 │ bs=64 │ bs=128 │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 2K   │ 1.48 │ 1.43 │ 2.46 │ 2.38  │ 1.97  │ 1.94  │ 2.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 4K   │ 2.31 │ 2.21 │ 2.24 │ 2.27  │ 2.27  │ 2.29  │ 1.58   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 8K   │ 1.98 │ 1.98 │ 2.13 │ 1.91  │ 1.86  │ 2.04  │ 2.10   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 16K  │ 1.68 │ 1.70 │ 1.69 │ 1.81  │ 1.67  │ 1.81  │ 1.85   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 32K  │ 1.89 │ 1.80 │ 1.80 │ 1.90  │ 1.84  │ 1.69  │ 1.69   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 65K  │ 1.77 │ 2.10 │ 2.26 │ 2.58  │ 2.55  │ 1.01  │ 0.99   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 128K │ 1.73 │ 1.78 │ 1.96 │ 2.34  │ 2.28  │ 1.03  │ 1.01   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 262K │ 1.42 │ 1.42 │ 2.90 │ 1.59  │ 3.00  │ 1.00  │ 1.00   │
└──────┴──────┴──────┴──────┴───────┴───────┴───────┴────────┘

topK=2048:

┌──────┬──────┬──────┬──────┬───────┬───────┬───────┬────────┐
│  sl  │ bs=1 │ bs=4 │ bs=8 │ bs=16 │ bs=32 │ bs=64 │ bs=128 │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 2K   │ 1.36 │ 1.36 │ 1.33 │ 1.33  │ 1.33  │ 1.00  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 4K   │ 1.48 │ 2.00 │ 2.19 │ 2.25  │ 2.19  │ 2.00  │ 1.97   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 8K   │ 2.05 │ 2.02 │ 1.98 │ 1.91  │ 1.94  │ 2.32  │ 2.10   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 16K  │ 1.77 │ 1.86 │ 1.72 │ 1.33  │ 1.35  │ 1.35  │ 1.42   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 32K  │ 1.89 │ 1.77 │ 1.80 │ 1.86  │ 1.78  │ 1.46  │ 1.40   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 65K  │ 1.91 │ 1.87 │ 2.03 │ 2.42  │ 2.36  │ 1.00  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 128K │ 1.56 │ 1.61 │ 1.78 │ 2.14  │ 2.07  │ 1.63  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 262K │ 1.27 │ 1.28 │ 2.64 │ 1.48  │ 2.79  │ 0.74  │ 1.00   │
└──────┴──────┴──────┴──────┴───────┴───────┴───────┴────────┘

E2E results (B300)

vllm serve deepseek-ai/DeepSeek-V4-Flash -tp 4 --kv-cache-dtype fp8

vllm bench serve --model deepseek-ai/DeepSeek-V4-Flash --input-len 512000 --output-len 2048 --num-prompts 8 --max-concurrency 1

MAIN:

============ Serving Benchmark Result ============
Successful requests:                     8         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  279.07    
Total input tokens:                      4096000   
Total generated tokens:                  16384     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         58.71     
Peak output token throughput (tok/s):    126.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          14736.17  
---------------Time to First Token----------------
Mean TTFT (ms):                          18711.25  
Median TTFT (ms):                        19996.18  
P99 TTFT (ms):                           27108.63  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.90      
Median TPOT (ms):                        7.90      
P99 TPOT (ms):                           7.91      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.99      
Median ITL (ms):                         7.94      
P99 ITL (ms):                            8.14      
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     8         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  260.05    
Total input tokens:                      4096000   
Total generated tokens:                  16384     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         63.00     
Peak output token throughput (tok/s):    140.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          15813.94  
---------------Time to First Token----------------
Mean TTFT (ms):                          17825.19  
Median TTFT (ms):                        20064.59  
P99 TTFT (ms):                           20086.67  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.17      
Median TPOT (ms):                        7.17      
P99 TPOT (ms):                           7.24      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.26      
Median ITL (ms):                         7.15      
P99 ITL (ms):                            9.00      
==================================================

~10% TPOT improvement
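
The ~10% figure matches the mean TPOT values above if the gain is computed as MAIN / PR - 1. That definition is an assumption inferred from the numbers, and `tpot_gain_pct` is a hypothetical helper. Taking MAIN as the baseline instead, (MAIN - PR) / MAIN, gives about 9.2% for the same inputs.

```cpp
#include <cassert>
#include <cmath>

// Relative TPOT gain in percent, defined here as main / pr - 1 (assumed definition).
inline double tpot_gain_pct(double main_ms, double pr_ms) {
  assert(main_ms > 0.0 && pr_ms > 0.0);
  return (main_ms / pr_ms - 1.0) * 100.0;
}
```

Calling `tpot_gain_pct(7.90, 7.17)` reproduces the roughly 10% gain from the two benchmark runs above.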

E2E results with concurrency=1 for increasing ISL - Averaged on 3 runs on B300

  ┌───────┬─────────┬───────────┬─────────┬────────┬──────────────────┬────────────────┬───────────────┐
  │ Input │ topK SL │ MAIN TPOT │ PR TPOT │ TPOT Δ │ MAIN total tok/s │ PR total tok/s │ total tok/s Δ │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 32K   │ 8K      │ 7.28      │ 7.03    │ +3.6%  │ 2261             │ 2340           │ +3.5%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 64K   │ 16K     │ 7.29      │ 7.19    │ +1.5%  │ 4338             │ 4406           │ +1.6%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 128K  │ 32K     │ 7.46      │ 7.32    │ +1.9%  │ 7132             │ 7250           │ +1.7%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 256K  │ 64K     │ 7.71      │ 7.32    │ +5.3%  │ 11097            │ 11689          │ +5.3%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 512K  │ 128K    │ 7.90      │ 7.35    │ +7.5%  │ 14943            │ 15533          │ +4.0%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 1M    │ 250K    │ 8.21      │ 8.04    │ +2.1%  │ 14633            │ 14715          │ +0.6%         │
  └───────┴─────────┴───────────┴─────────┴────────┴──────────────────┴────────────────┴───────────────┘

Conclusions: TPOT is nearly flat for ISL from 32K up to 512K (note that the difference is 7.1 ms vs 7.3 ms). For 1M it increases somewhat more than expected, which should probably be studied separately. These conclusions match the microbenchmark results (topK=512): for 32K-128K the topK v2 kernel is 70-80% faster than v1, while for 262K the improvement is ~40%.

UPDATE: Check out the latest follow-up comments on this PR below.

Accuracy

GSM8K

python tests/evals/gsm8k/gsm8k_eval.py

MAIN:

Results:
Accuracy: 0.948
Invalid responses: 0.000
Total latency: 37.024 s
Questions per second: 35.625
Total output tokens: 116954
Output tokens per second: 3158.851

PR:

Results:
Accuracy: 0.948
Invalid responses: 0.000
Total latency: 37.334 s
Questions per second: 35.330
Total output tokens: 117784
Output tokens per second: 3154.878

MRCR 2-needle eval — MAIN vs PR (DeepSeek-V4-Flash, B300, TP=4)


  ┌──────────┬────────────┬──────────┬─────────┐
  │  Bucket  │ MAIN score │ PR score │    Δ    │
  ├──────────┼────────────┼──────────┼─────────┤
  │ 0-8K     │ 0.9885     │ 0.9880   │ -0.0005 │
  ├──────────┼────────────┼──────────┼─────────┤
  │ 16K-32K  │ 0.9209     │ 0.9303   │ +0.0094 │
  ├──────────┼────────────┼──────────┼─────────┤
  │ 32K-64K  │ 0.9194     │ 0.9196   │ +0.0002 │
  ├──────────┼────────────┼──────────┼─────────┤
  │ 64K-128K │ 0.9392     │ 0.9593   │ +0.0201 │
  └──────────┴────────────┴──────────┴─────────┘

Potential TODOs:

Stride alignment -- TMA requires logits.stride(0) % 4 == 0, currently enforced via TORCH_CHECK. This always holds when stride = max_model_len comes from the model config, but a user-supplied --max-model-len that is not a multiple of 4 would crash. Decide whether to keep topK v1 as a fallback, pad in the dispatcher, or just let TORCH_CHECK fire for such --max-model-len configs.
Long-context evals -- Run the MRCR eval at long context to verify e2e correctness beyond kernel-level tests and GSM8K evals. The PR "Implement openai/mrcr long context evaluation benchmark" (EleutherAI/lm-evaluation-harness#3754) is well suited for that.
Cross-platform benchmarks -- Repeat the benchmarks on B200 and H200.
(Optional) Run e2e benchmarks on DSv4-Pro too.
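
For the stride-alignment item, the "pad in the dispatcher" option amounts to rounding the row stride up to the next multiple of 4 elements. The sketch below only shows that arithmetic. `stride_ok_for_tma`, `padded_stride` and `kTmaAlignElems` are hypothetical names, and float32 logits (4 elements = 16 bytes) are assumed.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: float32 logits, so 4 elements correspond to 16 bytes.
constexpr int64_t kTmaAlignElems = 4;

// Mirrors the condition stated above: logits.stride(0) % 4 == 0.
inline bool stride_ok_for_tma(int64_t stride_elems) {
  return stride_elems % kTmaAlignElems == 0;
}

// Smallest stride >= stride_elems that satisfies the condition.
inline int64_t padded_stride(int64_t stride_elems) {
  assert(stride_elems >= 0);
  return (stride_elems + kTmaAlignElems - 1) / kTmaAlignElems * kTmaAlignElems;
}
```

With a hypothetical --max-model-len of 4097, the check fails and the padded stride would be 4100. Whether padding, a v1 fallback, or a hard error is preferable is the open question in the TODO.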

gemini-code-assist

Code Review

This pull request introduces a cluster-persistent TopK implementation for SM90+ architectures, leveraging TMA and DSMEM for improved performance at K=1024. The review feedback identifies critical correctness issues related to potential out-of-bounds memory accesses during TMA load operations due to incorrect size calculations. It also suggests using unsigned integers for counters to improve type safety and consistency.

gemini-code-assist · 2026-05-18T19:01:31Z

+    for (uint32_t i = 0; i < kNumStages8; i++) {
+      if (i >= num_iters) break;
+      const auto off = i * kSizePerStage;
+      const auto sz = min(kSizePerStage, len_aligned - off) * sizeof(float);


The tma_load operation uses len_aligned - off to determine the size of the data to load. len_aligned is the my_len value rounded up to the nearest multiple of 4. If my_len is not a multiple of 4, len_aligned will be greater than my_len. This can lead to tma_load attempting to read beyond the actual data available (my_len), resulting in an out-of-bounds memory access. The size should be capped by the actual my_len.

For example, if my_len = 5 and kAlign = 4, then len_aligned = 8. If off = 0, sz would be calculated based on 8 - 0 = 8 elements, but only 5 are valid. This is a critical correctness issue.

const auto sz = min(kSizePerStage, my_len - off) * sizeof(float);
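
To make the reviewer's my_len = 5 example concrete, the sketch below computes the requested byte count under both variants. `align_up` and `stage_bytes` are hypothetical helpers and the stage size used in the example is arbitrary. As far as I know, bulk TMA copies expect sizes in multiples of 16 bytes, which may be why the original code rounds up. Whether the rounded-up read is actually safe depends on how the row is padded, which the quoted excerpt does not show.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Round n up to the next multiple of a (a > 0).
inline uint32_t align_up(uint32_t n, uint32_t a) { return (n + a - 1) / a * a; }

// Bytes requested for the stage starting at element offset `off`,
// following the expression in the quoted diff.
inline uint32_t stage_bytes(uint32_t len, uint32_t off, uint32_t size_per_stage) {
  assert(off < len);
  return std::min(size_per_stage, len - off) * static_cast<uint32_t>(sizeof(float));
}
```

With my_len = 5 and an alignment of 4, the aligned variant requests 32 bytes (8 floats), while the variant clamped to my_len requests 20 bytes, which is not a multiple of 16.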

gemini-code-assist · 2026-05-18T19:01:31Z

+    for (uint32_t i = 0; i < kNumStages4; i++) {
+      if (i >= ni) break;
+      const auto o = i * kSizePerStage;
+      const auto sz = min(kSizePerStage, la - o) * sizeof(float);


Similar to the issue on line 505, the tma_load in stream_pass4 uses la - o to determine the size. la is the length value rounded up to the nearest multiple of 4. This can cause an out-of-bounds read if length is not a multiple of 4, as tma_load might attempt to read beyond the actual length of the data. The size should be capped by the actual length.

This is a critical correctness issue.

const auto sz = min(kSizePerStage, length - o) * sizeof(float);

gemini-code-assist · 2026-05-18T19:01:31Z

+  const auto u = (sl + kA-1)/kA, b = u/CS, e = u%CS;
+  const auto lu = b + (rank < e ? 1u : 0u);
+  const auto ou = rank * b + min(rank, e);
+  const auto ms = ou * kA, ml = min(ms + lu * kA, sl) - ms;


This line in large_topk_twopass4 appears to be a copy-paste error from large_topk_fused8: it uses len_aligned - off, but len_aligned is defined in large_topk_fused8 and not in this scope. It should use ml - off, which represents the actual length of the current partition. Using an undefined or incorrect variable for the tma_load size is a critical correctness issue.

const auto sz = min(kSizePerStage, ml - off) * sizeof(float);
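
The partitioning arithmetic in the quoted diff (u, b, e, lu, ou, ms, ml) splits a row of sl elements into CS contiguous, alignment-sized slices, one per cluster rank. Below is a host-side restatement under assumed types. `Slice` and `rank_slice` are hypothetical names, and the `lu == 0` guard is an addition for this sketch so that very short rows do not underflow the unsigned subtraction.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct Slice { uint32_t start, len; };

// Slice of a row of `sl` elements owned by `rank` in a cluster of `cs` CTAs,
// partitioned in units of `ka` elements.
inline Slice rank_slice(uint32_t sl, uint32_t rank, uint32_t cs, uint32_t ka) {
  assert(cs > 0 && ka > 0 && rank < cs);
  const uint32_t u = (sl + ka - 1) / ka;          // number of aligned units in the row
  const uint32_t b = u / cs, e = u % cs;          // base units per rank, leftover units
  const uint32_t lu = b + (rank < e ? 1u : 0u);   // the first e ranks take one extra unit
  const uint32_t ou = rank * b + std::min(rank, e);  // unit offset of this rank
  const uint32_t ms = ou * ka;                    // element offset of this rank
  if (lu == 0) return {ms, 0};                    // guard added for this sketch
  const uint32_t ml = std::min(ms + lu * ka, sl) - ms;  // clamp the last slice to sl
  return {ms, ml};
}
```

For sl = 70001, CS = 4 and an alignment of 4, the four slices start at 0, 17504, 35004 and 52504, and the last one is clamped to 17497 elements so the slices cover exactly the row.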

gemini-code-assist · 2026-05-18T19:01:31Z

+
+struct alignas(16) MatchBin { uint32_t bin, above_count, equal_count; };
+struct alignas(8) Tie { uint32_t idx; float score; };
+struct ClusterState { int output_counter; };


The output_counter in ClusterState is declared as an int. While unlikely to overflow with current K and kMaxTies values, it's generally safer and more consistent to use uint32_t for counters that are always non-negative, especially when uint32_t values (la, le) are cast to int before being added atomically. This prevents any potential issues if la or le were to exceed INT_MAX in future modifications.

struct ClusterState { uint32_t output_counter; };

mergify · 2026-05-27T15:49:29Z

Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

mergify · 2026-05-27T16:54:28Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LopezCastroRoberto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

LopezCastroRoberto · 2026-05-28T17:40:51Z

Update

Increasing the cluster size to 16 for bs ≤ 4 appears to improve the long-context case significantly (more per-row parallelism, which becomes critical for long rows).

On Blackwell:

The maximum portable cluster size supported is 8; however, NVIDIA Blackwell B200 GPU allows for a nonportable cluster size of 16 by opting in

src: https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html#thread-block-clusters

Microbenchmarks

  CS=8 (bs≤4) vs v1                    CS=16 (bs≤4) vs v1
  ┌──────┬──────┬──────┐               ┌──────┬──────┬──────┐
  │  sl  │ bs=1 │ bs=4 │               │  sl  │ bs=1 │ bs=4 │
  ├──────┼──────┼──────┤               ├──────┼──────┼──────┤
  │ 2K   │ 1.48 │ 1.43 │               │ 2K   │ 2.31 │ 1.46 │
  │ 4K   │ 2.31 │ 2.21 │               │ 4K   │ 2.45 │ 2.27 │
  │ 8K   │ 1.98 │ 1.98 │               │ 8K   │ 2.35 │ 2.00 │
  │ 16K  │ 1.68 │ 1.70 │               │ 16K  │ 1.77 │ 1.71 │
  │ 32K  │ 1.89 │ 1.80 │               │ 32K  │ 1.79 │ 1.67 │
  │ 65K  │ 1.77 │ 2.10 │               │ 65K  │ 2.12 │ 2.04 │
  │ 128K │ 1.73 │ 1.78 │               │ 128K │ 1.93 │ 1.90 │
  │ 262K │ 1.42 │ 1.42 │               │ 262K │ 1.70 │ 1.66 │
  └──────┴──────┴──────┘               └──────┴──────┴──────┘

Port cooperative cluster top-k kernels and launchers to csrc/libtorch_stable/, gate registration with VLLM_ENABLE_COOPERATIVE_TOPK, and route decode sparse indexer to cooperative_topk when eligible. Co-authored-by: Cursor <cursoragent@cursor.com> Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

eugr · 2026-06-25T18:38:54Z

@LopezCastroRoberto - this PR breaks DeepSeek V4 Flash on DGX Spark (sm120).

It fails during startup CUDA graph memory profiling.

The crash is in the sparse attention indexer path added/changed by this PR:

sparse_attn_indexer -> torch.ops._C.cooperative_topk

Error:

RuntimeError: launch_cooperative_cluster,
cooperative_topk.cu:46,
cooperative_topk launch failed: invalid argument

This reproduces with DeepSeek-V4-Flash, MTP enabled, TP=2, kv-cache fp8, on SM12.1. It also reproduces whether VLLM_USE_BREAKABLE_CUDAGRAPH is auto-enabled or explicitly disabled, so breakable CUDA graphs do not appear to avoid the failing kernel.

The selector currently uses has_device_capability(90), so SM100/SM120 take the cooperative_topk path. If I locally restrict cooperative_topk to exact SM90 and let SM120 fall back to persistent_topk, the model starts successfully, completes CUDA graph profiling/capture, and reaches API server startup.

Could cooperative_topk be guarded to SM90 only, or otherwise validated on SM100/SM120 with a fallback to persistent_topk?
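
The difference between the two gating styles discussed here can be sketched as follows. This is not vLLM's selector: `at_least` and `major_allowed` are hypothetical helpers, and capabilities are assumed to be encoded as major * 10 + minor (90 for SM90, 121 for SM12.1). Which majors belong on an allow-list is not settled by this thread. Note that an exact-SM90 gate would also exclude SM10x-class parts such as the B300 used for the benchmarks above.

```cpp
#include <cassert>
#include <initializer_list>

// "At least" gate: any capability >= min_cap passes (SM12.1 passes a 90 threshold).
inline bool at_least(int cap, int min_cap) { return cap >= min_cap; }

// Allow-list gate: passes only if the major version is explicitly listed.
inline bool major_allowed(int cap, std::initializer_list<int> majors) {
  for (int m : majors) {
    if (cap / 10 == m) return true;
  }
  return false;
}
```

With these helpers, capability 121 passes `at_least(cap, 90)` but fails `major_allowed(cap, {9})`, which corresponds to the local workaround described in this comment.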

@mgoin - FYI.

LopezCastroRoberto marked this pull request as draft May 18, 2026 18:56

gemini-code-assist Bot reviewed May 18, 2026

LopezCastroRoberto changed the title ~~[Perf][DSv4] Add persistent_topK v2 kernel with cluster syncs and TMA~~ May 20, 2026

LopezCastroRoberto changed the title ~~[Perf][DSv4] Add cluster-cooperative topK kernel with DSMEM, TMA and adaptive dispatch~~ May 20, 2026

LopezCastroRoberto marked this pull request as ready for review May 27, 2026 15:38

LopezCastroRoberto requested a review from zyongye as a code owner May 27, 2026 15:38

LopezCastroRoberto changed the title ~~[Perf][DSv4] Add cluster-cooperative topK kernel with DSMEM, TMA~~ May 27, 2026

LopezCastroRoberto changed the title ~~[Perf][DSv4/DSv3.2][WIP] Add cluster-cooperative topK kernel with DSMEM, TMA~~ May 27, 2026

depthfirst-app Bot reviewed May 27, 2026

Comment thread csrc/libtorch_stable/cooperative_topk.cuh Outdated

Comment thread csrc/topk.cu Outdated

zyongye self-assigned this May 27, 2026

zyongye added deepseek Related to DeepSeek models DSv4 labels May 27, 2026

zyongye linked an issue May 27, 2026 that may be closed by this pull request

[Roadmap] DeepSeek V4 #40902

Closed

32 tasks

depthfirst-app Bot reviewed May 27, 2026

Comment thread csrc/cooperative_topk.cuh Outdated

mergify Bot added the needs-rebase label May 27, 2026

LopezCastroRoberto changed the title ~~[Perf][DSv4/DSv3.2][WIP] Add cluster-cooperative topK kernel for low-latency scenarios~~ May 28, 2026

mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 10, 2026

LopezCastroRoberto requested review from Harry-Chen, LucasWilkinson and tlrmchlsmth as code owners June 19, 2026 13:08

mergify Bot added the ci/build label Jun 19, 2026

LopezCastroRoberto requested review from AndreasKaratzas, WoosukKwon, mgoin and yewentao256 as code owners June 19, 2026 13:51

LopezCastroRoberto force-pushed the perf/persistent_topk_v2 branch 2 times, most recently from 6259142 to 313480b Compare June 19, 2026 16:36

mergify Bot removed the needs-rebase label Jun 19, 2026

LopezCastroRoberto and others added 2 commits June 19, 2026 16:40

fix rebase problems

d867d1f

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

LopezCastroRoberto force-pushed the perf/persistent_topk_v2 branch from 313480b to d867d1f Compare June 19, 2026 16:40

LopezCastroRoberto added 2 commits June 19, 2026 16:49

fix pre-commit

b93ca2a

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

fix CI

2ccaeda

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

mgoin reviewed Jun 19, 2026

zyongye and others added 2 commits June 22, 2026 19:16

Merge branch 'main' into perf/persistent_topk_v2

eb23133

fix comments

b317d15

Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

LopezCastroRoberto requested a review from mgoin June 23, 2026 12:00

Merge branch 'main' into perf/persistent_topk_v2

5503bf4

fix pre-commit

147a3b2

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

LopezCastroRoberto requested a review from khluu as a code owner June 23, 2026 12:13

zyongye added this to the v0.24.0 cherrypick milestone Jun 23, 2026

minor

d29c385

Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>

WoosukKwon merged commit 855cd4d into vllm-project:main Jun 23, 2026
194 of 202 checks passed

mmangkad mentioned this pull request Jun 24, 2026

[CI/Build] Fix topk histogram build on SM75 #46550

Merged

khluu pushed a commit that referenced this pull request Jun 24, 2026

[Perf][DSv4/DSv3.2] Add cluster-cooperative topK kernel for low-laten…

2e0c5f5

…cy scenarios (#43008) Signed-off-by: LopezCastroRoberto <rocastro@redhat.com> (cherry picked from commit 855cd4d)

nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026

[Perf][DSv4/DSv3.2] Add cluster-cooperative topK kernel for low-laten…

f40cec9

…cy scenarios (vllm-project#43008) Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>

jasl mentioned this pull request Jun 26, 2026

[New Model][Nvidia] Add SM12x support for DeepSeek V4 Flash with essential fixes #41834

Open

wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026

[Perf][DSv4/DSv3.2] Add cluster-cooperative topK kernel for low-laten…

8b8e948

…cy scenarios (vllm-project#43008) Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>


WoosukKwon merged 9 commits into vllm-project:main from LopezCastroRoberto:perf/persistent_topk_v2
