[Perf][DSv4/DSv3.2] Add cluster-cooperative topK kernel for low-latency scenarios#43008

Merged: WoosukKwon merged 9 commits into vllm-project:main from LopezCastroRoberto:perf/persistent_topk_v2 on Jun 23, 2026

LopezCastroRoberto commented May 18, 2026 (Contributor)

Summary

Adds a cluster-cooperative topK kernel for low-latency cases. It uses TMA and DSMEM, so it requires SM90 or newer. The kernel has been extensively tuned to cover the whole low-latency regime (bs ≤ 32) and cooperates at cluster level via cluster.sync() and a distributed-SMEM histogram reduction. This removes the persistent scheduler and the multi-CTA spin-barrier coordination used by persistent_topK v1 (PR #37421).

It is also expected to reduce GPU contention when other streams run in parallel: for some configs, the persistent scheduler in topK v1 occupied all GPU resources and starved concurrent work on other streams. The new version also avoids the headroom pre-allocation that v1 needed to keep the persistent kernel from deadlocking under occupancy pressure.

Approach

Features added by this PR for bs ≤ 32:

  • Twopass streaming fallback: double-buffered TMA streaming, which enables correct low-latency topK for arbitrarily long sequences (verified up to 1M).
  • Cluster size tuning for small and medium batch sizes: CS=8 for bs≤8 (maximizes per-row parallelism), CS=4 for bs 9–32 (balances per-row parallelism with SM utilization across rows).
  • histogram_4096_topk<12>: evolved from v1's histogram_2048_topk and widened to a 4096-bin coarse histogram with warp-ballot tie-breaking for ≤64 ties, which eliminates most radix refinement rounds.
  • redux.sync.add hardware warp reduce: replaces the __shfl_xor_sync butterfly tree with a single PTX instruction for warp-wide reduction.
  • Non-persistent scheduling: 1 cluster per row, with no persistent loop or SM over-reservation. SMs are released as each row completes.
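
To make the coarse-histogram idea concrete, here is a minimal single-threaded Python sketch of a 12-bit (4096-bin) histogram top-k. It is not the kernel from this PR: the function names are hypothetical, the order-preserving float-to-integer key is a common technique assumed here rather than taken from the source, and ties inside the threshold bin are resolved with a plain sort instead of warp ballots or radix refinement.

```python
import struct


def ordered_key(x: float) -> int:
    """Map a float32 to a uint32 whose unsigned order matches the float order.

    Assumption: a standard bit trick, not confirmed to be what the kernel uses.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # Negative floats: flip every bit. Non-negative floats: set the sign bit.
    return bits ^ 0xFFFFFFFF if bits & 0x80000000 else bits | 0x80000000


def histogram_topk(scores, k, bin_bits=12):
    """Return the indices of the k largest scores (order not guaranteed)."""
    if k > len(scores):
        raise ValueError("k must not exceed the number of scores")
    shift = 32 - bin_bits
    bins = [ordered_key(s) >> shift for s in scores]

    # Coarse histogram over the top `bin_bits` bits of each key.
    hist = [0] * (1 << bin_bits)
    for b in bins:
        hist[b] += 1

    # Walk down from the highest bin until the running count reaches k.
    above = 0
    threshold = (1 << bin_bits) - 1
    while above + hist[threshold] < k:
        above += hist[threshold]
        threshold -= 1

    # Everything strictly above the threshold bin is selected outright.
    selected = [i for i, b in enumerate(bins) if b > threshold]
    # Only the threshold bin needs tie-breaking; a sort stands in for it here.
    ties = [i for i, b in enumerate(bins) if b == threshold]
    ties.sort(key=lambda i: scores[i], reverse=True)
    return selected + ties[: k - above]
```

With a wide histogram the threshold bin tends to hold few elements, which is why most of the refinement work disappears in the common case.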

For bs>32, we inherit FilteredTopK from topK v1 (PR #37421).

Architecture of topK v2

  ┌─────────────────┬──────────────────────────┬────────────────────────────────────────────────────────┐
  │      Path       │        Condition         │                       Mechanism                        │
  ├─────────────────┼──────────────────────────┼────────────────────────────────────────────────────────┤
  │ Histogram 4096  │ sl ≤ 16K                 │ Warp-register histogram, no TMA                        │
  ├─────────────────┼──────────────────────────┼────────────────────────────────────────────────────────┤
  │ Fused (CS=4/8)  │ sl/CS ≤ TMA stages × 16K │ All TMA stages resident, single-pass histogram+scatter │
  ├─────────────────┼──────────────────────────┼────────────────────────────────────────────────────────┤
  │ Twopass (CS=4)  │ otherwise                │ TMA double-buffer streaming, two passes                │
  ├─────────────────┼──────────────────────────┼────────────────────────────────────────────────────────┤
  │ FilteredTopK    │ bs > 32                  │ 1 CTA per row, inherited from topK v1                  │
  └─────────────────┴──────────────────────────┴────────────────────────────────────────────────────────┘
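
The table can be read as a dispatch function. The Python sketch below is a literal reading only: the function names are hypothetical, `tma_stages` is a placeholder because the number of TMA stages is not stated here, "16K" is assumed to mean 16384 elements, and the bs > 32 check is placed first as an assumption. The real precedence in the dispatcher may differ, and the bs=64/128 microbenchmark speedups at short sl suggest that it does.

```python
K16 = 16 * 1024  # "16K" in the table, assumed to mean 16384 elements


def cluster_size(bs: int) -> int:
    """CS=8 for bs <= 8, CS=4 for bs 9-32, as listed under Approach."""
    return 8 if bs <= 8 else 4


def select_path(bs: int, sl: int, tma_stages: int) -> str:
    """Pick a path name from the architecture table (literal reading)."""
    if bs > 32:
        return "FilteredTopK"
    if sl <= K16:
        return "Histogram 4096"
    cs = cluster_size(bs)
    per_cta = -(-sl // cs)  # ceil(sl / CS); the rounding mode is an assumption
    if per_cta <= tma_stages * K16:
        return f"Fused (CS={cs})"
    return "Twopass (CS=4)"
```

For example, with a hypothetical `tma_stages=2`, a single 128K row is split eight ways and fits the fused path, while sixteen 262K rows fall through to the two-pass path.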

Microbenchmarks: vLLM topK v2 vs. v1 on B300 (values are the speedup of v2 over v1; 1.00 means parity)

topK=512:

┌──────┬──────┬──────┬──────┬───────┬───────┬───────┬────────┐
│  sl  │ bs=1 │ bs=4 │ bs=8 │ bs=16 │ bs=32 │ bs=64 │ bs=128 │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 2K   │ 2.07 │ 2.41 │ 2.26 │ 2.33  │ 2.33  │ 2.03  │ 1.97   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 4K   │ 2.10 │ 2.24 │ 2.27 │ 2.24  │ 2.24  │ 2.00  │ 2.26   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 8K   │ 2.12 │ 2.05 │ 1.98 │ 1.93  │ 2.13  │ 2.11  │ 2.09   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 16K  │ 1.61 │ 1.92 │ 1.75 │ 1.71  │ 1.73  │ 1.81  │ 1.81   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 32K  │ 1.71 │ 1.65 │ 1.77 │ 1.75  │ 1.74  │ 1.69  │ 1.62   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 65K  │ 1.82 │ 2.08 │ 2.27 │ 2.66  │ 2.57  │ 1.00  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 128K │ 1.79 │ 1.88 │ 2.03 │ 2.41  │ 2.32  │ 1.00  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 262K │ 1.44 │ 1.44 │ 2.96 │ 1.62  │ 3.04  │ 1.00  │ 1.00   │
└──────┴──────┴──────┴──────┴───────┴───────┴───────┴────────┘

topK=1024:

┌──────┬──────┬──────┬──────┬───────┬───────┬───────┬────────┐
│  sl  │ bs=1 │ bs=4 │ bs=8 │ bs=16 │ bs=32 │ bs=64 │ bs=128 │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 2K   │ 1.48 │ 1.43 │ 2.46 │ 2.38  │ 1.97  │ 1.94  │ 2.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 4K   │ 2.31 │ 2.21 │ 2.24 │ 2.27  │ 2.27  │ 2.29  │ 1.58   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 8K   │ 1.98 │ 1.98 │ 2.13 │ 1.91  │ 1.86  │ 2.04  │ 2.10   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 16K  │ 1.68 │ 1.70 │ 1.69 │ 1.81  │ 1.67  │ 1.81  │ 1.85   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 32K  │ 1.89 │ 1.80 │ 1.80 │ 1.90  │ 1.84  │ 1.69  │ 1.69   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 65K  │ 1.77 │ 2.10 │ 2.26 │ 2.58  │ 2.55  │ 1.01  │ 0.99   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 128K │ 1.73 │ 1.78 │ 1.96 │ 2.34  │ 2.28  │ 1.03  │ 1.01   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 262K │ 1.42 │ 1.42 │ 2.90 │ 1.59  │ 3.00  │ 1.00  │ 1.00   │
└──────┴──────┴──────┴──────┴───────┴───────┴───────┴────────┘

topK=2048:

┌──────┬──────┬──────┬──────┬───────┬───────┬───────┬────────┐
│  sl  │ bs=1 │ bs=4 │ bs=8 │ bs=16 │ bs=32 │ bs=64 │ bs=128 │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 2K   │ 1.36 │ 1.36 │ 1.33 │ 1.33  │ 1.33  │ 1.00  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 4K   │ 1.48 │ 2.00 │ 2.19 │ 2.25  │ 2.19  │ 2.00  │ 1.97   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 8K   │ 2.05 │ 2.02 │ 1.98 │ 1.91  │ 1.94  │ 2.32  │ 2.10   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 16K  │ 1.77 │ 1.86 │ 1.72 │ 1.33  │ 1.35  │ 1.35  │ 1.42   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 32K  │ 1.89 │ 1.77 │ 1.80 │ 1.86  │ 1.78  │ 1.46  │ 1.40   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 65K  │ 1.91 │ 1.87 │ 2.03 │ 2.42  │ 2.36  │ 1.00  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 128K │ 1.56 │ 1.61 │ 1.78 │ 2.14  │ 2.07  │ 1.63  │ 1.00   │
├──────┼──────┼──────┼──────┼───────┼───────┼───────┼────────┤
│ 262K │ 1.27 │ 1.28 │ 2.64 │ 1.48  │ 2.79  │ 0.74  │ 1.00   │
└──────┴──────┴──────┴──────┴───────┴───────┴───────┴────────┘

E2E results (B300)

vllm serve deepseek-ai/DeepSeek-V4-Flash -tp 4 --kv-cache-dtype fp8

vllm bench serve --model deepseek-ai/DeepSeek-V4-Flash --input-len 512000 --output-len 2048 --num-prompts 8 --max-concurrency 1

MAIN:

============ Serving Benchmark Result ============
Successful requests:                     8         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  279.07    
Total input tokens:                      4096000   
Total generated tokens:                  16384     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         58.71     
Peak output token throughput (tok/s):    126.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          14736.17  
---------------Time to First Token----------------
Mean TTFT (ms):                          18711.25  
Median TTFT (ms):                        19996.18  
P99 TTFT (ms):                           27108.63  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.90      
Median TPOT (ms):                        7.90      
P99 TPOT (ms):                           7.91      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.99      
Median ITL (ms):                         7.94      
P99 ITL (ms):                            8.14      
==================================================

PR:

============ Serving Benchmark Result ============
Successful requests:                     8         
Failed requests:                         0         
Maximum request concurrency:             1         
Benchmark duration (s):                  260.05    
Total input tokens:                      4096000   
Total generated tokens:                  16384     
Request throughput (req/s):              0.03      
Output token throughput (tok/s):         63.00     
Peak output token throughput (tok/s):    140.00    
Peak concurrent requests:                2.00      
Total token throughput (tok/s):          15813.94  
---------------Time to First Token----------------
Mean TTFT (ms):                          17825.19  
Median TTFT (ms):                        20064.59  
P99 TTFT (ms):                           20086.67  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          7.17      
Median TPOT (ms):                        7.17      
P99 TPOT (ms):                           7.24      
---------------Inter-token Latency----------------
Mean ITL (ms):                           7.26      
Median ITL (ms):                         7.15      
P99 ITL (ms):                            9.00      
==================================================

~10% TPOT improvement
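
As a quick check of that figure, the snippet below recomputes it from the two mean TPOT values reported above. It shows both common definitions, since "improvement" can mean either a speedup ratio or a latency reduction; the variable names are illustrative.

```python
main_tpot_ms = 7.90  # Mean TPOT from the MAIN run above
pr_tpot_ms = 7.17    # Mean TPOT from the PR run above

# Speedup: how much faster the PR generates each output token.
speedup = main_tpot_ms / pr_tpot_ms - 1.0
# Latency reduction: how much per-token time was removed.
latency_reduction = 1.0 - pr_tpot_ms / main_tpot_ms

print(f"speedup: {speedup:.1%}")                      # speedup: 10.2%
print(f"latency reduction: {latency_reduction:.1%}")  # latency reduction: 9.2%
```

The "~10%" figure corresponds to the speedup definition.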

E2E results with concurrency=1 for increasing ISL, averaged over 3 runs on B300

  ┌───────┬─────────┬───────────┬─────────┬────────┬──────────────────┬────────────────┬───────────────┐
  │ Input │ topK SL │ MAIN TPOT │ PR TPOT │ TPOT Δ │ MAIN total tok/s │ PR total tok/s │ total tok/s Δ │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 32K   │ 8K      │ 7.28      │ 7.03    │ +3.6%  │ 2261             │ 2340           │ +3.5%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 64K   │ 16K     │ 7.29      │ 7.19    │ +1.5%  │ 4338             │ 4406           │ +1.6%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 128K  │ 32K     │ 7.46      │ 7.32    │ +1.9%  │ 7132             │ 7250           │ +1.7%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 256K  │ 64K     │ 7.71      │ 7.32    │ +5.3%  │ 11097            │ 11689          │ +5.3%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 512K  │ 128K    │ 7.90      │ 7.35    │ +7.5%  │ 14943            │ 15533          │ +4.0%         │
  ├───────┼─────────┼───────────┼─────────┼────────┼──────────────────┼────────────────┼───────────────┤
  │ 1M    │ 250K    │ 8.21      │ 8.04    │ +2.1%  │ 14633            │ 14715          │ +0.6%         │
  └───────┴─────────┴───────────┴─────────┴────────┴──────────────────┴────────────────┴───────────────┘

Conclusions: TPOT is nearly flat for ISL from 32K up to 512K (the difference is 7.1 ms vs 7.3 ms). For 1M it increases a bit more than expected, which should probably be studied separately. These conclusions match the microbenchmarks (topK=512): for sl 32K-128K the topK v2 kernel is 70-80% faster than v1, while for 262K the improvement is ~40%.

UPDATE: Check out the latest follow-up comments on this PR below.

Accuracy

GSM8K

python tests/evals/gsm8k/gsm8k_eval.py

MAIN:

Results:
Accuracy: 0.948
Invalid responses: 0.000
Total latency: 37.024 s
Questions per second: 35.625
Total output tokens: 116954
Output tokens per second: 3158.851

PR:

Results:
Accuracy: 0.948
Invalid responses: 0.000
Total latency: 37.334 s
Questions per second: 35.330
Total output tokens: 117784
Output tokens per second: 3154.878

MRCR 2-needle eval — MAIN vs PR (DeepSeek-V4-Flash, B300, TP=4)


  ┌──────────┬────────────┬──────────┬─────────┐
  │  Bucket  │ MAIN score │ PR score │    Δ    │
  ├──────────┼────────────┼──────────┼─────────┤
  │ 0-8K     │ 0.9885     │ 0.9880   │ -0.0005 │
  ├──────────┼────────────┼──────────┼─────────┤
  │ 16K-32K  │ 0.9209     │ 0.9303   │ +0.0094 │
  ├──────────┼────────────┼──────────┼─────────┤
  │ 32K-64K  │ 0.9194     │ 0.9196   │ +0.0002 │
  ├──────────┼────────────┼──────────┼─────────┤
  │ 64K-128K │ 0.9392     │ 0.9593   │ +0.0201 │
  └──────────┴────────────┴──────────┴─────────┘

Potential TODOs:

  • Stride alignment: TMA requires logits.stride(0) % 4 == 0, currently enforced via TORCH_CHECK. This always holds when stride = max_model_len from the model config, but an odd user-supplied --max-model-len would crash. Decide whether to keep topK v1 as a fallback, pad in the dispatcher, or just let TORCH_CHECK fail for --max-model-len values that are not a multiple of 4.
  • Long-context evals: run the MRCR eval at long context to verify e2e correctness beyond kernel-level tests and GSM8K evals. "Implement openai/mrcr long context evaluation benchmark" (EleutherAI/lm-evaluation-harness#3754) is well suited for that.
  • Cross-platform benchmarks: repeat the benchmarks on B200 and H200.
  • (Optional) Run e2e benchmarks on DSv4-Pro too.
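
For the stride-alignment item, the "pad in the dispatcher" option could look roughly like the Python sketch below. It assumes float32 logits (as the `sizeof(float)` in the review snippets suggests) and that the multiple-of-4 rule comes from a 16-byte alignment requirement; the helper names are hypothetical and not part of vLLM.

```python
TMA_ALIGN_BYTES = 16  # assumed alignment requirement for bulk TMA copies
FLOAT32_BYTES = 4     # assumes float32 logits
ALIGN_ELEMS = TMA_ALIGN_BYTES // FLOAT32_BYTES  # 4 elements


def stride_is_tma_compatible(stride: int) -> bool:
    """Mirror of the logits.stride(0) % 4 == 0 check."""
    return stride % ALIGN_ELEMS == 0


def padded_stride(max_model_len: int) -> int:
    """Round the row stride up to the next multiple of ALIGN_ELEMS."""
    return (max_model_len + ALIGN_ELEMS - 1) // ALIGN_ELEMS * ALIGN_ELEMS
```

Padding costs at most three extra elements per row, so it may be cheaper than keeping a second kernel as a fallback, at the price of allocating the logits buffer with the padded stride.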
@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft May 18, 2026 18:56

gemini-code-assist Bot left a comment (Contributor)

Code Review

This pull request introduces a cluster-persistent TopK implementation for SM90+ architectures, leveraging TMA and DSMEM for improved performance at K=1024. The review feedback identifies critical correctness issues related to potential out-of-bounds memory accesses during TMA load operations due to incorrect size calculations. It also suggests using unsigned integers for counters to improve type safety and consistency.

Comment thread csrc/cluster_topk.cuh Outdated
for (uint32_t i = 0; i < kNumStages8; i++) {
if (i >= num_iters) break;
const auto off = i * kSizePerStage;
const auto sz = min(kSizePerStage, len_aligned - off) * sizeof(float);

critical

The tma_load operation uses len_aligned - off to determine the size of the data to load. len_aligned is the my_len value rounded up to the nearest multiple of 4. If my_len is not a multiple of 4, len_aligned will be greater than my_len. This can lead to tma_load attempting to read beyond the actual data available (my_len), resulting in an out-of-bounds memory access. The size should be capped by the actual my_len.

For example, if my_len = 5 and kAlign = 4, then len_aligned = 8. If off = 0, sz would be calculated based on 8 - 0 = 8 elements, but only 5 are valid. This is a critical correctness issue.

      const auto sz = min(kSizePerStage, my_len - off) * sizeof(float);
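
The reviewer's numeric example can be reproduced with a few lines of Python. This is only an arithmetic sketch of the size computation: the stage size of 16384 elements is a placeholder (any value above 8 gives the same result), and it does not model whether the rows are padded in memory.

```python
K_ALIGN = 4              # kAlign from the review comment
FLOAT_BYTES = 4          # sizeof(float)
SIZE_PER_STAGE = 16384   # placeholder for kSizePerStage (assumption)


def round_up(n: int, align: int = K_ALIGN) -> int:
    return (n + align - 1) // align * align


def stage_elems(length: int, off: int) -> int:
    """Elements requested for the stage starting at `off`."""
    return min(SIZE_PER_STAGE, length - off)


my_len = 5
len_aligned = round_up(my_len)            # 8
requested = stage_elems(len_aligned, 0)   # 8 elements, as in the original code
capped = stage_elems(my_len, 0)           # 5 elements, as in the suggestion
over_read = requested - capped            # 3 elements past my_len
capped_bytes = capped * FLOAT_BYTES       # 20 bytes
```

Note that the capped size (20 bytes here) is no longer a multiple of 16 bytes, which bulk TMA copies require, so the suggested change trades one constraint for another; whether the aligned read is actually out of bounds depends on how rows are padded.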
Comment thread csrc/cluster_topk.cuh Outdated
for (uint32_t i = 0; i < kNumStages4; i++) {
if (i >= ni) break;
const auto o = i * kSizePerStage;
const auto sz = min(kSizePerStage, la - o) * sizeof(float);

critical

Similar to the issue on line 505, the tma_load in stream_pass4 uses la - o to determine the size. la is the length value rounded up to the nearest multiple of 4. This can cause an out-of-bounds read if length is not a multiple of 4, as tma_load might attempt to read beyond the actual length of the data. The size should be capped by the actual length.

This is a critical correctness issue.

      const auto sz = min(kSizePerStage, length - o) * sizeof(float);
Comment thread csrc/cluster_topk.cuh Outdated
const auto u = (sl + kA-1)/kA, b = u/CS, e = u%CS;
const auto lu = b + (rank < e ? 1u : 0u);
const auto ou = rank * b + min(rank, e);
const auto ms = ou * kA, ml = min(ms + lu * kA, sl) - ms;

critical

This line in large_topk_twopass4 appears to be a copy-paste error from large_topk_fused8. It uses len_aligned - off, but len_aligned is defined only in large_topk_fused8 and not in this scope, so the code would either fail to compile or compute the wrong size. It should use ml - off, where ml is the actual length of the current partition. Using an undefined or incorrect variable for the tma_load size is a critical correctness issue.

      const auto sz = min(kSizePerStage, ml - off) * sizeof(float);
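
For readers following the terse variable names in the quoted snippet (u, b, e, lu, ou, ms, ml), here is a Python transcription of the per-rank partitioning arithmetic. The descriptive names are mine, `align=4` is an assumed value for kA, and the guard on short rows is added for the sketch because the unsigned subtraction in the original would wrap if a rank received no units.

```python
def partition(sl: int, cs: int, rank: int, align: int = 4):
    """Return (start, length) of the slice of a row owned by `rank`.

    The row of `sl` elements is cut into `align`-sized units that are
    spread as evenly as possible over `cs` cluster ranks.
    """
    if sl < cs * align:
        raise ValueError("sketch assumes every rank gets at least one unit")
    units = (sl + align - 1) // align                    # u
    base, extra = divmod(units, cs)                      # b, e
    my_units = base + (1 if rank < extra else 0)         # lu
    first_unit = rank * base + min(rank, extra)          # ou
    start = first_unit * align                           # ms
    length = min(start + my_units * align, sl) - start   # ml
    return start, length
```

For sl=37 and four ranks this yields slices of 12, 12, 8 and 5 elements: only the last rank sees a length that is not a multiple of the alignment, which is where the size calculation discussed in this thread matters.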
Comment thread csrc/cluster_topk.cuh Outdated

struct alignas(16) MatchBin { uint32_t bin, above_count, equal_count; };
struct alignas(8) Tie { uint32_t idx; float score; };
struct ClusterState { int output_counter; };

high

The output_counter in ClusterState is declared as an int. While unlikely to overflow with current K and kMaxTies values, it's generally safer and more consistent to use uint32_t for counters that are always non-negative, especially when uint32_t values (la, le) are cast to int before being added atomically. This prevents any potential issues if la or le were to exceed INT_MAX in future modifications.

struct ClusterState { uint32_t output_counter; };
@LopezCastroRoberto LopezCastroRoberto changed the title [Perf][DSv4] Add persistent_topK v2 kernel with cluster syncs and TMA May 20, 2026
@LopezCastroRoberto LopezCastroRoberto changed the title [Perf][DSv4] Add cluster-cooperative topK kernel with DSMEM, TMA and adaptive dispatch May 20, 2026
@LopezCastroRoberto LopezCastroRoberto marked this pull request as ready for review May 27, 2026 15:38
@LopezCastroRoberto LopezCastroRoberto changed the title [Perf][DSv4] Add cluster-cooperative topK kernel with DSMEM, TMA May 27, 2026
@LopezCastroRoberto LopezCastroRoberto changed the title [Perf][DSv4/DSv3.2][WIP] Add cluster-cooperative topK kernel with DSMEM, TMA May 27, 2026
mergify Bot commented May 27, 2026 (Contributor)

Hi @LopezCastroRoberto, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip: Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
mergify Bot commented May 27, 2026 (Contributor): the pre-commit checks failed again; the instructions are identical to the earlier mergify comment.
@zyongye zyongye self-assigned this May 27, 2026
@zyongye zyongye added deepseek Related to DeepSeek models DSv4 labels May 27, 2026
@zyongye zyongye linked an issue May 27, 2026 that may be closed by this pull request
32 tasks
mergify Bot commented May 27, 2026 (Contributor): the pre-commit checks failed again; the instructions are identical to the earlier mergify comment.
mergify Bot commented May 27, 2026 (Contributor)

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LopezCastroRoberto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify Bot added the needs-rebase label May 27, 2026
@LopezCastroRoberto LopezCastroRoberto changed the title [Perf][DSv4/DSv3.2][WIP] Add cluster-cooperative topK kernel for low-latency scenarios May 28, 2026
LopezCastroRoberto commented May 28, 2026 (Contributor, Author)

Update

Increasing the cluster size to 16 for bs ≤ 4 appears to improve the long-context case significantly (more per-row parallelism, which becomes critical for long rows).

On Blackwell:

The maximum portable cluster size supported is 8; however, NVIDIA Blackwell B200 GPU allows for a nonportable cluster size of 16 by opting in

src: https://docs.nvidia.com/cuda/blackwell-tuning-guide/index.html#thread-block-clusters

Microbenchmarks

  CS=8 (bs≤4) vs v1                    CS=16 (bs≤4) vs v1
  ┌──────┬──────┬──────┐               ┌──────┬──────┬──────┐
  │  sl  │ bs=1 │ bs=4 │               │  sl  │ bs=1 │ bs=4 │
  ├──────┼──────┼──────┤               ├──────┼──────┼──────┤
  │ 2K   │ 1.48 │ 1.43 │               │ 2K   │ 2.31 │ 1.46 │
  │ 4K   │ 2.31 │ 2.21 │               │ 4K   │ 2.45 │ 2.27 │
  │ 8K   │ 1.98 │ 1.98 │               │ 8K   │ 2.35 │ 2.00 │
  │ 16K  │ 1.68 │ 1.70 │               │ 16K  │ 1.77 │ 1.71 │
  │ 32K  │ 1.89 │ 1.80 │               │ 32K  │ 1.79 │ 1.67 │
  │ 65K  │ 1.77 │ 2.10 │               │ 65K  │ 2.12 │ 2.04 │
  │ 128K │ 1.73 │ 1.78 │               │ 128K │ 1.93 │ 1.90 │
  │ 262K │ 1.42 │ 1.42 │               │ 262K │ 1.70 │ 1.66 │
  └──────┴──────┴──────┘               └──────┴──────┴──────┘
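
To see where CS=16 actually helps, the two tables can be compared row by row. The Python snippet below only re-uses the speedups listed above and divides them; the variable and function names are illustrative.

```python
sl_labels = ["2K", "4K", "8K", "16K", "32K", "65K", "128K", "262K"]

# Speedups over v1, copied from the two tables above.
cs8 = {
    1: [1.48, 2.31, 1.98, 1.68, 1.89, 1.77, 1.73, 1.42],
    4: [1.43, 2.21, 1.98, 1.70, 1.80, 2.10, 1.78, 1.42],
}
cs16 = {
    1: [2.31, 2.45, 2.35, 1.77, 1.79, 2.12, 1.93, 1.70],
    4: [1.46, 2.27, 2.00, 1.71, 1.67, 2.04, 1.90, 1.66],
}


def relative_gain(bs: int) -> dict:
    """Ratio of the CS=16 speedup to the CS=8 speedup per sequence length."""
    return {sl: b / a for sl, a, b in zip(sl_labels, cs8[bs], cs16[bs])}


def regressions(bs: int) -> list:
    """Sequence lengths where CS=16 is slower than CS=8."""
    return [sl for sl, r in relative_gain(bs).items() if r < 1.0]
```

On these numbers the gain is largest at 262K (about 1.2x for bs=1), while 32K regresses for both batch sizes and 65K regresses for bs=4, so a sequence-length-dependent choice of cluster size may be worth considering.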
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Jun 10, 2026
@mergify mergify Bot added the ci/build label Jun 19, 2026
@LopezCastroRoberto LopezCastroRoberto force-pushed the perf/persistent_topk_v2 branch 2 times, most recently from 6259142 to 313480b Compare June 19, 2026 16:36
@mergify mergify Bot removed the needs-rebase label Jun 19, 2026
LopezCastroRoberto and others added 2 commits June 19, 2026 16:40
Port cooperative cluster top-k kernels and launchers to
csrc/libtorch_stable/, gate registration with
VLLM_ENABLE_COOPERATIVE_TOPK, and route decode sparse indexer
to cooperative_topk when eligible.

Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto LopezCastroRoberto force-pushed the perf/persistent_topk_v2 branch from 313480b to d867d1f Compare June 19, 2026 16:40
mergify Bot commented Jun 19, 2026 (Contributor): the pre-commit checks failed again; the instructions are identical to the earlier mergify comment.

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
zyongye and others added 2 commits June 22, 2026 19:16
Co-authored-by: OpenAI Codex <codex@openai.com>

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@LopezCastroRoberto LopezCastroRoberto requested a review from mgoin June 23, 2026 12:00
mergify Bot commented Jun 23, 2026 (Contributor): the pre-commit checks failed again; the instructions are identical to the earlier mergify comment.

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
@zyongye zyongye added this to the v0.24.0 cherrypick milestone Jun 23, 2026
Signed-off-by: Woosuk Kwon <woosuk@inferact.ai>
@WoosukKwon WoosukKwon merged commit 855cd4d into vllm-project:main Jun 23, 2026
194 of 202 checks passed
khluu pushed a commit that referenced this pull request Jun 24, 2026
…cy scenarios (#43008)

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
(cherry picked from commit 855cd4d)
nkzhenhua pushed a commit to nkzhenhua/vllm that referenced this pull request Jun 24, 2026
…cy scenarios (vllm-project#43008)

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
eugr commented Jun 25, 2026

@LopezCastroRoberto - this PR breaks DeepSeek V4 Flash on DGX Spark (sm120).

It fails during startup CUDA graph memory profiling.

The crash is in the sparse attention indexer path added/changed by this PR:

sparse_attn_indexer -> torch.ops._C.cooperative_topk

Error:

RuntimeError: launch_cooperative_cluster,
cooperative_topk.cu:46,
cooperative_topk launch failed: invalid argument

This reproduces with DeepSeek-V4-Flash, MTP enabled, TP=2, kv-cache fp8, on SM12.1. It also reproduces whether VLLM_USE_BREAKABLE_CUDAGRAPH is auto-enabled or explicitly disabled, so breakable CUDA graphs do not appear to avoid the failing kernel.

The selector currently uses has_device_capability(90), so SM100/SM120 take the cooperative_topk path. If I locally restrict cooperative_topk to exact SM90 and let SM120 fall back to persistent_topk, the model starts successfully, completes CUDA graph profiling/capture, and reaches API server startup.

Could cooperative_topk be guarded to SM90 only, or otherwise validated for SM100/SM120 with a fallback to persistent_topk?
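
One way to express the requested guard is an allow-list of compute-capability major versions instead of a "≥ 90" threshold. The Python sketch below is hypothetical: the function names and the default allow-list are assumptions (major version 10 is included only because the PR description reports B300 benchmarks), and it is not the selector code in vLLM.

```python
def use_cooperative_topk(capability, allowed_majors=frozenset({9, 10})) -> bool:
    """Return True if the (major, minor) capability is on the allow-list.

    `allowed_majors` is an assumption for illustration, not a validated list.
    """
    major, _minor = capability
    return major in allowed_majors


def select_topk_kernel(capability) -> str:
    """Fall back to persistent_topk on architectures that are not allow-listed."""
    if use_cooperative_topk(capability):
        return "cooperative_topk"
    return "persistent_topk"
```

With this shape, the SM12.1 device from the report above selects persistent_topk, and new architectures stay on the fallback until they are explicitly validated.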

@mgoin - FYI.

qli88 pushed a commit to qli88/vllm that referenced this pull request Jun 26, 2026
…cy scenarios (vllm-project#43008)

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Signed-off-by: Qiang Li <qiang.li2@amd.com>
wincent8 pushed a commit to wincent8/vllm that referenced this pull request Jun 29, 2026
…cy scenarios (vllm-project#43008)

Signed-off-by: LopezCastroRoberto <rocastro@redhat.com>
Labels: ci/build, deepseek (Related to DeepSeek models), DSv4, ready (ONLY add when PR is ready to merge/full CI is needed)

5 participants