Fix AutoTuner tactic timing (%globaltimer) for Confidential Computing(CC) #3638

Draft
elvischenv wants to merge 1 commit into flashinfer-ai:release-v0.6.11 from elvischenv:cc-autotuner-fixed

@elvischenv

Targets release-v0.6.11 (the v0.6.11.post1 line).

Under Confidential Computing, cudaEventElapsedTime is unreliable on the bounce-buffer path (can return negative values), so AutoTuner.choose_one's min(measured_time) ranking picks a near-random tactic per rank and bakes it into the tuning cache. Time the candidate run with the GPU %globaltimer register (tiny JIT stamp kernel) instead — same return signature, so choose_one and the cache format are unchanged.
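
To illustrate why a single bad reading matters, here is a minimal sketch of a min-based ranking. The tactic names and timings are hypothetical, and the dictionary lookup is a simplified stand-in for the ranking in AutoTuner.choose_one, not its actual code.

```python
# Hypothetical per-tactic timings in milliseconds (illustrative values only).
true_ms = {"tactic_a": 0.42, "tactic_b": 0.31, "tactic_c": 0.95}

# Simulate an unreliable timer: the slowest tactic gets a negative reading.
measured_ms = dict(true_ms)
measured_ms["tactic_c"] = -0.07

# Rank by smallest time, as a min(measured_time) selection would.
best_true = min(true_ms, key=true_ms.get)
best_measured = min(measured_ms, key=measured_ms.get)

print(best_true)      # tactic_b
print(best_measured)  # tactic_c
```

Any negative reading wins a min-based ranking outright, so under these assumptions the slowest tactic is the one that would be cached.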

Controlled by FLASHINFER_AUTOTUNE_TIMER (auto|globaltimer|cudaevent); auto uses %globaltimer only when CC is detected (NVML), so off-CC behavior is unchanged. FLASHINFER_CONFIDENTIAL_COMPUTE=1/0 overrides detection.
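
The selection rules above could be sketched as follows. This is a hedged illustration, not the code in this PR: resolve_timer and detect_cc are hypothetical names, detect_cc stands in for the NVML query, and both the "auto" default for an unset variable and the error on an unknown value are assumptions.

```python
from typing import Callable, Mapping

VALID_TIMERS = ("auto", "globaltimer", "cudaevent")


def resolve_timer(env: Mapping[str, str], detect_cc: Callable[[], bool]) -> str:
    """Return "globaltimer" or "cudaevent" for the given environment.

    detect_cc is a hypothetical stand-in for the NVML-based CC detection.
    """
    # Assumption: an unset FLASHINFER_AUTOTUNE_TIMER behaves like "auto".
    mode = env.get("FLASHINFER_AUTOTUNE_TIMER", "auto").strip().lower()
    if mode not in VALID_TIMERS:
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unknown FLASHINFER_AUTOTUNE_TIMER value: {mode!r}")
    if mode != "auto":
        return mode  # an explicit choice bypasses CC detection

    # FLASHINFER_CONFIDENTIAL_COMPUTE=1/0 overrides detection.
    override = env.get("FLASHINFER_CONFIDENTIAL_COMPUTE")
    if override in ("0", "1"):
        cc_enabled = override == "1"
    else:
        cc_enabled = detect_cc()
    return "globaltimer" if cc_enabled else "cudaevent"
```

In practice the first argument would be os.environ. Passing the environment and the detector in as parameters keeps the rule easy to check without a GPU or NVML.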

See CC_AUTOTUNER_FIX.md.

Under Confidential Computing, cudaEventElapsedTime is unreliable (can return
negative values on the bounce-buffer path), so AutoTuner.choose_one's
min(measured_time) ranking picks a near-random tactic per rank and bakes it
into the tuning cache. Time the candidate run with the GPU %globaltimer
register (tiny JIT stamp kernel) instead; same return signature, so
choose_one and the cache format are unchanged.

Controlled by FLASHINFER_AUTOTUNE_TIMER (auto|globaltimer|cudaevent); auto
uses %globaltimer only when CC is detected (NVML), so off-CC is unchanged.
FLASHINFER_CONFIDENTIAL_COMPUTE=1/0 overrides detection. Mirrors TensorRT-LLM
PR #11657. See CC_AUTOTUNER_FIX.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a Confidential Computing (CC) safe autotuner timing mechanism using the GPU's %globaltimer register to replace the unreliable cudaEventElapsedTime under CC environments. It includes CC detection via NVML, a JIT-compiled stamp kernel, and configuration controls. The feedback suggests optimizing the timing retrieval in pure_profile by copying the CUDA tensor to the CPU in a single transfer (ts.cpu().tolist()) instead of calling .item() twice, which reduces host-device communication overhead.

Comment thread flashinfer/autotuner.py
_run_kernels()
gt_stamp(ts[1:2])
stream.synchronize()
return (ts[1].item() - ts[0].item()) / 1e6 / repeat

medium

Calling .item() twice on a CUDA tensor triggers two separate synchronous device-to-host copies. Since stream.synchronize() has already been called, we can copy the entire tensor to the CPU in a single transfer and unpack it using .tolist(). This reduces host-device communication overhead during profiling.

Suggested change
- return (ts[1].item() - ts[0].item()) / 1e6 / repeat
+ t0, t1 = ts.cpu().tolist()
+ return (t1 - t0) / 1e6 / repeat
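
The arithmetic in this suggestion can be checked without a GPU. The sketch below assumes the two stamps are nanosecond counts, as the division by 1e6 implies; elapsed_ms_per_run and the stamp values are hypothetical, and the plain list stands in for the result of ts.cpu().tolist().

```python
def elapsed_ms_per_run(t0_ns: int, t1_ns: int, repeat: int) -> float:
    """Convert two nanosecond stamps into average milliseconds per run."""
    if repeat <= 0:
        raise ValueError("repeat must be positive")
    # ns -> ms, then average over the number of timed repetitions.
    return (t1_ns - t0_ns) / 1e6 / repeat


# Hypothetical stamps 12 ms apart, unpacked once as in the suggested change.
stamps = [1_000_000_000, 1_012_000_000]
t0, t1 = stamps
print(elapsed_ms_per_run(t0, t1, repeat=4))  # 3.0
```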

@elvischenv changed the title from "Confidential Computing: CC-safe AutoTuner tactic timing (%globaltimer)" to "Fix AutoTuner tactic timing (%globaltimer) for Confidential Computing(CC)" on Jun 15, 2026

@nvpohanh

@elvischenv the description looks outdated. could you update it?
