Fix AutoTuner tactic timing (%globaltimer) for Confidential Computing (CC) #3638
elvischenv wants to merge 1 commit into
Under Confidential Computing, cudaEventElapsedTime is unreliable (it can return negative values on the bounce-buffer path), so the min(measured_time) ranking in AutoTuner.choose_one picks a near-random tactic per rank and bakes it into the tuning cache.

Time the candidate run with the GPU %globaltimer register (tiny JIT stamp kernel) instead; the return signature is the same, so choose_one and the cache format are unchanged.

Controlled by FLASHINFER_AUTOTUNE_TIMER (auto|globaltimer|cudaevent); auto uses %globaltimer only when CC is detected (NVML), so off-CC behavior is unchanged. FLASHINFER_CONFIDENTIAL_COMPUTE=1/0 overrides detection.

Mirrors TensorRT-LLM PR #11657. See CC_AUTOTUNER_FIX.md.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
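To make the failure mode concrete, here is a minimal sketch of how a single negative measurement hijacks a min()-based ranking. The tactic names, the timings, and the `pick_fastest` helper are hypothetical and are not taken from the AutoTuner code; the only thing borrowed from the description is that the fastest measured time wins.

```python
# Hypothetical helper: mimics a "smallest measured time wins" ranking.
def pick_fastest(timings_ms):
    """Return the tactic name with the smallest measured time."""
    return min(timings_ms, key=timings_ms.get)


# Illustrative timings in milliseconds (made-up numbers).
reliable = {"tactic_a": 0.42, "tactic_b": 0.31, "tactic_c": 0.55}

# Same tactics, but one measurement came back negative,
# as an unreliable timer might report.
corrupted = {"tactic_a": 0.42, "tactic_b": 0.31, "tactic_c": -1.7}

best_reliable = pick_fastest(reliable)
best_corrupted = pick_fastest(corrupted)
print(best_reliable, best_corrupted)
```

With the corrupted measurements, the slowest tactic in the reliable run becomes the winner, and that choice would then be what gets cached.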
Code Review
This pull request introduces a Confidential Computing (CC)-safe autotuner timing mechanism that uses the GPU's %globaltimer register in place of cudaEventElapsedTime, which is unreliable in CC environments. It includes CC detection via NVML, a JIT-compiled stamp kernel, and configuration controls. The feedback suggests optimizing the timing retrieval in pure_profile by copying the CUDA tensor to the CPU in a single transfer (ts.cpu().tolist()) instead of calling .item() twice, which reduces host-device communication overhead.
    _run_kernels()
    gt_stamp(ts[1:2])
    stream.synchronize()
    return (ts[1].item() - ts[0].item()) / 1e6 / repeat
Calling .item() twice on a CUDA tensor triggers two separate synchronous device-to-host copies. Since stream.synchronize() has already been called, we can copy the entire tensor to the CPU in a single transfer and unpack it using .tolist(). This reduces host-device communication overhead during profiling.
    - return (ts[1].item() - ts[0].item()) / 1e6 / repeat
    + t0, t1 = ts.cpu().tolist()
    + return (t1 - t0) / 1e6 / repeat
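For reference, the arithmetic in the suggested lines can be checked without a GPU. The sketch below assumes the two stamps are %globaltimer readings in nanoseconds and that the result should be milliseconds per repetition, which is what the `/ 1e6 / repeat` expression implies. The `elapsed_ms_per_run` name and the sample values are hypothetical.

```python
# Hypothetical helper mirroring the (t1 - t0) / 1e6 / repeat expression.
def elapsed_ms_per_run(start_ns, end_ns, repeat):
    """Convert two nanosecond timestamps into milliseconds per repetition."""
    if repeat <= 0:
        raise ValueError("repeat must be positive")
    return (end_ns - start_ns) / 1e6 / repeat


# Stand-in for the two-element list that ts.cpu().tolist() would yield
# (made-up values, in nanoseconds).
t0, t1 = [1_000_000, 4_000_000]

per_run_ms = elapsed_ms_per_run(t0, t1, repeat=3)
print(per_run_ms)
```

Unpacking both stamps from one list is what lets the real code get away with a single device-to-host copy.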
@elvischenv the description looks outdated. Could you update it?
Targets release-v0.6.11 (the v0.6.11.post1 line).

Under Confidential Computing, cudaEventElapsedTime is unreliable on the bounce-buffer path (can return negative values), so AutoTuner.choose_one's min(measured_time) ranking picks a near-random tactic per rank and bakes it into the tuning cache. Time the candidate run with the GPU %globaltimer register (tiny JIT stamp kernel) instead; the return signature is the same, so choose_one and the cache format are unchanged.

Controlled by FLASHINFER_AUTOTUNE_TIMER (auto|globaltimer|cudaevent); auto uses %globaltimer only when CC is detected (NVML), so off-CC behavior is unchanged. FLASHINFER_CONFIDENTIAL_COMPUTE=1/0 overrides detection.

See CC_AUTOTUNER_FIX.md.
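The selection rules described above could be sketched roughly as follows. This is not the PR's implementation: `resolve_timer` and the `detect_cc` callback (standing in for the NVML check) are hypothetical, and treating an unset FLASHINFER_AUTOTUNE_TIMER as auto and rejecting unknown values are assumptions.

```python
VALID_MODES = ("auto", "globaltimer", "cudaevent")


def resolve_timer(env, detect_cc):
    """Pick a timer backend from an environment mapping.

    env:       mapping of environment variables (e.g. os.environ)
    detect_cc: zero-argument callable returning True when CC is active;
               a stand-in for the NVML-based detection (assumption)
    """
    # Assumption: an unset variable behaves like "auto".
    mode = env.get("FLASHINFER_AUTOTUNE_TIMER", "auto").strip().lower()
    if mode not in VALID_MODES:
        # Assumption: unknown values are rejected rather than ignored.
        raise ValueError(f"unknown timer mode: {mode!r}")
    if mode != "auto":
        return mode  # an explicit choice wins

    # In auto mode, the 1/0 override takes precedence over detection.
    override = env.get("FLASHINFER_CONFIDENTIAL_COMPUTE")
    if override == "1":
        cc_enabled = True
    elif override == "0":
        cc_enabled = False
    else:
        cc_enabled = detect_cc()
    return "globaltimer" if cc_enabled else "cudaevent"


print(resolve_timer({}, lambda: False))
print(resolve_timer({"FLASHINFER_CONFIDENTIAL_COMPUTE": "1"}, lambda: False))
```

Passing the environment and the detector in as arguments keeps the sketch testable without a GPU; a real caller would pass `os.environ` and the actual detection routine.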