[dashboard] Add py-spy --idle and --subprocesses flags to profiling endpoints #63852
The dashboard CPU profiling endpoints (/task/cpu_profile and /worker/cpu_profile) run `py-spy record`, which by default only samples on-CPU (runnable) threads. py-spy's `--idle` flag additionally includes off-CPU / sleeping threads (blocked on locks, I/O, CUDA syncs), which is essential for diagnosing a stalled server.

This plumbs an `idle` query parameter through the same path the existing `native` flag uses: the `idle=1` query param on the endpoint -> the `idle` field on the CpuProfilingRequest proto -> the reporter agent handler -> CpuProfilingManager.cpu_profile, which appends `--idle` to the py-spy command.

Unlike `--native` (Linux-only), `--idle` is supported on all platforms and is therefore not platform-gated.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
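The last hop of that path (a boolean turning into a command-line flag) can be sketched as below. This is a minimal illustration, not Ray's actual `CpuProfilingManager.cpu_profile`: the helper name `build_record_cmd`, its parameters, and the choice to silently drop `--native` off Linux are all assumptions. The py-spy options it emits (`-o`, `-p`, `-d`, `-f`, `--native`, `--idle`) are real `py-spy record` flags.

```python
import sys
from typing import List


def build_record_cmd(
    pid: int,
    output_path: str,
    duration_s: int = 5,
    fmt: str = "flamegraph",
    native: bool = False,
    idle: bool = False,
    platform: str = sys.platform,
) -> List[str]:
    """Hypothetical helper: assemble a `py-spy record` argv list."""
    cmd = [
        "py-spy", "record",
        "-o", output_path,
        "-p", str(pid),
        "-d", str(duration_s),
        "-f", fmt,
    ]
    # --native is gated to Linux, as the commit message describes.
    # Assumption: this sketch silently drops it on other platforms.
    if native and platform.startswith("linux"):
        cmd.append("--native")
    # --idle is not platform-gated: append it whenever requested.
    if idle:
        cmd.append("--idle")
    return cmd
```

Passing `platform` explicitly keeps the gating logic easy to exercise on any machine; the real manager may structure this differently.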
py-spy can profile a process together with its child processes via the `--subprocesses` flag. This is useful for Ray workers that fork children that do the real work -- e.g. PyTorch DataLoader workers, multiprocessing pools, or multiproc inference backends (vLLM tensor-parallel workers) -- where the parent worker is mostly idle and the interesting CPU activity lives in subprocesses the flamegraph would otherwise miss.

This stacks on the `idle` change and plumbs a `subprocesses` query parameter through the same path as `native`/`idle`: `subprocesses=1` on the endpoint -> the `subprocesses` field on the CpuProfilingRequest proto -> the reporter agent handler -> CpuProfilingManager.cpu_profile, which appends `--subprocesses` to the py-spy command. Like `--idle`, it is supported on all platforms and is not platform-gated.

Because py-spy discovers child processes by periodically scanning the process tree, very short-lived subprocesses may be missed; persistent workers (DataLoader, vLLM) are captured reliably.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
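On the client side, the two query parameters could be appended to a profiling request URL roughly as follows. This is only a sketch: `profile_url` is a hypothetical helper, the base address is a placeholder, and `pid` and `ip` are assumed parameter names; only `idle=1` and `subprocesses=1` come from the commit messages above.

```python
from urllib.parse import urlencode


def profile_url(
    base: str,
    endpoint: str,
    pid: int,
    ip: str,
    idle: bool = False,
    subprocesses: bool = False,
) -> str:
    """Hypothetical helper: build a dashboard profiling request URL."""
    # `pid` and `ip` are assumed parameter names for this illustration.
    params = {"pid": pid, "ip": ip}
    # Both flags default off, so they are only added when requested.
    if idle:
        params["idle"] = 1
    if subprocesses:
        params["subprocesses"] = 1
    return f"{base.rstrip('/')}{endpoint}?{urlencode(params)}"


url = profile_url(
    "http://127.0.0.1:8265", "/worker/cpu_profile",
    pid=42, ip="10.0.0.1", idle=True, subprocesses=True,
)
```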
Code Review
This pull request adds support for the --idle (to include off-CPU/sleeping threads) and --subprocesses (to profile child processes) flags in the CPU profiling feature using py-spy. It updates the protobuf definitions, the dashboard reporter agent and head, the profiling manager, the documentation, and adds corresponding unit tests. There are no review comments, so I have no feedback to provide.
py-spy's `dump` subcommand supports `--subprocesses` just like `record`, but the dashboard's "Stack Trace" path (GetTraceback -> trace_dump) only ever passed `--native`, so child-process stacks could not be captured there. This mirrors the cpu_profile change onto the traceback path so a worker that forks children (PyTorch DataLoader, multiprocessing, or vLLM tensor-parallel workers) can have those child stacks dumped too.

A `subprocesses` query parameter on the `/task/traceback` and `/worker/traceback` endpoints is plumbed through the `subprocesses` field on the GetTracebackRequest proto and the GetTraceback agent handler into CpuProfilingManager.trace_dump, which appends `--subprocesses` to the `py-spy dump` command. Like the other flags, it is not platform-gated.

Note the asymmetry with `--idle`: idle is only meaningful for `record` (py-spy `dump` already snapshots all threads, including off-CPU ones), whereas `--subprocesses` is a genuine capability gap on the dump path -- hence this follow-up extends the traceback path too.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
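To make the asymmetry concrete, here is a compressed sketch of the traceback path that goes straight from a request URL to a `py-spy dump` argv list. It deliberately skips the proto and agent hops, and the names `flag_enabled` and `dump_cmd_from_url`, the `pid` parameter, and the rule that only the value `"1"` enables a flag are assumptions for illustration, not Ray's handler code.

```python
from typing import Dict, List
from urllib.parse import parse_qs, urlsplit


def flag_enabled(query: Dict[str, List[str]], name: str) -> bool:
    """Treat a query flag as on only when its last value is "1" (assumed rule)."""
    # parse_qs maps each key to a list of values.
    return query.get(name, ["0"])[-1] == "1"


def dump_cmd_from_url(url: str) -> List[str]:
    """Hypothetical helper: map a traceback URL to a `py-spy dump` command."""
    query = parse_qs(urlsplit(url).query)
    cmd = ["py-spy", "dump", "-p", query["pid"][0]]
    if flag_enabled(query, "native"):
        cmd.append("--native")  # platform gating omitted in this sketch
    if flag_enabled(query, "subprocesses"):
        cmd.append("--subprocesses")
    # No `idle` handling here: `dump` already snapshots all threads,
    # so an `idle=1` parameter has nothing to map to on this path.
    return cmd
```

With this shape, a request such as `/worker/traceback?pid=7&subprocesses=1&idle=1` yields a command that includes `--subprocesses` and ignores `idle`.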
kshanmol left a comment
LGTM, thanks for this improvement!
…ndpoints (ray-project#63852)

The dashboard profiling endpoints run `py-spy`, using `py-spy record` for the **CPU Flame Graph** (`/task/cpu_profile`, `/worker/cpu_profile`) and `py-spy dump` for the **Stack Trace** (`/task/traceback`, `/worker/traceback`), but only ever exposed the `native` flag. This PR plumbs two more py-spy flags through the same paths (query param → request proto field → reporter agent handler → `CpuProfilingManager`):

- **`idle=1` → `py-spy record --idle`** (CPU Flame Graph). `py-spy record` samples only on-CPU (runnable) threads by default, so a worker blocked on a lock, I/O, or a CUDA sync shows up as near-idle and the flame graph is empty/misleading exactly when you need it most. `--idle` additionally captures off-CPU / sleeping threads, surfacing where a stalled worker is actually parked.
- **`subprocesses=1` → `py-spy --subprocesses`** (both CPU Flame Graph *and* Stack Trace). A Ray worker is normally a single process, but many workloads fork children that do the real work: PyTorch `DataLoader(num_workers>0)`, `multiprocessing` pools, or multiproc inference backends (e.g. vLLM tensor-parallel workers). `--subprocesses` follows the worker's process tree so activity in those children appears in the profile / stack dump instead of being invisible.

**Why `idle` is record-only but `subprocesses` covers both paths:** `py-spy dump` already snapshots *all* threads (including off-CPU / blocked ones), so `--idle` is only meaningful for `record`. `--subprocesses`, however, is a real capability gap on the dump path (without it, child stacks can't be captured via the Stack Trace button), so it is plumbed through `trace_dump` as well.

All flags default off and, unlike `--native` (which Ray restricts to Linux), work on all platforms, so none are platform-gated.

- **Commits** (stacked): (1) `--idle` on `cpu_profile`, (2) `--subprocesses` on `cpu_profile`, (3) `--subprocesses` on the traceback endpoints.
- **Docs:** all flags are documented in the Dashboard profiling guide (`optimize-performance.rst`); append `&idle=1` and/or `&subprocesses=1` to the relevant request URL.
- **Tests:** `test_profile_manager.py` asserts each flag is appended to the constructed py-spy command iff requested (and absent by default), for both `cpu_profile` (record) and `trace_dump` (dump).
- **Caveat (subprocesses):** py-spy discovers child processes by periodically scanning the process tree, so very short-lived subprocesses may be missed; persistent workers (DataLoader, vLLM) are captured reliably.
- The generated `reporter_pb2` bindings are gitignored and regenerated in CI, where the agent/head code paths and these tests run.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Signed-off-by: Marwan Sarieddine <sarieddine.marwan@gmail.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Description
The dashboard profiling endpoints run `py-spy`, using `py-spy record` for the CPU Flame Graph (`/task/cpu_profile`, `/worker/cpu_profile`) and `py-spy dump` for the Stack Trace (`/task/traceback`, `/worker/traceback`), but only ever exposed the `native` flag. This PR plumbs two more py-spy flags through the same paths (query param → request proto field → reporter agent handler → `CpuProfilingManager`):

- `idle=1` → `py-spy record --idle` (CPU Flame Graph). `py-spy record` samples only on-CPU (runnable) threads by default, so a worker blocked on a lock, I/O, or a CUDA sync shows up as near-idle and the flame graph is empty/misleading exactly when you need it most. `--idle` additionally captures off-CPU / sleeping threads, surfacing where a stalled worker is actually parked.
- `subprocesses=1` → `py-spy --subprocesses` (both CPU Flame Graph and Stack Trace). A Ray worker is normally a single process, but many workloads fork children that do the real work: PyTorch `DataLoader(num_workers>0)`, `multiprocessing` pools, or multiproc inference backends (e.g. vLLM tensor-parallel workers). `--subprocesses` follows the worker's process tree so activity in those children appears in the profile / stack dump instead of being invisible.

Why `idle` is record-only but `subprocesses` covers both paths: `py-spy dump` already snapshots all threads (including off-CPU / blocked ones), so `--idle` is only meaningful for `record`. `--subprocesses`, however, is a real capability gap on the dump path (without it, child stacks can't be captured via the Stack Trace button), so it is plumbed through `trace_dump` as well.

All flags default off and, unlike `--native` (which Ray restricts to Linux), work on all platforms, so none are platform-gated.

Related issues
N/A
Additional information
- Commits (stacked): (1) `--idle` on `cpu_profile`, (2) `--subprocesses` on `cpu_profile`, (3) `--subprocesses` on the traceback endpoints.
- Docs: all flags are documented in the Dashboard profiling guide (`optimize-performance.rst`); append `&idle=1` and/or `&subprocesses=1` to the relevant request URL.
- Tests: `test_profile_manager.py` asserts each flag is appended to the constructed py-spy command iff requested (and absent by default), for both `cpu_profile` (record) and `trace_dump` (dump).
- Caveat (subprocesses): py-spy discovers child processes by periodically scanning the process tree, so very short-lived subprocesses may be missed; persistent workers (DataLoader, vLLM) are captured reliably.
- The generated `reporter_pb2` bindings are gitignored and regenerated in CI, where the agent/head code paths and these tests run.

🤖 Generated with Claude Code
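The "appended iff requested" property that the tests check can be illustrated with a small exhaustive loop. This is a sketch of the idea only, not the contents of `test_profile_manager.py`: `build_cmd` is a hypothetical stand-in for the command construction under test, and `check_flags_iff_requested` is an illustrative name.

```python
from itertools import product
from typing import List


def build_cmd(
    subcommand: str,
    pid: int,
    idle: bool = False,
    subprocesses: bool = False,
) -> List[str]:
    """Hypothetical stand-in for the py-spy command construction under test."""
    cmd = ["py-spy", subcommand, "-p", str(pid)]
    # idle applies to `record` only; subprocesses applies to both paths.
    if idle and subcommand == "record":
        cmd.append("--idle")
    if subprocesses:
        cmd.append("--subprocesses")
    return cmd


def check_flags_iff_requested() -> int:
    """Assert each flag appears exactly when requested; return the case count."""
    cases = 0
    for sub, idle, subs in product(("record", "dump"), (False, True), (False, True)):
        cmd = build_cmd(sub, 1234, idle=idle, subprocesses=subs)
        assert ("--idle" in cmd) == (idle and sub == "record")
        assert ("--subprocesses" in cmd) == subs
        cases += 1
    return cases
```

Enumerating every combination also covers the default case, where neither flag is requested and neither should appear in the command.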