Fix stale heartbeat process PIDs cleanup issue#902
Conversation
✅ Deploy Preview for agent-sandbox canceled.
|
There was a problem hiding this comment.
Pull request overview
Fixes a PID-recycling hazard in resourcectl cleanup where stale heartbeat PIDs persisted in ~/.local/resourcectl/state.json could cause SIGKILL to be delivered to an unrelated, recycled process. The fix verifies the target process's identity via /proc/<pid>/cmdline on Linux before killing, always zeroes stale PIDs, prunes fully-released resources from the state file, and tolerates ESRCH. Unit tests are added covering matching, innocent (non-heartbeat), and dead-PID scenarios.
Changes:
- Verify
/proc/<pid>/cmdlinecontainsheartbeatand--name <resource>before killing (Linux only); ignoreESRCH. - Unconditionally clear
HeartbeatPIDafter the kill attempt and remove fully-released resources from state instead of leaving empty entries. - Add
main_test.gowith a helper-process pattern and arunCleanupintegration test using a mock Boskos server.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| dev/tools/resourcectl/main.go | Adds isHeartbeatProcess PID verification, ESRCH tolerance, and prunes released resources from state. |
| dev/tools/resourcectl/main_test.go | New tests covering isHeartbeatProcess and the full runCleanup flow with matching, innocent, and dead PIDs. |
…ary process termination #### What this PR does / why we need it: Fixes a logic vulnerability in `resourcectl cleanup` where stale heartbeat process PIDs remained permanently trapped in the local state file (`~/.local/resourcectl/state.json`) if the heartbeat process exited prematurely. When the OS eventually recycled that PID and assigned it to a new innocent process, the subsequent execution of `resourcectl cleanup` would violently terminate that unrelated process via `SIGKILL`. To fix this, we: - Safely verify the heartbeat process identity on Linux by reading and parsing `/proc/<pid>/cmdline` to ensure it contains the `"heartbeat"` and `"--name" <resource>` arguments before attempting to kill it. - Guarantee that stale PIDs are zeroed in `state.json` even if the kill operation fails. - Prune fully released resources completely from the state file rather than leaving empty structs, preventing unbounded file growth. - Ignore `ESRCH` errors if a heartbeat process is already dead. - Added comprehensive unit tests in `main_test.go` to reproduce the conditions and verify safety and cleanup behavior. #### Which issue(s) this PR is related to: Fixes b/511322601 #### Release Note ```release-note Fixed a logic issue in `resourcectl` where stale heartbeat PIDs could lead to arbitrary process termination due to PID recycling. ```
c9e2b37 to
13ca5cf
Compare
|
/lgtm |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aditya-shantanu, barney-s The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
* dev/tools/resourcectl: fix persistent PID landmine and prevent arbitrary process termination #### What this PR does / why we need it: Fixes a logic vulnerability in `resourcectl cleanup` where stale heartbeat process PIDs remained permanently trapped in the local state file (`~/.local/resourcectl/state.json`) if the heartbeat process exited prematurely. When the OS eventually recycled that PID and assigned it to a new innocent process, the subsequent execution of `resourcectl cleanup` would violently terminate that unrelated process via `SIGKILL`. To fix this, we: - Safely verify the heartbeat process identity on Linux by reading and parsing `/proc/<pid>/cmdline` to ensure it contains the `"heartbeat"` and `"--name" <resource>` arguments before attempting to kill it. - Guarantee that stale PIDs are zeroed in `state.json` even if the kill operation fails. - Prune fully released resources completely from the state file rather than leaving empty structs, preventing unbounded file growth. - Ignore `ESRCH` errors if a heartbeat process is already dead. - Added comprehensive unit tests in `main_test.go` to reproduce the conditions and verify safety and cleanup behavior. #### Which issue(s) this PR is related to: Fixes b/511322601 #### Release Note ```release-note Fixed a logic issue in `resourcectl` where stale heartbeat PIDs could lead to arbitrary process termination due to PID recycling. ``` * address feedback * pr feedback
* dev/tools/resourcectl: fix persistent PID landmine and prevent arbitrary process termination #### What this PR does / why we need it: Fixes a logic vulnerability in `resourcectl cleanup` where stale heartbeat process PIDs remained permanently trapped in the local state file (`~/.local/resourcectl/state.json`) if the heartbeat process exited prematurely. When the OS eventually recycled that PID and assigned it to a new innocent process, the subsequent execution of `resourcectl cleanup` would violently terminate that unrelated process via `SIGKILL`. To fix this, we: - Safely verify the heartbeat process identity on Linux by reading and parsing `/proc/<pid>/cmdline` to ensure it contains the `"heartbeat"` and `"--name" <resource>` arguments before attempting to kill it. - Guarantee that stale PIDs are zeroed in `state.json` even if the kill operation fails. - Prune fully released resources completely from the state file rather than leaving empty structs, preventing unbounded file growth. - Ignore `ESRCH` errors if a heartbeat process is already dead. - Added comprehensive unit tests in `main_test.go` to reproduce the conditions and verify safety and cleanup behavior. #### Which issue(s) this PR is related to: Fixes b/511322601 #### Release Note ```release-note Fixed a logic issue in `resourcectl` where stale heartbeat PIDs could lead to arbitrary process termination due to PID recycling. ``` * address feedback * pr feedback
* dev/tools/resourcectl: fix persistent PID landmine and prevent arbitrary process termination #### What this PR does / why we need it: Fixes a logic vulnerability in `resourcectl cleanup` where stale heartbeat process PIDs remained permanently trapped in the local state file (`~/.local/resourcectl/state.json`) if the heartbeat process exited prematurely. When the OS eventually recycled that PID and assigned it to a new innocent process, the subsequent execution of `resourcectl cleanup` would violently terminate that unrelated process via `SIGKILL`. To fix this, we: - Safely verify the heartbeat process identity on Linux by reading and parsing `/proc/<pid>/cmdline` to ensure it contains the `"heartbeat"` and `"--name" <resource>` arguments before attempting to kill it. - Guarantee that stale PIDs are zeroed in `state.json` even if the kill operation fails. - Prune fully released resources completely from the state file rather than leaving empty structs, preventing unbounded file growth. - Ignore `ESRCH` errors if a heartbeat process is already dead. - Added comprehensive unit tests in `main_test.go` to reproduce the conditions and verify safety and cleanup behavior. #### Which issue(s) this PR is related to: Fixes b/511322601 #### Release Note ```release-note Fixed a logic issue in `resourcectl` where stale heartbeat PIDs could lead to arbitrary process termination due to PID recycling. ``` * address feedback * pr feedback
What this PR does / why we need it:
Fixes a logic vulnerability in
resourcectl cleanupwhere stale heartbeat process PIDs remained permanently trapped in the local state file (~/.local/resourcectl/state.json) if the heartbeat process exited prematurely. When the OS eventually recycled that PID and assigned it to a new innocent process, the subsequent execution ofresourcectl cleanupwould violently terminate that unrelated process viaSIGKILL.To fix this, we:
/proc/<pid>/cmdlineto ensure it contains the"heartbeat"and"--name" <resource>arguments before attempting to kill it.state.jsoneven if the kill operation fails.ESRCHerrors if a heartbeat process is already dead.main_test.goto reproduce the conditions and verify safety and cleanup behavior.Which issue(s) this PR is related to:
Fixes b/511322601
Release Note