Fix concurrency race condition in resourcectl local state management#934
Conversation
#### What this PR does / why we need it: This PR fixes a concurrency race condition in the local state management of the `resourcectl` CLI utility (`dev/tools/resourcectl/main.go`). Previously, multiple instances of `resourcectl get` running concurrently would read the same `~/.local/resourcectl/state.json` file and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file using `github.com/gofrs/flock` to guarantee atomicity and prevent data loss. #### Which issue(s) this PR is related to: Fixes https://b.corp.google.com/issues/511322627 #### Release Note ```release-note NONE ```
✅ Deploy Preview for agent-sandbox canceled.
|
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdd file-based locking to serialize state.json access: introduce a lock-file helper and updateState(fn) that acquires a flock for guarded read-modify-write, refactor runGet/runCleanup to use the helper, add a concurrent test, and promote flock to a direct go.mod dependency. ChangesState file locking via flock
🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 319-327: The file lock acquired via stateLockFilePath() and
fileLock.Lock() currently surrounds the entire cleanup kill/release loop and
blocks other state operations; change to a two-phase approach: acquire fileLock
only to read and deserialize state (use fileLock.Lock()/Unlock()), then
immediately release before performing slow external kill/release calls, and
finally re-acquire the lock to apply and persist state changes (use
fileLock.Lock()/Unlock() around the write). Locate usages of
stateLockFilePath(), fileLock, and the cleanup kill/release loop in the cleanup
routine and refactor so external calls run outside the locked section;
re-acquire the same lock only when writing back updated state to disk.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 4cb68afc-be34-41f3-9982-5a955c94c191
📒 Files selected for processing (1)
dev/tools/resourcectl/main.go
There was a problem hiding this comment.
Pull request overview
This PR addresses a real concurrency issue in resourcectl’s local state management by introducing a file lock to protect the read/modify/write cycle of ~/.local/resourcectl/state.json, preventing concurrent get/cleanup invocations from overwriting each other and losing tracked resources.
Changes:
- Add a dedicated lock file (
state.lock) colocated with the state file. - Use
github.com/gofrs/flockto serialize state updates inrunGetandrunCleanup.
Address PR review feedback: - runCleanup: split into two phases so the state-file lock is no longer held across the slow kill/release calls (network I/O to Boskos). Phase 1 atomically takes ownership of the tracked resources and clears the state under the lock; phase 2 kills heartbeats and releases from Boskos without the lock; phase 3 re-acquires the lock and re-adds only the resources that failed to release, re-reading state so concurrently added resources are not clobbered. - runGet: acquire the state lock before starting the heartbeat process so a process blocked on the lock never leaves an orphaned heartbeat if it is interrupted before persisting the resource. - Add updateState helper for the locked read-modify-write cycle shared by runGet and runCleanup. - Add TestConcurrentStateUpdates exercising concurrent state updates under contention and asserting no entries are lost. - go.mod: promote github.com/gofrs/flock from indirect to a direct dependency (go mod tidy), fixing presubmit-test-autogen-up-to-date.
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
dev/tools/resourcectl/main_test.go (1)
125-130: ⚡ Quick winAssert the resource identities, not just the final count.
A broken write path can still satisfy
len(state.BoskosResources) == nwhile duplicating or corrupting entries. Checking the expectedresource-%dset would make this test actually prove that no update was clobbered under contention.Proposed assertion tightening
state, err := readState() if err != nil { t.Fatalf("failed to read state file: %v", err) } if len(state.BoskosResources) != n { t.Errorf("expected %d resources after concurrent updates, but got %d", n, len(state.BoskosResources)) } + seen := make(map[string]int, len(state.BoskosResources)) + for _, resource := range state.BoskosResources { + seen[resource.Name]++ + } + for i := 0; i < n; i++ { + name := fmt.Sprintf("resource-%d", i) + if seen[name] != 1 { + t.Fatalf("expected exactly one %q entry, got %d", name, seen[name]) + } + }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dev/tools/resourcectl/main_test.go` around lines 125 - 130, The test reads state via readState() and only asserts len(state.BoskosResources) == n, which can miss duplicates or corrupted entries; update the test to assert the exact set of resource IDs (e.g., "resource-%d") is present and unique in state.BoskosResources after concurrent updates: build the expected names for i in 0..n-1, iterate state.BoskosResources to collect actual IDs, verify the sets are equal and that there are no duplicates (or that every expected "resource-%d" exists), and fail the test with a clear message if any expected ID is missing or duplicated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 302-320: The code starts the heartbeat process (cmd) then appends
a BoskosResource and relies on updateState/writeState to persist it; if
persistence fails the heartbeat is left running. After cmd.Start() and before
returning the error from updateState/writeState failures, ensure best-effort
cleanup: kill the heartbeat process group (use negative PID since
SysProcAttr.Setpgid=true), wait for the process to exit, and if SIGTERM fails
escalate to SIGKILL; log any cleanup errors. Update the runGet/updateState
caller path to perform this cleanup when updateState or writeState returns an
error so no orphaned heartbeat (referenced symbols: cmd, SysProcAttr,
cmd.Process.Pid, BoskosResource, updateState, writeState, heartbeat).
---
Nitpick comments:
In `@dev/tools/resourcectl/main_test.go`:
- Around line 125-130: The test reads state via readState() and only asserts
len(state.BoskosResources) == n, which can miss duplicates or corrupted entries;
update the test to assert the exact set of resource IDs (e.g., "resource-%d") is
present and unique in state.BoskosResources after concurrent updates: build the
expected names for i in 0..n-1, iterate state.BoskosResources to collect actual
IDs, verify the sets are equal and that there are no duplicates (or that every
expected "resource-%d" exists), and fail the test with a clear message if any
expected ID is missing or duplicated.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 71ba71fb-3f58-4a74-9059-bad39ceae340
📒 Files selected for processing (3)
dev/tools/go.moddev/tools/resourcectl/main.godev/tools/resourcectl/main_test.go
There was a problem hiding this comment.
🧹 Nitpick comments (1)
dev/tools/resourcectl/main.go (1)
218-220: ⚡ Quick winUse
%wverb for error wrapping consistency.Per coding guidelines, errors should be wrapped with
%wto preserve the error chain forerrors.Is/errors.As.Suggested fix
if err := fileLock.Lock(); err != nil { - return fmt.Errorf("error acquiring lock: %v", err) + return fmt.Errorf("error acquiring lock: %w", err) }🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@dev/tools/resourcectl/main.go` around lines 218 - 220, Replace the fmt.Errorf call that returns the lock acquisition error to use the %w verb so the original error is wrapped (preserving error chain); specifically update the return in the block that checks fileLock.Lock() (the code that currently does return fmt.Errorf("error acquiring lock: %v", err)) to return fmt.Errorf("error acquiring lock: %w", err) instead.Source: Coding guidelines
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Nitpick comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 218-220: Replace the fmt.Errorf call that returns the lock
acquisition error to use the %w verb so the original error is wrapped
(preserving error chain); specifically update the return in the block that
checks fileLock.Lock() (the code that currently does return fmt.Errorf("error
acquiring lock: %v", err)) to return fmt.Errorf("error acquiring lock: %w", err)
instead.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 6ee71796-a1f1-4540-86c2-ec2596cd7021
📒 Files selected for processing (2)
dev/tools/resourcectl/main.godev/tools/resourcectl/main_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
- dev/tools/resourcectl/main_test.go
Use %w instead of %v when wrapping the flock acquire error in updateState so the underlying error is preserved in the chain for errors.Is / errors.As, per the project error-wrapping guideline. Addresses review feedback from @janetkuo.
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aditya-shantanu, janetkuo The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
…ubernetes-sigs#934) * Fix concurrency race condition in resourcectl local state management #### What this PR does / why we need it: This PR fixes a concurrency race condition in the local state management of the `resourcectl` CLI utility (`dev/tools/resourcectl/main.go`). Previously, multiple instances of `resourcectl get` running concurrently would read the same `~/.local/resourcectl/state.json` file and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file using `github.com/gofrs/flock` to guarantee atomicity and prevent data loss. #### Which issue(s) this PR is related to: Fixes https://b.corp.google.com/issues/511322627 #### Release Note ```release-note NONE ``` * resourcectl: bound state-lock hold time and fix flock go.mod entry Address PR review feedback: - runCleanup: split into two phases so the state-file lock is no longer held across the slow kill/release calls (network I/O to Boskos). Phase 1 atomically takes ownership of the tracked resources and clears the state under the lock; phase 2 kills heartbeats and releases from Boskos without the lock; phase 3 re-acquires the lock and re-adds only the resources that failed to release, re-reading state so concurrently added resources are not clobbered. - runGet: acquire the state lock before starting the heartbeat process so a process blocked on the lock never leaves an orphaned heartbeat if it is interrupted before persisting the resource. - Add updateState helper for the locked read-modify-write cycle shared by runGet and runCleanup. - Add TestConcurrentStateUpdates exercising concurrent state updates under contention and asserting no entries are lost. - go.mod: promote github.com/gofrs/flock from indirect to a direct dependency (go mod tidy), fixing presubmit-test-autogen-up-to-date. * Address PR review comments * fix formatting issue * resourcectl: best-effort Boskos release on state-write failure; use t.Setenv in tests * resourcectl: tolerate already-exited heartbeat in cleanup; fix flaky helper sleep * resourcectl: wrap lock-acquire error with %w Use %w instead of %v when wrapping the flock acquire error in updateState so the underlying error is preserved in the chain for errors.Is / errors.As, per the project error-wrapping guideline. Addresses review feedback from @janetkuo. --------- Co-authored-by: Aditya Shantanu <aditya-shantanu@users.noreply.github.com>
What this PR does / why we need it:
This PR fixes a concurrency race condition in the local state management of the
resourcectlCLI utility (dev/tools/resourcectl/main.go). Previously, multiple instances ofresourcectl getrunning concurrently would read the same~/.local/resourcectl/state.jsonfile and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file usinggithub.com/gofrs/flockto guarantee atomicity and prevent data loss.Which issue(s) this PR is related to:
Fixes https://b.corp.google.com/issues/511322627
Release Note
Summary by CodeRabbit
Bug Fixes
Tests