Skip to content

Fix concurrency race condition in resourcectl local state management#934

Merged
k8s-ci-robot merged 7 commits into
kubernetes-sigs:mainfrom
aditya-shantanu:fix-resourcectl-race-condition
Jun 15, 2026
Merged

Fix concurrency race condition in resourcectl local state management#934
k8s-ci-robot merged 7 commits into
kubernetes-sigs:mainfrom
aditya-shantanu:fix-resourcectl-race-condition

Conversation

@aditya-shantanu

@aditya-shantanu aditya-shantanu commented Jun 4, 2026

Copy link
Copy Markdown
Collaborator

What this PR does / why we need it:

This PR fixes a concurrency race condition in the local state management of the resourcectl CLI utility (dev/tools/resourcectl/main.go). Previously, multiple instances of resourcectl get running concurrently would read the same ~/.local/resourcectl/state.json file and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file using github.com/gofrs/flock to guarantee atomicity and prevent data loss.

Which issue(s) this PR is related to:

Fixes https://b.corp.google.com/issues/511322627

Release Note

NONE

Summary by CodeRabbit

  • Bug Fixes

    • Serialized CLI state access with file locking to prevent corruption during concurrent runs.
    • Background heartbeat is only started after state is persisted to avoid orphaned processes and ensure recorded PIDs are durable.
    • Cleanup now safely handles concurrent updates and re-appends resources that fail to release without overwriting new entries.
  • Tests

    • Added a concurrency test validating safe simultaneous state updates and deduplication.
#### What this PR does / why we need it:
This PR fixes a concurrency race condition in the local state management of the `resourcectl` CLI utility (`dev/tools/resourcectl/main.go`). Previously, multiple instances of `resourcectl get` running concurrently would read the same `~/.local/resourcectl/state.json` file and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file using `github.com/gofrs/flock` to guarantee atomicity and prevent data loss.

#### Which issue(s) this PR is related to:
Fixes https://b.corp.google.com/issues/511322627

#### Release Note
```release-note
NONE
```
@netlify

netlify Bot commented Jun 4, 2026

Copy link
Copy Markdown

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit d773499
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a2a271ef494c7000838e457
@k8s-ci-robot k8s-ci-robot requested review from janetkuo and justinsb June 4, 2026 15:51
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 4, 2026
@coderabbitai

coderabbitai Bot commented Jun 4, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Add file-based locking to serialize state.json access: introduce a lock-file helper and updateState(fn) that acquires a flock for guarded read-modify-write, refactor runGet/runCleanup to use the helper, add a concurrent test, and promote flock to a direct go.mod dependency.

Changes

State file locking via flock

Layer / File(s) Summary
Module dependency bump
dev/tools/go.mod
Promote github.com/gofrs/flock to a direct requirement in the tools module.
Lock infrastructure setup
dev/tools/resourcectl/main.go
Imported github.com/gofrs/flock and added stateLockFilePath() to derive state.lock from the state directory.
Guarded state update helper
dev/tools/resourcectl/main.go
Added updateState(fn) which acquires a flock, runs a read-modify-write closure, writes the updated state, and releases the lock.
runGet: persist acquired resource under lock
dev/tools/resourcectl/main.go
Refactored runGet to append new BoskosResources via updateState, starting heartbeats while the lock is held to avoid orphaned heartbeats and attempting cleanup on persistence failure.
runCleanup: snapshot and clear under lock
dev/tools/resourcectl/main.go
Refactored runCleanup to snapshot and clear tracked resources under lock (Phase 1), perform kill/release work outside the lock (Phase 2), and collect failures.
runCleanup: re-persist remaining failures
dev/tools/resourcectl/main.go
After release attempts, re-acquire the lock only to append any remaining resources back into state without clobbering concurrent updates (Phase 3).
Concurrency test for state updates
dev/tools/resourcectl/main_test.go
Add TestConcurrentStateUpdates to run multiple goroutines calling updateState concurrently and assert all appended entries persist; add sync import and use t.Setenv for HOME sandboxing.

🎯 4 (Complex) | ⏱️ ~45 minutes

"I'm a rabbit with a tiny key,
I guard the file so states agree.
Locks in place, no more distress,
Concurrent writes now pass the test. 🐇🔐"

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately summarizes the main change: introducing file-based locking to fix a concurrency race condition in resourcectl's local state management.
Description check ✅ Passed The description follows the required template with all key sections completed: it clearly explains the problem (concurrent writes causing data loss), the solution (file-based locking with gofrs/flock), the related issue, and the release note.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 4, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 319-327: The file lock acquired via stateLockFilePath() and
fileLock.Lock() currently surrounds the entire cleanup kill/release loop and
blocks other state operations; change to a two-phase approach: acquire fileLock
only to read and deserialize state (use fileLock.Lock()/Unlock()), then
immediately release before performing slow external kill/release calls, and
finally re-acquire the lock to apply and persist state changes (use
fileLock.Lock()/Unlock() around the write). Locate usages of
stateLockFilePath(), fileLock, and the cleanup kill/release loop in the cleanup
routine and refactor so external calls run outside the locked section;
re-acquire the same lock only when writing back updated state to disk.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4cb68afc-be34-41f3-9982-5a955c94c191

📥 Commits

Reviewing files that changed from the base of the PR and between 8cff34d and f7a3ab0.

📒 Files selected for processing (1)
  • dev/tools/resourcectl/main.go
Comment thread dev/tools/resourcectl/main.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a real concurrency issue in resourcectl’s local state management by introducing a file lock to protect the read/modify/write cycle of ~/.local/resourcectl/state.json, preventing concurrent get/cleanup invocations from overwriting each other and losing tracked resources.

Changes:

  • Add a dedicated lock file (state.lock) colocated with the state file.
  • Use github.com/gofrs/flock to serialize state updates in runGet and runCleanup.
Comment thread dev/tools/resourcectl/main.go Outdated
Comment thread dev/tools/resourcectl/main.go Outdated
Comment thread dev/tools/resourcectl/main.go Outdated
Address PR review feedback:

- runCleanup: split into two phases so the state-file lock is no longer
  held across the slow kill/release calls (network I/O to Boskos).
  Phase 1 atomically takes ownership of the tracked resources and clears
  the state under the lock; phase 2 kills heartbeats and releases from
  Boskos without the lock; phase 3 re-acquires the lock and re-adds only
  the resources that failed to release, re-reading state so concurrently
  added resources are not clobbered.

- runGet: acquire the state lock before starting the heartbeat process so
  a process blocked on the lock never leaves an orphaned heartbeat if it
  is interrupted before persisting the resource.

- Add updateState helper for the locked read-modify-write cycle shared by
  runGet and runCleanup.

- Add TestConcurrentStateUpdates exercising concurrent state updates under
  contention and asserting no entries are lost.

- go.mod: promote github.com/gofrs/flock from indirect to a direct
  dependency (go mod tidy), fixing presubmit-test-autogen-up-to-date.
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 4, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
dev/tools/resourcectl/main_test.go (1)

125-130: ⚡ Quick win

Assert the resource identities, not just the final count.

A broken write path can still satisfy len(state.BoskosResources) == n while duplicating or corrupting entries. Checking the expected resource-%d set would make this test actually prove that no update was clobbered under contention.

Proposed assertion tightening
 	state, err := readState()
 	if err != nil {
 		t.Fatalf("failed to read state file: %v", err)
 	}
 	if len(state.BoskosResources) != n {
 		t.Errorf("expected %d resources after concurrent updates, but got %d", n, len(state.BoskosResources))
 	}
+	seen := make(map[string]int, len(state.BoskosResources))
+	for _, resource := range state.BoskosResources {
+		seen[resource.Name]++
+	}
+	for i := 0; i < n; i++ {
+		name := fmt.Sprintf("resource-%d", i)
+		if seen[name] != 1 {
+			t.Fatalf("expected exactly one %q entry, got %d", name, seen[name])
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dev/tools/resourcectl/main_test.go` around lines 125 - 130, The test reads
state via readState() and only asserts len(state.BoskosResources) == n, which
can miss duplicates or corrupted entries; update the test to assert the exact
set of resource IDs (e.g., "resource-%d") is present and unique in
state.BoskosResources after concurrent updates: build the expected names for i
in 0..n-1, iterate state.BoskosResources to collect actual IDs, verify the sets
are equal and that there are no duplicates (or that every expected "resource-%d"
exists), and fail the test with a clear message if any expected ID is missing or
duplicated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 302-320: The code starts the heartbeat process (cmd) then appends
a BoskosResource and relies on updateState/writeState to persist it; if
persistence fails the heartbeat is left running. After cmd.Start() and before
returning the error from updateState/writeState failures, ensure best-effort
cleanup: kill the heartbeat process group (use negative PID since
SysProcAttr.Setpgid=true), wait for the process to exit, and if SIGTERM fails
escalate to SIGKILL; log any cleanup errors. Update the runGet/updateState
caller path to perform this cleanup when updateState or writeState returns an
error so no orphaned heartbeat (referenced symbols: cmd, SysProcAttr,
cmd.Process.Pid, BoskosResource, updateState, writeState, heartbeat).

---

Nitpick comments:
In `@dev/tools/resourcectl/main_test.go`:
- Around line 125-130: The test reads state via readState() and only asserts
len(state.BoskosResources) == n, which can miss duplicates or corrupted entries;
update the test to assert the exact set of resource IDs (e.g., "resource-%d") is
present and unique in state.BoskosResources after concurrent updates: build the
expected names for i in 0..n-1, iterate state.BoskosResources to collect actual
IDs, verify the sets are equal and that there are no duplicates (or that every
expected "resource-%d" exists), and fail the test with a clear message if any
expected ID is missing or duplicated.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 71ba71fb-3f58-4a74-9059-bad39ceae340

📥 Commits

Reviewing files that changed from the base of the PR and between f7a3ab0 and 2cbac4f.

📒 Files selected for processing (3)
  • dev/tools/go.mod
  • dev/tools/resourcectl/main.go
  • dev/tools/resourcectl/main_test.go
Comment thread dev/tools/resourcectl/main.go
Comment thread dev/tools/resourcectl/main.go
Comment thread dev/tools/resourcectl/main_test.go Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
dev/tools/resourcectl/main.go (1)

218-220: ⚡ Quick win

Use %w verb for error wrapping consistency.

Per coding guidelines, errors should be wrapped with %w to preserve the error chain for errors.Is/errors.As.

Suggested fix
 	if err := fileLock.Lock(); err != nil {
-		return fmt.Errorf("error acquiring lock: %v", err)
+		return fmt.Errorf("error acquiring lock: %w", err)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@dev/tools/resourcectl/main.go` around lines 218 - 220, Replace the fmt.Errorf
call that returns the lock acquisition error to use the %w verb so the original
error is wrapped (preserving error chain); specifically update the return in the
block that checks fileLock.Lock() (the code that currently does return
fmt.Errorf("error acquiring lock: %v", err)) to return fmt.Errorf("error
acquiring lock: %w", err) instead.

Source: Coding guidelines

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@dev/tools/resourcectl/main.go`:
- Around line 218-220: Replace the fmt.Errorf call that returns the lock
acquisition error to use the %w verb so the original error is wrapped
(preserving error chain); specifically update the return in the block that
checks fileLock.Lock() (the code that currently does return fmt.Errorf("error
acquiring lock: %v", err)) to return fmt.Errorf("error acquiring lock: %w", err)
instead.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6ee71796-a1f1-4540-86c2-ec2596cd7021

📥 Commits

Reviewing files that changed from the base of the PR and between 4183f1b and 0022f4e.

📒 Files selected for processing (2)
  • dev/tools/resourcectl/main.go
  • dev/tools/resourcectl/main_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • dev/tools/resourcectl/main_test.go
@aditya-shantanu aditya-shantanu requested a review from janetkuo June 9, 2026 20:31
Comment thread dev/tools/resourcectl/main.go Outdated
Use %w instead of %v when wrapping the flock acquire error in
updateState so the underlying error is preserved in the chain for
errors.Is / errors.As, per the project error-wrapping guideline.

Addresses review feedback from @janetkuo.
@aditya-shantanu aditya-shantanu requested a review from janetkuo June 11, 2026 05:21

@janetkuo janetkuo left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 15, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aditya-shantanu, janetkuo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 15, 2026
@k8s-ci-robot k8s-ci-robot merged commit 2dd5376 into kubernetes-sigs:main Jun 15, 2026
12 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Agent Sandbox Jun 15, 2026
@aditya-shantanu aditya-shantanu deleted the fix-resourcectl-race-condition branch June 16, 2026 15:57
alexatakvelon pushed a commit to volatilemolotov/agent-sandbox that referenced this pull request Jun 24, 2026
…ubernetes-sigs#934)

* Fix concurrency race condition in resourcectl local state management

#### What this PR does / why we need it:
This PR fixes a concurrency race condition in the local state management of the `resourcectl` CLI utility (`dev/tools/resourcectl/main.go`). Previously, multiple instances of `resourcectl get` running concurrently would read the same `~/.local/resourcectl/state.json` file and overwrite each other's additions, leading to permanent loss of tracked Boskos resources and resource leaks as their heartbeat processes ran indefinitely. This change introduces file-based locking around the read-modify-write cycle of the state file using `github.com/gofrs/flock` to guarantee atomicity and prevent data loss.

#### Which issue(s) this PR is related to:
Fixes https://b.corp.google.com/issues/511322627

#### Release Note
```release-note
NONE
```

* resourcectl: bound state-lock hold time and fix flock go.mod entry

Address PR review feedback:

- runCleanup: split into two phases so the state-file lock is no longer
  held across the slow kill/release calls (network I/O to Boskos).
  Phase 1 atomically takes ownership of the tracked resources and clears
  the state under the lock; phase 2 kills heartbeats and releases from
  Boskos without the lock; phase 3 re-acquires the lock and re-adds only
  the resources that failed to release, re-reading state so concurrently
  added resources are not clobbered.

- runGet: acquire the state lock before starting the heartbeat process so
  a process blocked on the lock never leaves an orphaned heartbeat if it
  is interrupted before persisting the resource.

- Add updateState helper for the locked read-modify-write cycle shared by
  runGet and runCleanup.

- Add TestConcurrentStateUpdates exercising concurrent state updates under
  contention and asserting no entries are lost.

- go.mod: promote github.com/gofrs/flock from indirect to a direct
  dependency (go mod tidy), fixing presubmit-test-autogen-up-to-date.

* Address PR review comments

* fix formatting issue

* resourcectl: best-effort Boskos release on state-write failure; use t.Setenv in tests

* resourcectl: tolerate already-exited heartbeat in cleanup; fix flaky helper sleep

* resourcectl: wrap lock-acquire error with %w

Use %w instead of %v when wrapping the flock acquire error in
updateState so the underlying error is preserved in the chain for
errors.Is / errors.As, per the project error-wrapping guideline.

Addresses review feedback from @janetkuo.

---------

Co-authored-by: Aditya Shantanu <aditya-shantanu@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ready-for-review size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

4 participants