Skip to content

[RFC] Add claimed label to agent_sandboxes metric#711

Merged
k8s-ci-robot merged 5 commits into
kubernetes-sigs:mainfrom
rainwoodman:feyu/add-claimed-label-to-metrics
May 19, 2026
Merged

[RFC] Add claimed label to agent_sandboxes metric#711
k8s-ci-robot merged 5 commits into
kubernetes-sigs:mainfrom
rainwoodman:feyu/add-claimed-label-to-metrics

Conversation

@rainwoodman

Copy link
Copy Markdown
Contributor

Problem Statement

Currently, the agent_sandboxes metric does not distinguish between sandboxes that are actively claimed by a user/agent and those that are sitting idle in the warm pool. This makes it difficult to track actual utilization of the sandbox pool from metrics alone.

Proposed Solution (RFC)

To provide a more reliable and explicit way to track this state, this PR proposes adding a claimed="true|false" label to the agent_sandboxes metric.

Instead of relying on annotations, this implementation checks the OwnerReferences of the Sandbox resource. If the Sandbox is controlled by a SandboxClaim, it is marked as claimed="true". Otherwise, it defaults to "false".

This approach keeps the metric cardinality low (only two values for claimed) while providing high-value visibility into the actual utilization of the sandbox pool.

Intention

This PR is submitted as an RFC (Request for Comments). While the code is functional and includes unit tests, the primary goal is to start a discussion on the best way to model this state in our metrics and to seek general advice on actionable metrics for the Sandbox system.

Specifically, we are interested in metrics that can help answer operational questions such as:

  • Is the warm pool healthy? (e.g., detecting unusable or stuck sandboxes in the warm pool).
  • Do we need to upsize the sandbox serving cluster? (e.g., tracking current utilization relative to max nodes limit).
  • Corroborating upstream systems: Providing numbers on concurrent sandboxes actively being used (claimed) to compare with upstream expectations.

We are also looking for feedback on:

  • Whether a boolean claimed label is the right abstraction for this specific metric.
  • What other metrics would be valuable to make the system more observable and operable (e.g., warm pool depletion rate, detailed breakdown of startup latency stages).
  • If this aligns with the project's long-term observability goals.

We hope this sparks some inspiration and discussion!

Changes Included

  • internal/metrics/metrics.go: Added "claimed" to AgentSandboxesDesc labels.
  • internal/metrics/sandbox_collector.go: Updated Collect method to check OwnerReferences and set the claimed label.
  • internal/metrics/sandbox_collector_test.go: Added test cases to verify the new label behavior.

Verification Results

  • Unit tests passed: go test -v ./internal/metrics/...
  • Linter passed with 0 issues.
@netlify

netlify Bot commented Apr 29, 2026

Copy link
Copy Markdown

Deploy Preview for agent-sandbox ready!

Name Link
🔨 Latest commit a42f807
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a039add9806410008c3531f
😎 Deploy Preview https://deploy-preview-711--agent-sandbox.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot requested review from igooch and justinsb April 29, 2026 00:34
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Welcome @rainwoodman!

It looks like this is your first PR to kubernetes-sigs/agent-sandbox 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/agent-sandbox has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 29, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

Hi @rainwoodman. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels Apr 29, 2026
@janetkuo janetkuo requested a review from Copilot April 29, 2026 01:00

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This RFC improves observability of sandbox utilization by extending the existing agent_sandboxes Prometheus metric to explicitly distinguish claimed (in-use) sandboxes from unclaimed ones, using OwnerReferences to detect Sandboxes controlled by a SandboxClaim.

Changes:

  • Add a claimed="true|false" label to the agent_sandboxes metric descriptor.
  • Populate claimed during collection by checking the Sandbox controller OwnerReference Kind.
  • Extend unit tests to assert the new label behavior, including a claimed Sandbox case.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
internal/metrics/metrics.go Adds claimed to the agent_sandboxes descriptor label set.
internal/metrics/sandbox_collector.go Computes claimed based on controller OwnerReference and includes it in metric key/labels.
internal/metrics/sandbox_collector_test.go Updates existing expectations to include claimed and adds a claimed-sandbox test case.
Comment thread internal/metrics/sandbox_collector_test.go Outdated
Comment thread internal/metrics/sandbox_collector_test.go Outdated
Comment thread internal/metrics/metrics.go
Comment thread internal/metrics/sandbox_collector.go Outdated
@janetkuo janetkuo added action-required: resolve-copilot-comments ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 29, 2026
@janetkuo janetkuo removed this from Agent Sandbox Apr 29, 2026
@janetkuo janetkuo removed this from Agent Sandbox Apr 29, 2026

@igooch igooch left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple comments, overall the PR looks look and is an improvement on the current metric.

Comment thread internal/metrics/sandbox_collector.go Outdated
Comment thread internal/metrics/sandbox_collector.go Outdated
Goal:
The goal of this change is to add a claimed label to the agent_sandboxes
metric to distinguish claimed sandboxes from those in the warm pool. This
helps in monitoring and understanding the state of sandboxes in the cluster
more accurately.

Summary of Changes:
- Updated internal/metrics/metrics.go to include the claimed label in the
  AgentSandboxesDesc descriptor.
- Updated internal/metrics/sandbox_collector.go to populate the claimed
  label by checking the OwnerReferences of each Sandbox. If a Sandbox is
  owned by a SandboxClaim, it is marked as claimed ("true"), otherwise
  "false".
- Updated internal/metrics/sandbox_collector_test.go to cover the new
  label, including a specific test case for claimed sandboxes.

Verification Results:
- Unit tests passed successfully: go test -v ./internal/metrics/...
- Linter passed with 0 issues: ./dev/tools/lint-go
We have addressed the review comments from igooch on PR 711.

Goal of Changes:
The goal was to refactor the sandbox metrics collector to:
1. Avoid hardcoded API versions by dynamically retrieving them from the
   extensions API package.
2. Improve code readability by using a switch statement instead of nested
   if/else blocks.
3. Ensure consistent naming for the warm pool owned_by metric label value
   (SandboxWarmPool instead of Warmpool).

Changes Made:

- Sandbox Collector Refactoring:
  Imported sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1 and used it to
  dynamically determine the API version. Refactored the owner reference check
  to use a switch statement and updated the warm pool label value.

- Metrics Documentation Update:
  Updated the documentation comments for the agent_sandboxes metric descriptor
  to reflect the new SandboxWarmPool label value.

- Test Updates:
  Updated the unit tests to assert the new SandboxWarmPool label value instead
  of Warmpool.

Verification:
Ran the metrics package tests and verified they all passed:
go test -v ./internal/metrics/...
@rainwoodman rainwoodman force-pushed the feyu/add-claimed-label-to-metrics branch from 13438b8 to a42f807 Compare May 12, 2026 21:25
@rainwoodman rainwoodman requested a review from igooch May 12, 2026 21:26

@igooch igooch left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 12, 2026

@igooch igooch left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label May 19, 2026
@k8s-ci-robot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: igooch, rainwoodman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 6fbb999 into kubernetes-sigs:main May 19, 2026
11 checks passed
@github-project-automation github-project-automation Bot moved this from Backlog to Done in Agent Sandbox May 19, 2026
khirotaka pushed a commit to khirotaka/agent-sandbox that referenced this pull request Jun 12, 2026
* metrics: add claimed label to agent_sandboxes metric

Goal:
The goal of this change is to add a claimed label to the agent_sandboxes
metric to distinguish claimed sandboxes from those in the warm pool. This
helps in monitoring and understanding the state of sandboxes in the cluster
more accurately.

Summary of Changes:
- Updated internal/metrics/metrics.go to include the claimed label in the
  AgentSandboxesDesc descriptor.
- Updated internal/metrics/sandbox_collector.go to populate the claimed
  label by checking the OwnerReferences of each Sandbox. If a Sandbox is
  owned by a SandboxClaim, it is marked as claimed ("true"), otherwise
  "false".
- Updated internal/metrics/sandbox_collector_test.go to cover the new
  label, including a specific test case for claimed sandboxes.

Verification Results:
- Unit tests passed successfully: go test -v ./internal/metrics/...
- Linter passed with 0 issues: ./dev/tools/lint-go

* metrics: replace claimed label with owned_by on agent_sandboxes metric

* metrics: address Copilot comments on PR 711

* metrics: fix linter issues on PR 711

* Address PR 711 review comments from igooch

We have addressed the review comments from igooch on PR 711.

Goal of Changes:
The goal was to refactor the sandbox metrics collector to:
1. Avoid hardcoded API versions by dynamically retrieving them from the
   extensions API package.
2. Improve code readability by using a switch statement instead of nested
   if/else blocks.
3. Ensure consistent naming for the warm pool owned_by metric label value
   (SandboxWarmPool instead of Warmpool).

Changes Made:

- Sandbox Collector Refactoring:
  Imported sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1 and used it to
  dynamically determine the API version. Refactored the owner reference check
  to use a switch statement and updated the warm pool label value.

- Metrics Documentation Update:
  Updated the documentation comments for the agent_sandboxes metric descriptor
  to reflect the new SandboxWarmPool label value.

- Test Updates:
  Updated the unit tests to assert the new SandboxWarmPool label value instead
  of Warmpool.

Verification:
Ran the metrics package tests and verified they all passed:
go test -v ./internal/metrics/...
alexatakvelon pushed a commit to volatilemolotov/agent-sandbox that referenced this pull request Jun 24, 2026
* metrics: add claimed label to agent_sandboxes metric

Goal:
The goal of this change is to add a claimed label to the agent_sandboxes
metric to distinguish claimed sandboxes from those in the warm pool. This
helps in monitoring and understanding the state of sandboxes in the cluster
more accurately.

Summary of Changes:
- Updated internal/metrics/metrics.go to include the claimed label in the
  AgentSandboxesDesc descriptor.
- Updated internal/metrics/sandbox_collector.go to populate the claimed
  label by checking the OwnerReferences of each Sandbox. If a Sandbox is
  owned by a SandboxClaim, it is marked as claimed ("true"), otherwise
  "false".
- Updated internal/metrics/sandbox_collector_test.go to cover the new
  label, including a specific test case for claimed sandboxes.

Verification Results:
- Unit tests passed successfully: go test -v ./internal/metrics/...
- Linter passed with 0 issues: ./dev/tools/lint-go

* metrics: replace claimed label with owned_by on agent_sandboxes metric

* metrics: address Copilot comments on PR 711

* metrics: fix linter issues on PR 711

* Address PR 711 review comments from igooch

We have addressed the review comments from igooch on PR 711.

Goal of Changes:
The goal was to refactor the sandbox metrics collector to:
1. Avoid hardcoded API versions by dynamically retrieving them from the
   extensions API package.
2. Improve code readability by using a switch statement instead of nested
   if/else blocks.
3. Ensure consistent naming for the warm pool owned_by metric label value
   (SandboxWarmPool instead of Warmpool).

Changes Made:

- Sandbox Collector Refactoring:
  Imported sigs.k8s.io/agent-sandbox/extensions/api/v1alpha1 and used it to
  dynamically determine the API version. Refactored the owner reference check
  to use a switch statement and updated the warm pool label value.

- Metrics Documentation Update:
  Updated the documentation comments for the agent_sandboxes metric descriptor
  to reflect the new SandboxWarmPool label value.

- Test Updates:
  Updated the unit tests to assert the new SandboxWarmPool label value instead
  of Warmpool.

Verification:
Ran the metrics package tests and verified they all passed:
go test -v ./internal/metrics/...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. ready-for-review size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

5 participants