Skip to content

metric: add Prometheus lifecycle metrics for direct Sandbox creation#1076

Open
aegeiger wants to merge 3 commits into
kubernetes-sigs:mainfrom
aegeiger:feature/controller-metrics-for-sandbox-ready
Open

metric: add Prometheus lifecycle metrics for direct Sandbox creation#1076
aegeiger wants to merge 3 commits into
kubernetes-sigs:mainfrom
aegeiger:feature/controller-metrics-for-sandbox-ready

Conversation

@aegeiger

@aegeiger aegeiger commented Jul 1, 2026

Copy link
Copy Markdown

What this PR does / why we need it:

Sandbox lifecycle metrics (agent_sandbox_creation_latency_ms, claim
startup latency, etc.) were recorded exclusively by the SandboxClaim
extension controller. Sandboxes created directly — without a
SandboxClaim — only appeared in the point-in-time agent_sandboxes
gauge with no creation latency or ready latency telemetry.

This PR moves agent_sandbox_creation_latency_ms recording into the
core SandboxReconciler and adds a new agent_sandbox_ready_latency_ms
histogram so that all Sandboxes get lifecycle telemetry regardless of
creation path.

Existing metric (moved):

  • agent_sandbox_creation_latency_ms — now recorded by the core
    SandboxReconciler instead of SandboxClaimReconciler. The label set
    is unchanged (namespace, launch_type, sandbox_template) to avoid
    breaking existing consumers. It now fires for all Sandboxes.

New metric:

  • agent_sandbox_ready_latency_ms — histogram measuring latency from
    controller first observed Sandbox to Ready state. Uses a high-precision
    annotation (sandbox-controller-first-observed-at, nanosecond-level)
    rather than CreationTimestamp (second-level precision), partially
    addressing [Bug] agent_sandbox_creation_latency_ms lacks millisecond precision due to use of CreationTimestamp #781. Labels: namespace, launch_type,
    sandbox_template, owned_by. Users who need per-owner latency
    breakdown should prefer this metric over creation_latency_ms.

A separate _total counter is intentionally omitted because the
histogram's built-in _count provides equivalent creation counts.

Annotations:

The controller stamps two annotations on the Sandbox:

  • agents.x-k8s.io/sandbox-controller-first-observed-at — stamped on
    first reconcile, used as the start time for the ready latency metric.
  • agents.x-k8s.io/sandbox-first-ready-at — stamped when the Sandbox
    first reaches Ready. Prevents duplicate metric recording on re-Ready
    events (e.g. readiness probe flaps) and provides operators with a
    precise timestamp of when the Sandbox first became operational.

Other changes:

  • recordSandboxCreationLatency is removed from SandboxClaimReconciler
    since the core controller now handles it for all Sandboxes.

AI disclosure: This PR was written in part with the assistance of
generative AI.

Which issue(s) this PR is related to:

Fixes #686
Ref #781

Release Note

Added `agent_sandbox_ready_latency_ms` histogram to the core Sandbox
controller, providing high-precision lifecycle metrics for all Sandboxes
including directly-created ones. Moved `agent_sandbox_creation_latency_ms`
recording from the SandboxClaim controller to the core Sandbox controller
so it fires for all creation paths. No breaking changes to existing metrics.

<!-- This is an auto-generated comment: release notes by coderabbit.ai -->

## Summary by CodeRabbit

* **New Features**
  * Added sandbox observability tracking and a new readiness-latency metric.
  * Sandboxes now get first-observed and first-ready timestamps recorded automatically.
  * Metrics now include ownership-based labeling for better lifecycle insights.

* **Bug Fixes**
  * Prevented duplicate readiness and creation-latency metric reporting across repeated reconciles.
  * Improved handling when observability timestamps are missing or malformed.
  * Ensured readiness metrics are only recorded on the first successful Ready transition.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
@kubernetes-prow kubernetes-prow Bot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jul 1, 2026
@github-project-automation github-project-automation Bot moved this to Backlog in Agent Sandbox Jul 1, 2026
@netlify

netlify Bot commented Jul 1, 2026

Copy link
Copy Markdown

Deploy Preview for agent-sandbox canceled.

Name Link
🔨 Latest commit 1372af4
🔍 Latest deploy log https://app.netlify.com/projects/agent-sandbox/deploys/6a4622ce313fa80008c754e6
@kubernetes-prow

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: aegeiger
Once this PR has been reviewed and has the lgtm label, please assign soltysh for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubernetes-prow kubernetes-prow Bot requested review from janetkuo and justinsb July 1, 2026 23:05
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jul 1, 2026

Copy link
Copy Markdown

CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: aegeiger / name: Eitan Geiger (a1eb5e5)

@kubernetes-prow

Copy link
Copy Markdown

Welcome @aegeiger!

It looks like this is your first PR to kubernetes-sigs/agent-sandbox 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/agent-sandbox has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@kubernetes-prow kubernetes-prow Bot added needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 1, 2026
@kubernetes-prow

Copy link
Copy Markdown

Hi @aegeiger. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kubernetes-prow kubernetes-prow Bot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jul 1, 2026
@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Copy link
Copy Markdown

Review Change Stack

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (3)
  • needs-ok-to-test
  • do-not-merge/work-in-progress
  • cncf-cla: no

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: dd4c2799-f79a-4e2d-bbaf-03b06880a589

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds Sandbox controller annotation stamping (trace context, observability timestamp) and first-Ready lifecycle metric recording (creation and ready latency) with ownership labeling, introduces a SandboxReadyLatency Prometheus histogram, and refactors SandboxClaim controller's creation-latency recording into a Ready-transition-gated method.

Changes

Sandbox lifecycle metrics

Layer / File(s) Summary
Metric constants, histogram, and record function
internal/metrics/metrics.go, internal/metrics/metrics_test.go
Adds SandboxObservabilityAnnotation, SandboxFirstReadyAnnotation, owned-by constants, SandboxReadyLatency histogram (registered in init), RecordSandboxReadyLatency, and a test verifying single-sample recording across label combinations.
Annotation stamping during reconcile
controllers/sandbox_controller.go
Reconcile patches missing trace-context and observability annotations on non-deleting Sandboxes and invokes recordSandboxCreationMetrics after a successful status update.
recordSandboxCreationMetrics and resolveOwnedBy
controllers/sandbox_controller.go
New method computes and records creation/ready latencies on first Ready transition, guards against duplicate recording via SandboxFirstReadyAnnotation, and a new helper maps owner references to ownership labels.
Sandbox controller tests
controllers/sandbox_controller_test.go
Adds testutil/extensionsv1beta1 imports and tests for creation/ready metrics, already-ready suppression, observability annotation stamping, resolveOwnedBy, and first-ready annotation stamping.
SandboxClaim controller latency refactor
extensions/controllers/sandboxclaim_controller.go
Replaces the old latency helper with recordCreationLatencyMetric, gated on new Ready transitions, draining observed-time entries, and removing the prior sandbox creation latency call.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
    participant Reconciler as SandboxReconciler
    participant Sandbox as Sandbox object
    participant Metrics as internal/metrics

    Reconciler->>Sandbox: Check TraceContextAnnotation, SandboxObservabilityAnnotation
    alt annotations missing
        Reconciler->>Sandbox: Patch trace/observability annotations
    end
    Reconciler->>Sandbox: Update status
    alt status update succeeded and not deleting
        Reconciler->>Reconciler: recordSandboxCreationMetrics()
        alt Ready condition newly True and SandboxFirstReadyAnnotation absent
            Reconciler->>Sandbox: resolveOwnedBy(ownerRef)
            Reconciler->>Metrics: RecordSandboxCreationLatency(...)
            Reconciler->>Metrics: RecordSandboxReadyLatency(...)
            Reconciler->>Sandbox: Patch SandboxFirstReadyAnnotation
        end
    end
Loading

Possibly related issues

Suggested labels: ok-to-test

Suggested reviewers: janetkuo, justinsb

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Linked Issues check ⚠️ Warning The PR covers lifecycle latency and readiness metrics, but it misses requested direct creation and terminal counters, so #686 is not fully satisfied. Add the remaining direct-path metrics from #686, especially direct creation_total and terminal outcome tracking, or narrow the issue scope.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed Title is concise and accurately summarizes the main change: adding Prometheus lifecycle metrics for direct Sandbox creation.
Out of Scope Changes check ✅ Passed All changes relate to Sandbox lifecycle metrics, annotations, controller logic, or tests; no unrelated scope is apparent.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Description check ✅ Passed The PR description follows the required template and clearly covers purpose, linked issues, and a release note.
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@kubernetes-prow kubernetes-prow Bot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Jul 1, 2026
Add sandbox-level lifecycle metrics to the core SandboxReconciler so
that directly-created Sandboxes (without a SandboxClaim) get the same
observability as claim-backed ones.

Two metrics are now recorded by the core controller on the first
Ready=True transition:

- agent_sandbox_creation_latency_ms (existing, moved from claim
  controller): measures Sandbox creation to Pod Ready latency. The
  label set is unchanged (namespace, launch_type, sandbox_template) to
  avoid breaking existing consumers. It now fires for all Sandboxes
  regardless of creation path.

- agent_sandbox_ready_latency_ms (new histogram): measures latency
  from controller first observed Sandbox to Ready state using a
  high-precision annotation (nanosecond-level), partially addressing
  the millisecond precision gap in CreationTimestamp. Carries
  launch_type, sandbox_template, and owned_by labels. Users who need
  per-owner latency breakdown should prefer this metric.

A separate _total counter is intentionally omitted because the
histogram _count provides equivalent creation counts.

The controller stamps two annotations on the Sandbox:

- sandbox-controller-first-observed-at: stamped on first reconcile,
  used as start time for the ready latency metric.
- sandbox-first-ready-at: stamped when the Sandbox first reaches
  Ready. Prevents duplicate metric recording on re-Ready events and
  provides operators with a precise first-ready timestamp.

The recordSandboxCreationLatency method is removed from the
SandboxClaimReconciler since the core controller now handles it.

Signed-off-by: E Geiger <egeiger@redhat.com>
@aegeiger aegeiger force-pushed the feature/controller-metrics-for-sandbox-ready branch from a1eb5e5 to 99b9940 Compare July 1, 2026 23:15
@kubernetes-prow kubernetes-prow Bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jul 1, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
controllers/sandbox_controller_test.go (1)

3541-3569: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Assert ready-latency count after the first reconcile too.

The first-reconcile assertion (Line 3553) only checks SandboxCreationLatency count, not SandboxReadyLatency. Since this test specifically exercises the "already ready" duplicate-recording guard for both new metrics, asserting testutil.CollectAndCount(asmetrics.SandboxReadyLatency) equals 1 right after the first reconcile would close a gap in verifying the ready-latency recording path for this scenario.

✅ Suggested addition
 	_, err := r.Reconcile(t.Context(), req)
 	require.NoError(t, err)
 	assert.Equal(t, 1, testutil.CollectAndCount(asmetrics.SandboxCreationLatency), "first reconcile should record creation latency")
+	assert.Equal(t, 1, testutil.CollectAndCount(asmetrics.SandboxReadyLatency), "first reconcile should record ready latency")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@controllers/sandbox_controller_test.go` around lines 3541 - 3569, The first
reconcile in SandboxReconciler.Reconcile only verifies SandboxCreationLatency,
leaving the SandboxReadyLatency recording path untested for the initial Ready
transition. Update the test to assert that
testutil.CollectAndCount(asmetrics.SandboxReadyLatency) is 1 immediately after
the first Reconcile call, alongside the existing creation-latency check, so both
first-ready metrics are covered before resetting them for the
duplicate-reconcile guard.
internal/metrics/metrics.go (1)

88-110: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Duplicate bucket literal.

The Buckets slice for SandboxReadyLatency is an exact copy of SandboxCreationLatency's. Extracting a shared var avoids drift if bucket boundaries are tuned later.

♻️ Proposed refactor
+// standardLatencyBuckets are used for millisecond-scale lifecycle latency histograms
+// (50ms to 10 minutes).
+var standardLatencyBuckets = []float64{50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 120000, 240000, 300000, 600000}
+
 SandboxCreationLatency = prometheus.NewHistogramVec(
 	prometheus.HistogramOpts{
 		Name: "agent_sandbox_creation_latency_ms",
 		Help: "Latency from Sandbox creation to Pod Ready state in milliseconds.",
-		// Buckets for latency from 50ms to 10 minutes
-		Buckets: []float64{50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 120000, 240000, 300000, 600000},
+		Buckets: standardLatencyBuckets,
 	},
 	[]string{"namespace", "launch_type", "sandbox_template"},
 )

 SandboxReadyLatency = prometheus.NewHistogramVec(
 	prometheus.HistogramOpts{
 		Name: "agent_sandbox_ready_latency_ms",
 		Help: "Latency from controller first observed Sandbox to Ready state in milliseconds.",
-		// Buckets for latency from 50ms to 10 minutes
-		Buckets: []float64{50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 120000, 240000, 300000, 600000},
+		Buckets: standardLatencyBuckets,
 	},
 	[]string{"namespace", "launch_type", "sandbox_template", "owned_by"},
 )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@internal/metrics/metrics.go` around lines 88 - 110, The `SandboxReadyLatency`
histogram in `metrics.go` duplicates the same bucket slice used by
`SandboxCreationLatency`; extract the shared bucket boundaries into a single
package-level variable and have both histogram definitions reference it. Update
the `SandboxCreationLatency` and `SandboxReadyLatency` declarations to use the
shared bucket constant so future bucket changes stay consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/sandbox_controller.go`:
- Around line 232-233: The first-ready handling in sandbox reconciliation is
marking creation metrics as done before the durable idempotency marker is safely
written. In the code around recordSandboxCreationMetrics and the
SandboxFirstReadyAnnotation stamping path, make sure the annotation is attempted
before treating the event as recorded, and do not return early after emitting
creation latency if the marker still needs to be set. Also stop swallowing stamp
failures: propagate them with context so reconcile retries remain idempotent and
safe.

---

Nitpick comments:
In `@controllers/sandbox_controller_test.go`:
- Around line 3541-3569: The first reconcile in SandboxReconciler.Reconcile only
verifies SandboxCreationLatency, leaving the SandboxReadyLatency recording path
untested for the initial Ready transition. Update the test to assert that
testutil.CollectAndCount(asmetrics.SandboxReadyLatency) is 1 immediately after
the first Reconcile call, alongside the existing creation-latency check, so both
first-ready metrics are covered before resetting them for the
duplicate-reconcile guard.

In `@internal/metrics/metrics.go`:
- Around line 88-110: The `SandboxReadyLatency` histogram in `metrics.go`
duplicates the same bucket slice used by `SandboxCreationLatency`; extract the
shared bucket boundaries into a single package-level variable and have both
histogram definitions reference it. Update the `SandboxCreationLatency` and
`SandboxReadyLatency` declarations to use the shared bucket constant so future
bucket changes stay consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7e1f821b-d23f-467a-9185-ba13c758c054

📥 Commits

Reviewing files that changed from the base of the PR and between da3b568 and a1eb5e5.

📒 Files selected for processing (5)
  • controllers/sandbox_controller.go
  • controllers/sandbox_controller_test.go
  • extensions/controllers/sandboxclaim_controller.go
  • internal/metrics/metrics.go
  • internal/metrics/metrics_test.go
💤 Files with no reviewable changes (1)
  • extensions/controllers/sandboxclaim_controller.go
Comment thread controllers/sandbox_controller.go Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends core SandboxReconciler observability so that lifecycle latency telemetry is emitted for all Sandboxes (including those created directly, without a SandboxClaim), by moving agent_sandbox_creation_latency_ms into the core controller and adding a new agent_sandbox_ready_latency_ms histogram based on controller-stamped timestamps.

Changes:

  • Added new core histogram agent_sandbox_ready_latency_ms (labels: namespace, launch_type, sandbox_template, owned_by) and registered it in the metrics registry.
  • Stamped Sandbox-level observability annotations from the core controller (sandbox-controller-first-observed-at, sandbox-first-ready-at) and recorded creation/ready latency metrics on the first Ready transition.
  • Removed creation-latency recording from the SandboxClaimReconciler since it is now emitted from the core controller.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
internal/metrics/metrics.go Adds new Sandbox-ready histogram + ownership label constants; registers metric; adds record helper.
internal/metrics/metrics_test.go Adds unit test coverage for the new ready-latency recorder helper.
extensions/controllers/sandboxclaim_controller.go Removes now-redundant Sandbox creation-latency recording from the claim controller.
controllers/sandbox_controller.go Stamps new Sandbox observability annotations and records lifecycle metrics from the core reconciler.
controllers/sandbox_controller_test.go Adds reconciler tests around stamping + metric emission behavior for direct/owned Sandboxes.
Comment thread controllers/sandbox_controller.go
Comment thread controllers/sandbox_controller_test.go
Propagate annotation patch errors from recordSandboxCreationMetrics so
that reconcile retries if the first-ready annotation fails to persist.
The retry is safe because the status already has Ready=True persisted,
so the oldReady guard prevents duplicate metric recording.

Fix early return on malformed observability annotation that skipped
stamping the first-ready annotation. Parse errors now skip only the
ready latency metric while still proceeding to stamp the annotation.

Extract shared sandbox latency histogram buckets into a package-level
variable (sandboxLatencyBuckets) so SandboxCreationLatency and
SandboxReadyLatency stay consistent on future bucket changes.

Assert SandboxReadyLatency observation count after the first reconcile
in TestRecordSandboxCreationMetrics_AlreadyReady, alongside the
existing SandboxCreationLatency assertion, so both first-ready metrics
are covered before the duplicate-reconcile guard test.

Signed-off-by: E Geiger <egeiger@redhat.com>
The expectedOwnedBy field in TestRecordSandboxCreationMetrics was
populated in every table case but never asserted, left over from the
removed SandboxCreationTotal counter. Use CurryWith to verify that
SandboxReadyLatency records exactly one observation with the expected
owned_by label value in each case, covering the full reconcile path
from resolveOwnedBy through metric recording.

Signed-off-by: E Geiger <egeiger@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. ready-for-review size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

3 participants