metric: add Prometheus lifecycle metrics for direct Sandbox creation#1076
metric: add Prometheus lifecycle metrics for direct Sandbox creation#1076aegeiger wants to merge 3 commits into
Conversation
✅ Deploy Preview for agent-sandbox canceled.
|
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aegeiger The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
|
|
Welcome @aegeiger! |
|
Hi @aegeiger. Thanks for your PR. I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with Regular contributors should join the org to skip this step. Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
Important Review skippedAuto reviews are limited based on label configuration. 🚫 Review skipped — only excluded labels are configured. (3)
Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds Sandbox controller annotation stamping (trace context, observability timestamp) and first-Ready lifecycle metric recording (creation and ready latency) with ownership labeling, introduces a SandboxReadyLatency Prometheus histogram, and refactors SandboxClaim controller's creation-latency recording into a Ready-transition-gated method. ChangesSandbox lifecycle metrics
Estimated code review effort: 4 (Complex) | ~60 minutes Sequence Diagram(s)sequenceDiagram
participant Reconciler as SandboxReconciler
participant Sandbox as Sandbox object
participant Metrics as internal/metrics
Reconciler->>Sandbox: Check TraceContextAnnotation, SandboxObservabilityAnnotation
alt annotations missing
Reconciler->>Sandbox: Patch trace/observability annotations
end
Reconciler->>Sandbox: Update status
alt status update succeeded and not deleting
Reconciler->>Reconciler: recordSandboxCreationMetrics()
alt Ready condition newly True and SandboxFirstReadyAnnotation absent
Reconciler->>Sandbox: resolveOwnedBy(ownerRef)
Reconciler->>Metrics: RecordSandboxCreationLatency(...)
Reconciler->>Metrics: RecordSandboxReadyLatency(...)
Reconciler->>Sandbox: Patch SandboxFirstReadyAnnotation
end
end
Possibly related issues
Suggested labels: Suggested reviewers: 🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Add sandbox-level lifecycle metrics to the core SandboxReconciler so that directly-created Sandboxes (without a SandboxClaim) get the same observability as claim-backed ones. Two metrics are now recorded by the core controller on the first Ready=True transition: - agent_sandbox_creation_latency_ms (existing, moved from claim controller): measures Sandbox creation to Pod Ready latency. The label set is unchanged (namespace, launch_type, sandbox_template) to avoid breaking existing consumers. It now fires for all Sandboxes regardless of creation path. - agent_sandbox_ready_latency_ms (new histogram): measures latency from controller first observed Sandbox to Ready state using a high-precision annotation (nanosecond-level), partially addressing the millisecond precision gap in CreationTimestamp. Carries launch_type, sandbox_template, and owned_by labels. Users who need per-owner latency breakdown should prefer this metric. A separate _total counter is intentionally omitted because the histogram _count provides equivalent creation counts. The controller stamps two annotations on the Sandbox: - sandbox-controller-first-observed-at: stamped on first reconcile, used as start time for the ready latency metric. - sandbox-first-ready-at: stamped when the Sandbox first reaches Ready. Prevents duplicate metric recording on re-Ready events and provides operators with a precise first-ready timestamp. The recordSandboxCreationLatency method is removed from the SandboxClaimReconciler since the core controller now handles it. Signed-off-by: E Geiger <egeiger@redhat.com>
a1eb5e5 to
99b9940
Compare
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (2)
controllers/sandbox_controller_test.go (1)
3541-3569: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAssert ready-latency count after the first reconcile too.
The first-reconcile assertion (Line 3553) only checks
SandboxCreationLatencycount, notSandboxReadyLatency. Since this test specifically exercises the "already ready" duplicate-recording guard for both new metrics, assertingtestutil.CollectAndCount(asmetrics.SandboxReadyLatency)equals 1 right after the first reconcile would close a gap in verifying the ready-latency recording path for this scenario.✅ Suggested addition
_, err := r.Reconcile(t.Context(), req) require.NoError(t, err) assert.Equal(t, 1, testutil.CollectAndCount(asmetrics.SandboxCreationLatency), "first reconcile should record creation latency") + assert.Equal(t, 1, testutil.CollectAndCount(asmetrics.SandboxReadyLatency), "first reconcile should record ready latency")🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@controllers/sandbox_controller_test.go` around lines 3541 - 3569, The first reconcile in SandboxReconciler.Reconcile only verifies SandboxCreationLatency, leaving the SandboxReadyLatency recording path untested for the initial Ready transition. Update the test to assert that testutil.CollectAndCount(asmetrics.SandboxReadyLatency) is 1 immediately after the first Reconcile call, alongside the existing creation-latency check, so both first-ready metrics are covered before resetting them for the duplicate-reconcile guard.internal/metrics/metrics.go (1)
88-110: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueDuplicate bucket literal.
The
Bucketsslice forSandboxReadyLatencyis an exact copy ofSandboxCreationLatency's. Extracting a shared var avoids drift if bucket boundaries are tuned later.♻️ Proposed refactor
+// standardLatencyBuckets are used for millisecond-scale lifecycle latency histograms +// (50ms to 10 minutes). +var standardLatencyBuckets = []float64{50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 120000, 240000, 300000, 600000} + SandboxCreationLatency = prometheus.NewHistogramVec( prometheus.HistogramOpts{ Name: "agent_sandbox_creation_latency_ms", Help: "Latency from Sandbox creation to Pod Ready state in milliseconds.", - // Buckets for latency from 50ms to 10 minutes - Buckets: []float64{50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 120000, 240000, 300000, 600000}, + Buckets: standardLatencyBuckets, }, []string{"namespace", "launch_type", "sandbox_template"}, ) SandboxReadyLatency = prometheus.NewHistogramVec( prometheus.HistogramOpts{ Name: "agent_sandbox_ready_latency_ms", Help: "Latency from controller first observed Sandbox to Ready state in milliseconds.", - // Buckets for latency from 50ms to 10 minutes - Buckets: []float64{50, 100, 250, 500, 1000, 2500, 5000, 10000, 30000, 60000, 120000, 240000, 300000, 600000}, + Buckets: standardLatencyBuckets, }, []string{"namespace", "launch_type", "sandbox_template", "owned_by"}, )🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@internal/metrics/metrics.go` around lines 88 - 110, The `SandboxReadyLatency` histogram in `metrics.go` duplicates the same bucket slice used by `SandboxCreationLatency`; extract the shared bucket boundaries into a single package-level variable and have both histogram definitions reference it. Update the `SandboxCreationLatency` and `SandboxReadyLatency` declarations to use the shared bucket constant so future bucket changes stay consistent.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@controllers/sandbox_controller.go`:
- Around line 232-233: The first-ready handling in sandbox reconciliation is
marking creation metrics as done before the durable idempotency marker is safely
written. In the code around recordSandboxCreationMetrics and the
SandboxFirstReadyAnnotation stamping path, make sure the annotation is attempted
before treating the event as recorded, and do not return early after emitting
creation latency if the marker still needs to be set. Also stop swallowing stamp
failures: propagate them with context so reconcile retries remain idempotent and
safe.
---
Nitpick comments:
In `@controllers/sandbox_controller_test.go`:
- Around line 3541-3569: The first reconcile in SandboxReconciler.Reconcile only
verifies SandboxCreationLatency, leaving the SandboxReadyLatency recording path
untested for the initial Ready transition. Update the test to assert that
testutil.CollectAndCount(asmetrics.SandboxReadyLatency) is 1 immediately after
the first Reconcile call, alongside the existing creation-latency check, so both
first-ready metrics are covered before resetting them for the
duplicate-reconcile guard.
In `@internal/metrics/metrics.go`:
- Around line 88-110: The `SandboxReadyLatency` histogram in `metrics.go`
duplicates the same bucket slice used by `SandboxCreationLatency`; extract the
shared bucket boundaries into a single package-level variable and have both
histogram definitions reference it. Update the `SandboxCreationLatency` and
`SandboxReadyLatency` declarations to use the shared bucket constant so future
bucket changes stay consistent.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7e1f821b-d23f-467a-9185-ba13c758c054
📒 Files selected for processing (5)
controllers/sandbox_controller.gocontrollers/sandbox_controller_test.goextensions/controllers/sandboxclaim_controller.gointernal/metrics/metrics.gointernal/metrics/metrics_test.go
💤 Files with no reviewable changes (1)
- extensions/controllers/sandboxclaim_controller.go
There was a problem hiding this comment.
Pull request overview
This PR extends core SandboxReconciler observability so that lifecycle latency telemetry is emitted for all Sandboxes (including those created directly, without a SandboxClaim), by moving agent_sandbox_creation_latency_ms into the core controller and adding a new agent_sandbox_ready_latency_ms histogram based on controller-stamped timestamps.
Changes:
- Added new core histogram
agent_sandbox_ready_latency_ms(labels:namespace,launch_type,sandbox_template,owned_by) and registered it in the metrics registry. - Stamped Sandbox-level observability annotations from the core controller (
sandbox-controller-first-observed-at,sandbox-first-ready-at) and recorded creation/ready latency metrics on the first Ready transition. - Removed creation-latency recording from the
SandboxClaimReconcilersince it is now emitted from the core controller.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| internal/metrics/metrics.go | Adds new Sandbox-ready histogram + ownership label constants; registers metric; adds record helper. |
| internal/metrics/metrics_test.go | Adds unit test coverage for the new ready-latency recorder helper. |
| extensions/controllers/sandboxclaim_controller.go | Removes now-redundant Sandbox creation-latency recording from the claim controller. |
| controllers/sandbox_controller.go | Stamps new Sandbox observability annotations and records lifecycle metrics from the core reconciler. |
| controllers/sandbox_controller_test.go | Adds reconciler tests around stamping + metric emission behavior for direct/owned Sandboxes. |
Propagate annotation patch errors from recordSandboxCreationMetrics so that reconcile retries if the first-ready annotation fails to persist. The retry is safe because the status already has Ready=True persisted, so the oldReady guard prevents duplicate metric recording. Fix early return on malformed observability annotation that skipped stamping the first-ready annotation. Parse errors now skip only the ready latency metric while still proceeding to stamp the annotation. Extract shared sandbox latency histogram buckets into a package-level variable (sandboxLatencyBuckets) so SandboxCreationLatency and SandboxReadyLatency stay consistent on future bucket changes. Assert SandboxReadyLatency observation count after the first reconcile in TestRecordSandboxCreationMetrics_AlreadyReady, alongside the existing SandboxCreationLatency assertion, so both first-ready metrics are covered before the duplicate-reconcile guard test. Signed-off-by: E Geiger <egeiger@redhat.com>
The expectedOwnedBy field in TestRecordSandboxCreationMetrics was populated in every table case but never asserted, left over from the removed SandboxCreationTotal counter. Use CurryWith to verify that SandboxReadyLatency records exactly one observation with the expected owned_by label value in each case, covering the full reconcile path from resolveOwnedBy through metric recording. Signed-off-by: E Geiger <egeiger@redhat.com>
What this PR does / why we need it:
Sandbox lifecycle metrics (
agent_sandbox_creation_latency_ms, claimstartup latency, etc.) were recorded exclusively by the SandboxClaim
extension controller. Sandboxes created directly — without a
SandboxClaim — only appeared in the point-in-time
agent_sandboxesgauge with no creation latency or ready latency telemetry.
This PR moves
agent_sandbox_creation_latency_msrecording into thecore
SandboxReconcilerand adds a newagent_sandbox_ready_latency_mshistogram so that all Sandboxes get lifecycle telemetry regardless of
creation path.
Existing metric (moved):
agent_sandbox_creation_latency_ms— now recorded by the coreSandboxReconcilerinstead ofSandboxClaimReconciler. The label setis unchanged (
namespace,launch_type,sandbox_template) to avoidbreaking existing consumers. It now fires for all Sandboxes.
New metric:
agent_sandbox_ready_latency_ms— histogram measuring latency fromcontroller first observed Sandbox to Ready state. Uses a high-precision
annotation (
sandbox-controller-first-observed-at, nanosecond-level)rather than
CreationTimestamp(second-level precision), partiallyaddressing [Bug] agent_sandbox_creation_latency_ms lacks millisecond precision due to use of CreationTimestamp #781. Labels:
namespace,launch_type,sandbox_template,owned_by. Users who need per-owner latencybreakdown should prefer this metric over
creation_latency_ms.A separate
_totalcounter is intentionally omitted because thehistogram's built-in
_countprovides equivalent creation counts.Annotations:
The controller stamps two annotations on the Sandbox:
agents.x-k8s.io/sandbox-controller-first-observed-at— stamped onfirst reconcile, used as the start time for the ready latency metric.
agents.x-k8s.io/sandbox-first-ready-at— stamped when the Sandboxfirst reaches Ready. Prevents duplicate metric recording on re-Ready
events (e.g. readiness probe flaps) and provides operators with a
precise timestamp of when the Sandbox first became operational.
Other changes:
recordSandboxCreationLatencyis removed fromSandboxClaimReconcilersince the core controller now handles it for all Sandboxes.
AI disclosure: This PR was written in part with the assistance of
generative AI.
Which issue(s) this PR is related to:
Fixes #686
Ref #781
Release Note