feat: Initial implementation of power metrics in aiperf. by FrankD412 · Pull Request #803 · ai-dynamo/aiperf

FrankD412 · 2026-04-04T06:16:22Z

This PR adds some basic power metrics leveraging GPU telemetry to compute the amount of power used during the benchmark.

Summary by CodeRabbit

New Features
- GPU efficiency metrics: total power, total energy, and output-tokens-per-joule.
- On-demand GPU telemetry collection entrypoint.
- Baseline GPU metric capture during telemetry setup.
Tests
- Updated unit tests to exercise the public collection entrypoint and support new telemetry behaviors.
Chores
- Expanded test logging mocks to include informational logs.

copy-pr-bot · 2026-04-04T06:16:27Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-04-04T06:16:31Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8a9d8aa680fea00293f8856c280c212c2c70cc80

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8a9d8aa680fea00293f8856c280c212c2c70cc80

Last updated for commit: 8a9d8aa • Browse code

coderabbitai · 2026-04-04T06:21:38Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

Walkthrough

Adds GPU efficiency metric computation and an explicit collector entrypoint: a new accumulator method computes total GPU power, energy, and output-tokens-per-joule; collector collection method made public for on-demand baseline captures; records and manager invoke these flows.

Changes

Cohort / File(s)	Summary
Efficiency computation `src/aiperf/gpu_telemetry/accumulator.py`	Added imports for `EnergyMetricUnit`/`PowerMetricUnit` and new `GPUTelemetryAccumulator.compute_efficiency_metrics(metric_results, time_filter)` which aggregates per-GPU power and energy across DCGM endpoints and conditionally emits `total_gpu_power`, `total_gpu_energy`, and `output_tokens_per_joule`.
Protocols `src/aiperf/gpu_telemetry/protocols.py`	Added `async def collect_and_process_metrics(self) -> None` to `GPUTelemetryCollectorProtocol` and `def compute_efficiency_metrics(self, metric_results: list[MetricResult], time_filter: TimeRangeFilter) -> list[MetricResult]` to `GPUTelemetryAccumulatorProtocol`; added `TimeRangeFilter` import under TYPE_CHECKING.
Records integration `src/aiperf/records/records_manager.py`	When GPU accumulator exists, builds a `TimeRangeFilter` from phase stats and calls `compute_efficiency_metrics(...)`, extending `records_results` with returned efficiency metrics before publishing.
Manager baseline capture `src/aiperf/gpu_telemetry/manager.py`	After creating a PyNVML collector, performs a baseline capture via `collector.initialize()` and `collector.collect_and_process_metrics()`, logging success at info and warning on failure; DCGM endpoint baseline logs changed from debug→info.
Collector API rename `src/aiperf/gpu_telemetry/pynvml_collector.py`	Renamed private `_collect_and_process_metrics` to public `collect_and_process_metrics` and updated its invocation in the background loop.
Tests updated `tests/unit/gpu_telemetry/test_pynvml_collector.py`, `tests/unit/gpu_telemetry/test_telemetry_manager.py`	Tests updated to call the public `collect_and_process_metrics()` entrypoint and to mock `manager.info` in telemetry manager tests.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I hopped through logs at break of dawn,

watts and joules I proudly fawn,
tokens counted, energy measured,
metrics stitched and neatly treasured,
a tiny rabbit cheers the run 🌱⚡

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately reflects the main change: adding initial power metrics functionality to aiperf via GPU telemetry integration.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

codecov · 2026-04-04T06:23:37Z

Codecov Report

❌ Patch coverage is 89.07563% with 13 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/aiperf/gpu_telemetry/accumulator.py	85.71%	7 Missing and 2 partials ⚠️
src/aiperf/gpu_telemetry/pynvml_collector.py	33.33%	2 Missing ⚠️
src/aiperf/records/records_manager.py	83.33%	1 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

FrankD412 · 2026-04-06T21:14:32Z

This PR addresses: https://linear.app/nvidia/issue/AIP-838/implement-an-energytoken-benchmark-estimation

FrankD412 · 2026-04-13T23:49:52Z

After some local testing, some of the metrics show up but not all. I need to figure out how to run a metric calculation for energy/token and have it show up.

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/superpowers/plans/2026-04-13-monitor.md`:
- Around line 13-28: Replace the ASCII table under "File Map" with a mermaid
diagram block (```mermaid ... ```) that groups the created source files, the
modified source file, and the created test files as subgraphs; include nodes for
monitor/__init__.py (Package marker), monitor/models.py (GpuSnapshot model),
monitor/gpu.py (query_gpus), monitor/api.py (create_app), monitor/poller.py
(CsvPoller), cli_commands/monitor.py (CLI subcommand), the modified cli.py
(Register monitor command), and the three test files
tests/unit/monitor/__init__.py, test_gpu.py, test_api.py, and test_poller.py,
following the suggested structure so the diagram replaces the table exactly.

In `@src/aiperf/gpu_telemetry/manager.py`:
- Around line 212-215: The broad exception handler in manager.py (the "except
Exception as e" that calls self.warning(f"GPU Telemetry: Failed to capture
baseline from pynvml: {e}")) is missing the BLE001 suppression comment; add "#
noqa: BLE001" to that except line to match other handlers and include the same
short explanatory comment used at lines 240/334/428, so the except Exception as
e line becomes explicitly marked as intentionally broad while keeping the
existing self.warning call intact.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cd0f4da4-9c3d-4a42-8ace-c06012576f1b

📥 Commits

Reviewing files that changed from the base of the PR and between 946bb59 and deaf3a6.

📒 Files selected for processing (9)

docs/superpowers/plans/2026-04-13-monitor.md
docs/superpowers/specs/2026-04-13-monitor-design.md
src/aiperf/gpu_telemetry/accumulator.py
src/aiperf/gpu_telemetry/manager.py
src/aiperf/gpu_telemetry/protocols.py
src/aiperf/gpu_telemetry/pynvml_collector.py
src/aiperf/records/records_manager.py
tests/unit/gpu_telemetry/test_pynvml_collector.py
tests/unit/gpu_telemetry/test_telemetry_manager.py

✅ Files skipped from review due to trivial changes (1)

docs/superpowers/specs/2026-04-13-monitor-design.md

🚧 Files skipped from review as they are similar to previous changes (3)

src/aiperf/records/records_manager.py
src/aiperf/gpu_telemetry/protocols.py
src/aiperf/gpu_telemetry/accumulator.py

FrankD412 · 2026-04-16T03:02:55Z

Ok, I was concerned when aiperf was reporting different numbers than the reference implementation in trtllm-bench, but I ended up just wrapping both calls in a simple script to measure. I expected it to over-estimate slightly in both cases, but it looks like the math works out. I'm really surprised that the trtllm-bench case is quite different under a concurrency of 1024 with a ISL/OSL of 1k/1k. Here's what Llama-3.1-7B on 8xH200 looks like with a simple config:

CSV Export: /home/scratch.fdinatale_coreai/code/aiperf/h200/artifacts/meta-llama_Llama-3.1-70B-Instruct-openai-completions-concurrency1024-request_rateinf/profile_export_aiperf.csv
JSON Export: /home/scratch.fdinatale_coreai/code/aiperf/h200/artifacts/meta-llama_Llama-3.1-70B-Instruct-openai-completions-concurrency1024-request_rateinf/profile_export_aiperf.json
Log File: /home/scratch.fdinatale_coreai/code/aiperf/h200/artifacts/meta-llama_Llama-3.1-70B-Instruct-openai-completions-concurrency1024-request_rateinf/logs/aiperf.log


GPU 0 (NVIDIA H200): 143653.77 J
GPU 1 (NVIDIA H200): 147784.43 J
GPU 2 (NVIDIA H200): 149026.62 J
GPU 3 (NVIDIA H200): 148156.23 J
GPU 4 (NVIDIA H200): 145397.66 J
GPU 5 (NVIDIA H200): 148267.18 J
GPU 6 (NVIDIA H200): 153568.17 J
GPU 7 (NVIDIA H200): 149799.60 J
Total energy: 1185653.65 J

Energy reported by aiperf

  "osl_mismatch_count": {
    "unit": "requests",
    "avg": 1421.0
  },
  "total_token_throughput": {
    "unit": "tokens/sec",
    "avg": 27921.504814287888
  },
  "total_gpu_power": {
    "unit": "W",
    "avg": 4724.019916949152
  },
  "total_gpu_energy": {
    "unit": "J",
    "avg": 1067107.4119999986
  },
  "output_token_throughput_per_watt": {
    "unit": "tps/W",
    "avg": 3.027279746105967

`trtllm-bench`

-- Energy Metrics --------------------------------------

Total Energy (J):                                 1647268.6290
Output Tokens per Second per Watt (tps/W):         1.8649
Average GPU Power (W):                            666.9243

-- Request Latency Breakdown (ms) -----------------------

Validation output:

GPU 0 (NVIDIA H200): 214223.76 J
GPU 1 (NVIDIA H200): 217174.15 J
GPU 2 (NVIDIA H200): 218900.36 J
GPU 3 (NVIDIA H200): 218245.88 J
GPU 4 (NVIDIA H200): 215508.04 J
GPU 5 (NVIDIA H200): 217891.44 J
GPU 6 (NVIDIA H200): 222311.65 J
GPU 7 (NVIDIA H200): 219760.09 J
Total energy: 1744015.37 J

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiperf/gpu_telemetry/accumulator.py`:
- Around line 337-368: The current per-GPU metric extraction only catches
NoMetricValue in the blocks calling gpu_data.get_metric_result for
"gpu_power_usage" and "energy_consumption" (power_result / energy_result),
allowing any other exception to bubble up and abort processing; update those
try/except blocks to also catch generic Exception as e (after NoMetricValue) and
log the unexpected error (include the gpu_uuid and exception text) via
self.debug or self.error so the loop continues to the next GPU and overall
efficiency metric aggregation isn't aborted.
- Around line 370-372: The early return when power_gpu_count == 0 in the
accumulator (around power_gpu_count check) is dropping metrics that don't
require instantaneous power (e.g., total_gpu_energy and
output_tokens_per_joule); change the logic in the method containing that check
(look for power_gpu_count, total_gpu_energy, output_tokens_per_joule) so you no
longer return [] when power data is missing—instead compute and include metrics
that can be derived from available energy data and only skip or set to None
those metrics that specifically require gpu_power_usage/instantaneous power;
ensure the final returned list still contains non-power metrics and that
power-dependent fields are omitted or null when power_gpu_count == 0.
- Around line 325-329: The code incorrectly reads
total_output_tokens_result.avg; since TotalOutputTokensMetric inherits
DerivedSumMetric and the derived scalar is stored as a sum, change the
assignment that computes total_output_tokens to read
total_output_tokens_result.sum instead of .avg (locate the
total_output_tokens_result variable and the total_output_tokens assignment near
the TotalOutputTokensMetric / DerivedSumMetric usage) so the value semantically
matches the DerivedSumMetric behavior.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: a0bffc0c-ed37-496c-ad35-c82557ac1743

📥 Commits

Reviewing files that changed from the base of the PR and between deaf3a6 and 6676436.

📒 Files selected for processing (1)

src/aiperf/gpu_telemetry/accumulator.py

devin-ai-integration

Devin Review found 1 potential issue.

View 3 additional findings in Devin Review.

debermudez

Four issues found during review.

FrankD412 · 2026-04-28T05:04:27Z

Four issues found during review.

Alright -- fixed the issues.

Operator can now point both power-measurement serve drivers at a raw_payload JSONL file (produced e.g. by tools/trtllm_bench_to_aiperf.py) instead of using the built-in synthetic dataset generation. When --input-file is set: - aiperf is invoked with --custom-dataset-type raw_payload --input-file <path> - --synthetic-input-tokens-* and --output-tokens-* flags are dropped - --extra-inputs min_tokens/max_tokens trailers are dropped (per-row max_tokens in the JSONL dictates generation length) - --extra-inputs ignore_eos:true stays (server-behavior knob) --isl and --osl are now sentinel-default None and validated as mutually exclusive with --input-file (rc=2 on collision). When --input-file is unset both are backfilled to 1000 in main(), so existing invocations are byte-for-byte unchanged. --endpoint-type is already a CLI flag (default completions); the operator passes --endpoint-type chat themselves when using --input-file since aiperf's raw_payload loader hard-requires `messages`-shaped JSONL lines. The driver does not auto-flip. summary.json gains an "input_file" key in both drivers so the artifact is self-describing in both modes. The resolved absolute path is stored back into args.input_file after the existence check so summary.json and the aiperf invocation agree on the path. Spec: docs/superpowers/specs/2026-05-27-power-driver-input-file-design.md Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

debermudez

Code Review — PR #803 (power-efficiency metrics)

Reviewed against b4f1f58c (current PR head). Two passes combined:

Branch code review — 10 findings (3 High / 6 Medium / 1 Low).
LLM-ergonomics review — 8 findings (2 High / 5 Medium / 1 Low).

Most findings anchor to specific lines and are posted inline. Two cross-cutting items don't anchor cleanly:

High — Public metrics added without docs updates (ergonomics Axis 6)

This PR adds three new public metrics (total_gpu_power, total_gpu_energy, output_tokens_per_joule), a new unit string (tokens/J), and a new "externally-injected derived metric" pattern, but no docs are updated. CLAUDE.md mandates: "DOCUMENTATION IS REQUIRED, NOT OPTIONAL ... Metrics definitions or formulas → docs/metrics-reference.md". Suggested:

Add a ## GPU Power Efficiency Metrics section to docs/metrics-reference.md with the three tags, units, and formulas (Σ avg(gpu_power_usage), Σ delta(energy_consumption), total_output_tokens / total_gpu_energy).
Add an ### Externally-Injected Derived Metric entry to docs/dev/patterns.md citing power_efficiency_metrics.py as the reference file.
If the pattern warrants it, sync across the four-file rule (AGENTS.md / CLAUDE.md / .github/copilot-instructions.md / .cursor/rules/python.mdc) and run make check-agent-files-sync.

Low — `tools/` debug runner duplication

tools/run_trtllm_bench_with_power.py, tools/run_trtllm_serve_with_power.py, and tools/run_trtllm_serve_with_power_dcgm.py each define their own http_get / http_post / wait_for_server / write_json helpers. Extracting tools/_power_runner_common.py would prevent helper drift (though tools/ is out of strict ergonomics scope).

Runtime reproduction (deferred — should run on a GPU host pre-merge)

# F-04: degenerate phase_stats
aiperf profile -m <model> --no-warmup --request-count 0 --gpu-telemetry pynvml \
  --artifact_dir /tmp/aiperf-f04 --ui simple

# F-08: GPU flags absent when telemetry disabled
aiperf profile -m <model> --request-count 50 --no-gpu-telemetry \
  --artifact_dir /tmp/aiperf-f08 --ui simple
# expect: profile_export_aiperf.json does NOT contain total_gpu_power / total_gpu_energy / output_tokens_per_joule

# F-02: cross-phase energy leak (requires multi-phase plan)
aiperf profile ... --plan two-phase.yaml ...
# expect: phase 2's total_gpu_energy reflects only phase 2 delta

What's good

The asymmetric gauge/counter time-window comment at accumulator.py:407-408 is exactly the "why, not what" comment style CLAUDE.md prescribes, and test_energy_filter_widens_end_ns_while_power_filter_stays_bounded (test_accumulator.py:549) locks the invariant in.
_capture_collector_baseline (manager.py:266-292) cleanly relocates collector.initialize() to the baseline-capture path so the energy counter has a baseline before profile start.
Lambda-wrapped debug logs (accumulator.py:325-338, 364-380) match the "lambda for expensive logs" rule and correctly use uuid=gpu_uuid to dodge the loop-variable closure bug.
Mechanical floor is clean: check-ergonomics (257 baselined, 0 new) and ruff-baselined (305 baselined, 0 new) both pass.

Full review artifacts (local-only): artifacts/code-review.md and artifacts/code-review-2026-05-27/llm-ergonomics-pr-803-feat-power-efficiency-metrics.md.

Replace the magic 1e6 conversion in _sum_gpu_energy_joules with the unit-defined factor so a future rename or rescale of EnergyMetricUnit cannot silently drift the energy aggregation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

…ract Add Returns and Example blocks spelling out the 0-3 per-tag omission conditions, clarify that metric_results is read-only and scanned only for total_output_tokens, document the intentional energy-window widening, and cross-reference sibling summarize() to distinguish sync cross-GPU aggregate (one per phase) from async per-GPU emission (periodic). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

…cUnit Replace the hardcoded "tokens/J" literal in compute_efficiency_metrics with str(GenericMetricUnit.TOKENS_PER_JOULE) so the accumulator-emitted unit can't drift from OutputTokensPerJouleMetric.unit on a rename. Add test_emitted_units_match_metric_class_units pinning the relationship for all three power-efficiency tags rather than the literal string, and update the existing happy-path assertion to use the enum form for the same reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

…s block The Protocol's Returns block described an all-or-nothing contract ("Empty if no GPU data is available for any metric.."), but the implementation in GPUTelemetryAccumulator returns 0-3 results with independent per-tag omission. Rewrite to match (and fix the trailing double-period typo) so agents implementing the Protocol model the actual behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

Rename len_records_results -> records_completed and document why the snapshot is taken before records_results.extend(efficiency_metrics): ProcessRecordsResult.completed must reflect request-derived records only, not derived aggregates (total_gpu_power, total_gpu_energy, output_tokens_per_joule). Add test_completed_excludes_efficiency_metrics in TestRecordsManagerEfficiencyMetricsSnapshot to guard against a future PR reordering or removing the snapshot. Verified the test fails when the snapshot is moved after the extend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

…aders MetricResult.count is stripped to None by to_json_result for DERIVED metrics (record_models.py:70-90), so a "1 of 2 GPUs reporting" run was indistinguishable from "2 of 2" in the CLI audit surface. Bake the valid-data count into each header via a small _gpu_count_suffix helper: total_gpu_power -> "Total GPU Power (N GPUs)" total_gpu_energy -> "Total GPU Energy (N GPUs)" output_tokens_per_joule -> "Output Tokens per Joule (N GPUs)" Singular form ("1 GPU") is used when N == 1. Existing happy-path tests gain header assertions; new test_header_reflects_partial_cohort_count pins the "1 of 2" surface so future regressions are caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

Three convergent issues across TotalGpuPowerMetric, TotalGpuEnergyMetric, and OutputTokensPerJouleMetric — same fix per class: - Class docstring gains an explicit `Invariant:` paragraph naming the injection site (GPUTelemetryAccumulator.compute_efficiency_metrics) and the catching code path (MetricResultsProcessor.update_derived_metrics). Future contributors copy-pasting these as "the derived-metric pattern" now see the contract spelled out. - `_derive_value` return annotation `-> float` -> `-> NoReturn` (typing). The body unconditionally raises, so the previous signature lied; the truthful form lets type-checkers reason about unreachable downstream code. - `NoMetricValue` message rewritten to name the operation (derive from MetricResultsDict), the injection site, and the path that's expected to catch and skip. The previous form named only the source ("X is computed by the GPU telemetry accumulator") and gave agents no clue where the contract is enforced. Add tests/unit/metrics/test_power_efficiency_metrics.py with a parametrized contract test pinning each of the three error-message invariants (tag, operation, injection-site) so a future weakening of the message is caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

…window _process_results previously constructed a TimeRangeFilter for the efficiency-metrics block with two consecutive time.time_ns() fallbacks when phase_stats.start_ns / requests_end_ns were both None. The two calls land nanoseconds apart, producing an effectively zero-width window. Total GPU power (gauge-based) would then either emit a misleading 0.0W as a valid-looking metric or be silently dropped, depending on telemetry sample jitter. Extract the efficiency-metrics block into _apply_gpu_efficiency_metrics with an explicit guard: when either bound is None (the degenerate "no records flowed for this phase" case), log a warning naming the phase and the missing fields and skip the compute call. When both bounds are set, honor them directly with no fallback. Extracting also keeps _process_results under the PLR0912 branch budget that the inline guard would have pushed it over. Add TestRecordsManagerEfficiencyMetricsDegeneratePhase with two cases: both bounds None, and only one None (partial-degenerate). Sanity-checked by reverting the guard — the test failed with the exact ~1us-wide TimeRangeFilter the reviewer described. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

…race compute_efficiency_metrics previously built the energy filter with end_ns=None — the intent was to widen the window by one collection cadence so the trailing counter scrape that lands just after requests_end_ns is included in the delta. The unbounded form, however, captures every post-phase sample present in the (append-only) TelemetryHierarchy: late-finalization drift absorbs cooldown samples into the phase's energy, and in multi-phase runs an earlier-phase window can reach forward into a later phase's samples if the finalization ordering ever changes. Replace end_ns=None with end_ns = phase_end + Environment.GPU.FINAL_SCRAPE_GRACE_NS (new env var, default 666_000_000 ns ~= 2x the default 333 ms COLLECTION_INTERVAL, configurable via AIPERF_GPU_FINAL_SCRAPE_GRACE_NS). Move the rationale from the call site into _sum_gpu_energy_joules where the counter math actually consumes the filter, per reviewer preference. Update test_energy_filter_widens_end_ns_..._stays_bounded to pin the new bounded form, and add test_repeated_calls_use_bounded_energy_window_per_phase: simulates two consecutive phases and asserts each call's energy filter ends at phase.end_ns + grace, not None. The latter test also asserts the grace window stays inside the phase-2 start. Sanity-checked by restoring end_ns=None — both tests fail with "None == (5000000000 + 666000000)", exactly the leak the reviewer described. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

Add a GPU Power Efficiency Metrics section to metrics-reference.md covering the three tags introduced by this PR — total_gpu_power, total_gpu_energy, and output_tokens_per_joule. Each entry follows the existing per-metric format (Type, Formula, Notes) and links to the new Externally-Injected Derived Metric pattern in docs/dev/patterns.md (added in a follow-up commit) so readers can find the contract that explains why _derive_value never fires for these tags. The section also documents: - The bounded energy-window math via FINAL_SCRAPE_GRACE_NS. - The "(N GPUs)" header attribution that distinguishes partial-cohort runs from full ones. - The negative-delta clamp on counter resets. - The independent per-tag omission contract. Update the Table of Contents to include the new section and its three subsection anchors. Per CLAUDE.md "DOCUMENTATION IS REQUIRED, NOT OPTIONAL ... Metrics definitions or formulas -> docs/metrics-reference.md". Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

Add an Externally-Injected Derived Metric Pattern section to docs/dev/patterns.md, slotted alongside the other specialized contract patterns (NaN/Inf Discipline, Safe Filesystem Reads). The new entry captures the three-part contract introduced by the power_efficiency_metrics classes: 1. Invariant: paragraph in the class docstring naming the injection site and the catching path. 2. _derive_value returns NoReturn from typing — the body unconditionally raises, so the truthful annotation is NoReturn rather than the metric's value type. 3. NoMetricValue message names the operation (Cannot derive 'X' from MetricResultsDict), the injection site (GPUTelemetryAccumulator.compute_efficiency_metrics), and the catching path (MetricResultsProcessor.update_derived_metrics). Reference file is power_efficiency_metrics.py; injection site is GPUTelemetryAccumulator.compute_efficiency_metrics. The new metrics- reference section added in the previous commit links here. Skipped the four-file sync (AGENTS.md / CLAUDE.md / copilot-instructions / python.mdc): the reviewer hedged with "if the pattern warrants it", and this is a narrow pattern specific to GPU telemetry + externally computed values, not a project-wide coding standard. The four files already direct readers to docs/dev/patterns.md for code-pattern lookups via the documentation table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

…etrics Add a fourth power-efficiency metric that reports per-user energy footprint: total_gpu_energy / configured concurrency. Lower is better — a more efficient deployment serves the same load for less energy per concurrent user. - New GenericMetricUnit.JOULES_PER_USER ("joules/user"). - New EnergyPerUserMetric class in power_efficiency_metrics.py following the externally-injected derived metric pattern (Invariant: docstring, _derive_value returns NoReturn, structured error message). display_order=903, flags=MetricFlags.NONE (smaller-is-better default). - GPUTelemetryAccumulator.compute_efficiency_metrics now reads the profiling phase's concurrency via run.cfg.get_profiling_phases() and emits energy_per_user when both concurrency is a positive int AND energy_count > 0. Concurrency=None (e.g. pure --request-rate runs) and missing energy data both yield omission, consistent with the other power-efficiency metrics' independent-omission contract. Tests: happy-path now expects 4 metrics (default test config resolves to concurrency=1); multiple-GPUs gains energy_per_user assertion; partial-cohort header check covers the new tag; unit-relationship test extends to EnergyPerUserMetric; three new tests cover concurrency scaling, None-omission, and no-energy-data omission. The _derive_value contract test parametrize gains the new class. Docs: new "Energy per User" subsection in docs/metrics-reference.md under GPU Power Efficiency Metrics; TOC and section intro updated. Function size: compute_efficiency_metrics grew past the 80-line ergonomics cap (101 lines now) and accumulator.py past the 500-line file cap (511 lines). Both baselined pending a follow-up refactor that splits efficiency-metrics construction out of the accumulator. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>

FrankD412 · 2026-05-28T22:36:43Z

Confirmed that the energy_per_user statistic is showing up and things appear to be behaving. Updating and setting to auto-merge once tests complete.

debermudez

Nice work man.

FrankD412 self-assigned this Apr 4, 2026

This comment was marked as resolved.

Sign in to view

FrankD412 changed the title ~~Draft: Initial implementation of power metrics in aiperf.~~ Apr 6, 2026

FrankD412 changed the title ~~Draft: Initial implementation of power metrics in aiperf. [https://linear.app/nvidia/issue/AIP-838/implement-an-energytoken-benchmark-estimation]~~ Apr 6, 2026

FrankD412 force-pushed the feat-power-efficiency-metrics branch from 1238e5e to deaf3a6 Compare April 15, 2026 17:07

coderabbitai Bot reviewed Apr 15, 2026

View reviewed changes

Comment thread docs/superpowers/plans/2026-04-13-monitor.md Outdated

Comment thread src/aiperf/gpu_telemetry/manager.py Outdated

FrankD412 requested review from ajcasagrande and debermudez April 16, 2026 03:03

FrankD412 changed the title ~~Draft: Initial implementation of power metrics in aiperf.~~ Apr 16, 2026

github-actions Bot added the feat label Apr 16, 2026

coderabbitai Bot reviewed Apr 16, 2026

View reviewed changes

Comment thread src/aiperf/gpu_telemetry/accumulator.py Outdated

Comment thread src/aiperf/gpu_telemetry/accumulator.py

Comment thread src/aiperf/gpu_telemetry/accumulator.py Outdated

FrankD412 force-pushed the feat-power-efficiency-metrics branch 2 times, most recently from aea6132 to 279c92d Compare April 16, 2026 03:15

devin-ai-integration Bot reviewed Apr 17, 2026

View reviewed changes

Comment thread src/aiperf/records/records_manager.py Outdated

FrankD412 force-pushed the feat-power-efficiency-metrics branch from 5da4623 to 1ae073e Compare April 24, 2026 16:29

debermudez reviewed Apr 28, 2026

View reviewed changes

Comment thread src/aiperf/records/records_manager.py Outdated

Comment thread src/aiperf/gpu_telemetry/accumulator.py Outdated

Comment thread src/aiperf/gpu_telemetry/accumulator.py Outdated

Comment thread src/aiperf/gpu_telemetry/protocols.py Outdated

FrankD412 force-pushed the feat-power-efficiency-metrics branch from db6810f to 42b1fe5 Compare April 28, 2026 04:57

dynamo-ops reviewed May 6, 2026

View reviewed changes

Comment thread src/aiperf/gpu_telemetry/accumulator.py Outdated

Comment thread src/aiperf/records/records_manager.py Outdated

FrankD412 force-pushed the feat-power-efficiency-metrics branch 2 times, most recently from 320f376 to 32f2c35 Compare May 6, 2026 23:40

FrankD412 force-pushed the feat-power-efficiency-metrics branch 2 times, most recently from 2daad4f to 545a3e9 Compare May 14, 2026 20:57

FrankD412 requested review from debermudez and dynamo-ops May 27, 2026 17:32

debermudez reviewed May 27, 2026

View reviewed changes

FrankD412 and others added 13 commits May 27, 2026 13:53

Merge branch 'main' into feat-power-efficiency-metrics

3405f7c

debermudez enabled auto-merge (squash) May 28, 2026 18:33

debermudez disabled auto-merge May 28, 2026 18:33

FrankD412 disabled auto-merge May 28, 2026 18:33

Merge branch 'main' into feat-power-efficiency-metrics

ee22a89

FrankD412 enabled auto-merge (squash) May 28, 2026 22:36

FrankD412 disabled auto-merge May 28, 2026 22:36

FrankD412 enabled auto-merge (squash) May 28, 2026 22:37

FrankD412 requested a review from debermudez May 28, 2026 22:37

Merge branch 'main' into feat-power-efficiency-metrics

8a9d8aa

debermudez approved these changes May 29, 2026

View reviewed changes

FrankD412 merged commit a90d154 into ai-dynamo:main May 29, 2026
28 of 29 checks passed

FrankD412 deleted the feat-power-efficiency-metrics branch May 29, 2026 18:11

Uh oh!

Conversation

FrankD412 commented Apr 4, 2026 • edited by devin-ai-integration Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

copy-pr-bot Bot commented Apr 4, 2026

github-actions Bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

coderabbitai Bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Poem

This comment was marked as resolved.

Uh oh!

codecov Bot commented Apr 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

FrankD412 commented Apr 6, 2026

FrankD412 commented Apr 13, 2026

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

FrankD412 commented Apr 16, 2026

trtllm-bench

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

Uh oh!

debermudez left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

FrankD412 commented Apr 28, 2026

Uh oh!

Uh oh!

debermudez left a comment

Choose a reason for hiding this comment

Code Review — PR #803 (power-efficiency metrics)

High — Public metrics added without docs updates (ergonomics Axis 6)

Low — tools/ debug runner duplication

Runtime reproduction (deferred — should run on a GPU host pre-merge)

What's good

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

FrankD412 commented May 28, 2026

debermudez left a comment

Choose a reason for hiding this comment

Uh oh!

Labels

4 participants

FrankD412 commented Apr 4, 2026 •

edited by devin-ai-integration Bot

Loading

github-actions Bot commented Apr 4, 2026 •

edited

Loading

coderabbitai Bot commented Apr 4, 2026 •

edited

Loading

codecov Bot commented Apr 4, 2026 •

edited

Loading

`trtllm-bench`

Low — `tools/` debug runner duplication