feat: Initial implementation of power metrics in aiperf.#803
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8a9d8aa680fea00293f8856c280c212c2c70cc80Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@8a9d8aa680fea00293f8856c280c212c2c70cc80Last updated for commit: |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughAdds GPU efficiency metric computation and an explicit collector entrypoint: a new accumulator method computes total GPU power, energy, and output-tokens-per-joule; collector collection method made public for on-demand baseline captures; records and manager invoke these flows. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
After some local testing, some of the metrics show up but not all. I need to figure out how to run a metric calculation for energy/token and have it show up. |
1238e5e to
deaf3a6
Compare
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/superpowers/plans/2026-04-13-monitor.md`:
- Around line 13-28: Replace the ASCII table under "File Map" with a mermaid
diagram block (```mermaid ... ```) that groups the created source files, the
modified source file, and the created test files as subgraphs; include nodes for
monitor/__init__.py (Package marker), monitor/models.py (GpuSnapshot model),
monitor/gpu.py (query_gpus), monitor/api.py (create_app), monitor/poller.py
(CsvPoller), cli_commands/monitor.py (CLI subcommand), the modified cli.py
(Register monitor command), and the three test files
tests/unit/monitor/__init__.py, test_gpu.py, test_api.py, and test_poller.py,
following the suggested structure so the diagram replaces the table exactly.
In `@src/aiperf/gpu_telemetry/manager.py`:
- Around line 212-215: The broad exception handler in manager.py (the "except
Exception as e" that calls self.warning(f"GPU Telemetry: Failed to capture
baseline from pynvml: {e}")) is missing the BLE001 suppression comment; add "#
noqa: BLE001" to that except line to match other handlers and include the same
short explanatory comment used at lines 240/334/428, so the except Exception as
e line becomes explicitly marked as intentionally broad while keeping the
existing self.warning call intact.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: cd0f4da4-9c3d-4a42-8ace-c06012576f1b
📒 Files selected for processing (9)
docs/superpowers/plans/2026-04-13-monitor.mddocs/superpowers/specs/2026-04-13-monitor-design.mdsrc/aiperf/gpu_telemetry/accumulator.pysrc/aiperf/gpu_telemetry/manager.pysrc/aiperf/gpu_telemetry/protocols.pysrc/aiperf/gpu_telemetry/pynvml_collector.pysrc/aiperf/records/records_manager.pytests/unit/gpu_telemetry/test_pynvml_collector.pytests/unit/gpu_telemetry/test_telemetry_manager.py
✅ Files skipped from review due to trivial changes (1)
- docs/superpowers/specs/2026-04-13-monitor-design.md
🚧 Files skipped from review as they are similar to previous changes (3)
- src/aiperf/records/records_manager.py
- src/aiperf/gpu_telemetry/protocols.py
- src/aiperf/gpu_telemetry/accumulator.py
|
Ok, I was concerned when CSV Export: /home/scratch.fdinatale_coreai/code/aiperf/h200/artifacts/meta-llama_Llama-3.1-70B-Instruct-openai-completions-concurrency1024-request_rateinf/profile_export_aiperf.csv
JSON Export: /home/scratch.fdinatale_coreai/code/aiperf/h200/artifacts/meta-llama_Llama-3.1-70B-Instruct-openai-completions-concurrency1024-request_rateinf/profile_export_aiperf.json
Log File: /home/scratch.fdinatale_coreai/code/aiperf/h200/artifacts/meta-llama_Llama-3.1-70B-Instruct-openai-completions-concurrency1024-request_rateinf/logs/aiperf.log
GPU 0 (NVIDIA H200): 143653.77 J
GPU 1 (NVIDIA H200): 147784.43 J
GPU 2 (NVIDIA H200): 149026.62 J
GPU 3 (NVIDIA H200): 148156.23 J
GPU 4 (NVIDIA H200): 145397.66 J
GPU 5 (NVIDIA H200): 148267.18 J
GPU 6 (NVIDIA H200): 153568.17 J
GPU 7 (NVIDIA H200): 149799.60 J
Total energy: 1185653.65 JEnergy reported by
|
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/aiperf/gpu_telemetry/accumulator.py`:
- Around line 337-368: The current per-GPU metric extraction only catches
NoMetricValue in the blocks calling gpu_data.get_metric_result for
"gpu_power_usage" and "energy_consumption" (power_result / energy_result),
allowing any other exception to bubble up and abort processing; update those
try/except blocks to also catch generic Exception as e (after NoMetricValue) and
log the unexpected error (include the gpu_uuid and exception text) via
self.debug or self.error so the loop continues to the next GPU and overall
efficiency metric aggregation isn't aborted.
- Around line 370-372: The early return when power_gpu_count == 0 in the
accumulator (around power_gpu_count check) is dropping metrics that don't
require instantaneous power (e.g., total_gpu_energy and
output_tokens_per_joule); change the logic in the method containing that check
(look for power_gpu_count, total_gpu_energy, output_tokens_per_joule) so you no
longer return [] when power data is missing—instead compute and include metrics
that can be derived from available energy data and only skip or set to None
those metrics that specifically require gpu_power_usage/instantaneous power;
ensure the final returned list still contains non-power metrics and that
power-dependent fields are omitted or null when power_gpu_count == 0.
- Around line 325-329: The code incorrectly reads
total_output_tokens_result.avg; since TotalOutputTokensMetric inherits
DerivedSumMetric and the derived scalar is stored as a sum, change the
assignment that computes total_output_tokens to read
total_output_tokens_result.sum instead of .avg (locate the
total_output_tokens_result variable and the total_output_tokens assignment near
the TotalOutputTokensMetric / DerivedSumMetric usage) so the value semantically
matches the DerivedSumMetric behavior.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a0bffc0c-ed37-496c-ad35-c82557ac1743
📒 Files selected for processing (1)
src/aiperf/gpu_telemetry/accumulator.py
aea6132 to
279c92d
Compare
5da4623 to
1ae073e
Compare
debermudez
left a comment
There was a problem hiding this comment.
Four issues found during review.
db6810f to
42b1fe5
Compare
Alright -- fixed the issues. |
320f376 to
32f2c35
Compare
2daad4f to
545a3e9
Compare
Operator can now point both power-measurement serve drivers at a raw_payload
JSONL file (produced e.g. by tools/trtllm_bench_to_aiperf.py) instead of
using the built-in synthetic dataset generation. When --input-file is set:
- aiperf is invoked with --custom-dataset-type raw_payload --input-file <path>
- --synthetic-input-tokens-* and --output-tokens-* flags are dropped
- --extra-inputs min_tokens/max_tokens trailers are dropped (per-row
max_tokens in the JSONL dictates generation length)
- --extra-inputs ignore_eos:true stays (server-behavior knob)
--isl and --osl are now sentinel-default None and validated as mutually
exclusive with --input-file (rc=2 on collision). When --input-file is unset
both are backfilled to 1000 in main(), so existing invocations are
byte-for-byte unchanged.
--endpoint-type is already a CLI flag (default completions); the operator
passes --endpoint-type chat themselves when using --input-file since aiperf's
raw_payload loader hard-requires `messages`-shaped JSONL lines. The driver
does not auto-flip.
summary.json gains an "input_file" key in both drivers so the artifact is
self-describing in both modes. The resolved absolute path is stored back
into args.input_file after the existence check so summary.json and the
aiperf invocation agree on the path.
Spec: docs/superpowers/specs/2026-05-27-power-driver-input-file-design.md
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
debermudez
left a comment
There was a problem hiding this comment.
Code Review — PR #803 (power-efficiency metrics)
Reviewed against b4f1f58c (current PR head). Two passes combined:
- Branch code review — 10 findings (3 High / 6 Medium / 1 Low).
- LLM-ergonomics review — 8 findings (2 High / 5 Medium / 1 Low).
Most findings anchor to specific lines and are posted inline. Two cross-cutting items don't anchor cleanly:
High — Public metrics added without docs updates (ergonomics Axis 6)
This PR adds three new public metrics (total_gpu_power, total_gpu_energy, output_tokens_per_joule), a new unit string (tokens/J), and a new "externally-injected derived metric" pattern, but no docs are updated. CLAUDE.md mandates: "DOCUMENTATION IS REQUIRED, NOT OPTIONAL ... Metrics definitions or formulas → docs/metrics-reference.md". Suggested:
- Add a
## GPU Power Efficiency Metricssection todocs/metrics-reference.mdwith the three tags, units, and formulas (Σ avg(gpu_power_usage),Σ delta(energy_consumption),total_output_tokens / total_gpu_energy). - Add an
### Externally-Injected Derived Metricentry todocs/dev/patterns.mdcitingpower_efficiency_metrics.pyas the reference file. - If the pattern warrants it, sync across the four-file rule (
AGENTS.md/CLAUDE.md/.github/copilot-instructions.md/.cursor/rules/python.mdc) and runmake check-agent-files-sync.
Low — tools/ debug runner duplication
tools/run_trtllm_bench_with_power.py, tools/run_trtllm_serve_with_power.py, and tools/run_trtllm_serve_with_power_dcgm.py each define their own http_get / http_post / wait_for_server / write_json helpers. Extracting tools/_power_runner_common.py would prevent helper drift (though tools/ is out of strict ergonomics scope).
Runtime reproduction (deferred — should run on a GPU host pre-merge)
# F-04: degenerate phase_stats
aiperf profile -m <model> --no-warmup --request-count 0 --gpu-telemetry pynvml \
--artifact_dir /tmp/aiperf-f04 --ui simple
# F-08: GPU flags absent when telemetry disabled
aiperf profile -m <model> --request-count 50 --no-gpu-telemetry \
--artifact_dir /tmp/aiperf-f08 --ui simple
# expect: profile_export_aiperf.json does NOT contain total_gpu_power / total_gpu_energy / output_tokens_per_joule
# F-02: cross-phase energy leak (requires multi-phase plan)
aiperf profile ... --plan two-phase.yaml ...
# expect: phase 2's total_gpu_energy reflects only phase 2 deltaWhat's good
- The asymmetric gauge/counter time-window comment at
accumulator.py:407-408is exactly the "why, not what" comment styleCLAUDE.mdprescribes, andtest_energy_filter_widens_end_ns_while_power_filter_stays_bounded(test_accumulator.py:549) locks the invariant in. _capture_collector_baseline(manager.py:266-292) cleanly relocatescollector.initialize()to the baseline-capture path so the energy counter has a baseline before profile start.- Lambda-wrapped debug logs (
accumulator.py:325-338, 364-380) match the "lambda for expensive logs" rule and correctly useuuid=gpu_uuidto dodge the loop-variable closure bug. - Mechanical floor is clean:
check-ergonomics(257 baselined, 0 new) andruff-baselined(305 baselined, 0 new) both pass.
Full review artifacts (local-only): artifacts/code-review.md and artifacts/code-review-2026-05-27/llm-ergonomics-pr-803-feat-power-efficiency-metrics.md.
Replace the magic 1e6 conversion in _sum_gpu_energy_joules with the unit-defined factor so a future rename or rescale of EnergyMetricUnit cannot silently drift the energy aggregation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
…ract Add Returns and Example blocks spelling out the 0-3 per-tag omission conditions, clarify that metric_results is read-only and scanned only for total_output_tokens, document the intentional energy-window widening, and cross-reference sibling summarize() to distinguish sync cross-GPU aggregate (one per phase) from async per-GPU emission (periodic). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
…cUnit Replace the hardcoded "tokens/J" literal in compute_efficiency_metrics with str(GenericMetricUnit.TOKENS_PER_JOULE) so the accumulator-emitted unit can't drift from OutputTokensPerJouleMetric.unit on a rename. Add test_emitted_units_match_metric_class_units pinning the relationship for all three power-efficiency tags rather than the literal string, and update the existing happy-path assertion to use the enum form for the same reason. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
…s block
The Protocol's Returns block described an all-or-nothing contract
("Empty if no GPU data is available for any metric.."), but the
implementation in GPUTelemetryAccumulator returns 0-3 results with
independent per-tag omission. Rewrite to match (and fix the trailing
double-period typo) so agents implementing the Protocol model the
actual behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Rename len_records_results -> records_completed and document why the snapshot is taken before records_results.extend(efficiency_metrics): ProcessRecordsResult.completed must reflect request-derived records only, not derived aggregates (total_gpu_power, total_gpu_energy, output_tokens_per_joule). Add test_completed_excludes_efficiency_metrics in TestRecordsManagerEfficiencyMetricsSnapshot to guard against a future PR reordering or removing the snapshot. Verified the test fails when the snapshot is moved after the extend. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
…aders
MetricResult.count is stripped to None by to_json_result for DERIVED
metrics (record_models.py:70-90), so a "1 of 2 GPUs reporting" run was
indistinguishable from "2 of 2" in the CLI audit surface. Bake the
valid-data count into each header via a small _gpu_count_suffix helper:
total_gpu_power -> "Total GPU Power (N GPUs)"
total_gpu_energy -> "Total GPU Energy (N GPUs)"
output_tokens_per_joule -> "Output Tokens per Joule (N GPUs)"
Singular form ("1 GPU") is used when N == 1. Existing happy-path tests
gain header assertions; new test_header_reflects_partial_cohort_count
pins the "1 of 2" surface so future regressions are caught.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Three convergent issues across TotalGpuPowerMetric, TotalGpuEnergyMetric,
and OutputTokensPerJouleMetric — same fix per class:
- Class docstring gains an explicit `Invariant:` paragraph naming the
injection site (GPUTelemetryAccumulator.compute_efficiency_metrics)
and the catching code path
(MetricResultsProcessor.update_derived_metrics). Future contributors
copy-pasting these as "the derived-metric pattern" now see the
contract spelled out.
- `_derive_value` return annotation `-> float` -> `-> NoReturn` (typing).
The body unconditionally raises, so the previous signature lied; the
truthful form lets type-checkers reason about unreachable downstream
code.
- `NoMetricValue` message rewritten to name the operation
(derive from MetricResultsDict), the injection site, and the path
that's expected to catch and skip. The previous form named only the
source ("X is computed by the GPU telemetry accumulator") and gave
agents no clue where the contract is enforced.
Add tests/unit/metrics/test_power_efficiency_metrics.py with a
parametrized contract test pinning each of the three error-message
invariants (tag, operation, injection-site) so a future weakening of
the message is caught.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
…window _process_results previously constructed a TimeRangeFilter for the efficiency-metrics block with two consecutive time.time_ns() fallbacks when phase_stats.start_ns / requests_end_ns were both None. The two calls land nanoseconds apart, producing an effectively zero-width window. Total GPU power (gauge-based) would then either emit a misleading 0.0W as a valid-looking metric or be silently dropped, depending on telemetry sample jitter. Extract the efficiency-metrics block into _apply_gpu_efficiency_metrics with an explicit guard: when either bound is None (the degenerate "no records flowed for this phase" case), log a warning naming the phase and the missing fields and skip the compute call. When both bounds are set, honor them directly with no fallback. Extracting also keeps _process_results under the PLR0912 branch budget that the inline guard would have pushed it over. Add TestRecordsManagerEfficiencyMetricsDegeneratePhase with two cases: both bounds None, and only one None (partial-degenerate). Sanity-checked by reverting the guard — the test failed with the exact ~1us-wide TimeRangeFilter the reviewer described. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
…race compute_efficiency_metrics previously built the energy filter with end_ns=None — the intent was to widen the window by one collection cadence so the trailing counter scrape that lands just after requests_end_ns is included in the delta. The unbounded form, however, captures every post-phase sample present in the (append-only) TelemetryHierarchy: late-finalization drift absorbs cooldown samples into the phase's energy, and in multi-phase runs an earlier-phase window can reach forward into a later phase's samples if the finalization ordering ever changes. Replace end_ns=None with end_ns = phase_end + Environment.GPU.FINAL_SCRAPE_GRACE_NS (new env var, default 666_000_000 ns ~= 2x the default 333 ms COLLECTION_INTERVAL, configurable via AIPERF_GPU_FINAL_SCRAPE_GRACE_NS). Move the rationale from the call site into _sum_gpu_energy_joules where the counter math actually consumes the filter, per reviewer preference. Update test_energy_filter_widens_end_ns_..._stays_bounded to pin the new bounded form, and add test_repeated_calls_use_bounded_energy_window_per_phase: simulates two consecutive phases and asserts each call's energy filter ends at phase.end_ns + grace, not None. The latter test also asserts the grace window stays inside the phase-2 start. Sanity-checked by restoring end_ns=None — both tests fail with "None == (5000000000 + 666000000)", exactly the leak the reviewer described. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Add a GPU Power Efficiency Metrics section to metrics-reference.md
covering the three tags introduced by this PR — total_gpu_power,
total_gpu_energy, and output_tokens_per_joule. Each entry follows the
existing per-metric format (Type, Formula, Notes) and links to the
new Externally-Injected Derived Metric pattern in docs/dev/patterns.md
(added in a follow-up commit) so readers can find the contract that
explains why _derive_value never fires for these tags.
The section also documents:
- The bounded energy-window math via FINAL_SCRAPE_GRACE_NS.
- The "(N GPUs)" header attribution that distinguishes partial-cohort
runs from full ones.
- The negative-delta clamp on counter resets.
- The independent per-tag omission contract.
Update the Table of Contents to include the new section and its three
subsection anchors.
Per CLAUDE.md "DOCUMENTATION IS REQUIRED, NOT OPTIONAL ... Metrics
definitions or formulas -> docs/metrics-reference.md".
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Add an Externally-Injected Derived Metric Pattern section to
docs/dev/patterns.md, slotted alongside the other specialized contract
patterns (NaN/Inf Discipline, Safe Filesystem Reads). The new entry
captures the three-part contract introduced by the power_efficiency_metrics
classes:
1. Invariant: paragraph in the class docstring naming the injection
site and the catching path.
2. _derive_value returns NoReturn from typing — the body unconditionally
raises, so the truthful annotation is NoReturn rather than the
metric's value type.
3. NoMetricValue message names the operation (Cannot derive 'X' from
MetricResultsDict), the injection site
(GPUTelemetryAccumulator.compute_efficiency_metrics), and the
catching path (MetricResultsProcessor.update_derived_metrics).
Reference file is power_efficiency_metrics.py; injection site is
GPUTelemetryAccumulator.compute_efficiency_metrics. The new metrics-
reference section added in the previous commit links here.
Skipped the four-file sync (AGENTS.md / CLAUDE.md / copilot-instructions
/ python.mdc): the reviewer hedged with "if the pattern warrants it",
and this is a narrow pattern specific to GPU telemetry + externally
computed values, not a project-wide coding standard. The four files
already direct readers to docs/dev/patterns.md for code-pattern lookups
via the documentation table.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
…etrics
Add a fourth power-efficiency metric that reports per-user energy
footprint: total_gpu_energy / configured concurrency. Lower is better
— a more efficient deployment serves the same load for less energy
per concurrent user.
- New GenericMetricUnit.JOULES_PER_USER ("joules/user").
- New EnergyPerUserMetric class in power_efficiency_metrics.py
following the externally-injected derived metric pattern
(Invariant: docstring, _derive_value returns NoReturn, structured
error message). display_order=903, flags=MetricFlags.NONE
(smaller-is-better default).
- GPUTelemetryAccumulator.compute_efficiency_metrics now reads the
profiling phase's concurrency via run.cfg.get_profiling_phases()
and emits energy_per_user when both concurrency is a positive int
AND energy_count > 0. Concurrency=None (e.g. pure --request-rate
runs) and missing energy data both yield omission, consistent
with the other power-efficiency metrics' independent-omission
contract.
Tests: happy-path now expects 4 metrics (default test config resolves
to concurrency=1); multiple-GPUs gains energy_per_user assertion;
partial-cohort header check covers the new tag; unit-relationship
test extends to EnergyPerUserMetric; three new tests cover concurrency
scaling, None-omission, and no-energy-data omission. The _derive_value
contract test parametrize gains the new class.
Docs: new "Energy per User" subsection in docs/metrics-reference.md
under GPU Power Efficiency Metrics; TOC and section intro updated.
Function size: compute_efficiency_metrics grew past the 80-line
ergonomics cap (101 lines now) and accumulator.py past the 500-line
file cap (511 lines). Both baselined pending a follow-up refactor
that splits efficiency-metrics construction out of the accumulator.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
|
Confirmed that the |
This PR adds some basic power metrics leveraging GPU telemetry to compute the amount of power used during the benchmark.
Summary by CodeRabbit
New Features
Tests
Chores