AIPerf — Release 0.10.0

Summary

AIPerf 0.10.0 is centered on adaptive sweep orchestration, accuracy benchmarks, and power metrics. The release introduces a YAML-native v2 config and an adaptive sweep orchestrator with Bayesian Optimization and search recipes, lands four new DeepEval/lighteval-backed accuracy benchmarks (HellaSwag, BigBench-Hard, AIME 2024, AIME 2025), and ships an initial power-metrics implementation.

On the API side, a new /api/run endpoint exposes run-identity metadata, and a fix keeps /api/results open after benchmark completion. The mock server gains --record-requests for per-request ISL/OSL capture. Ops adds a published aiperf-nightly wheel alongside aiperf and trims the runtime Docker image by excluding the dev dependency group. Security improvements include hardening template path reads against path traversal and tightening the sensitive-token redact list so LLM token-count flags are no longer matched.

Fixes cover preserving auth on CLI concurrency sweeps, BurstGPT CSV auto-detection, ShareGPT multi-turn handling, the plot data loader and model-name handling, profile image-source CLI restoration, tokenizer upgrade hints, and a handful of CI / nightly / docs fixes. Late cherry-picks to release/0.10.0 harden handling of unrecognized OpenAI response object types, make the dashboard and PNG export skip plots with unavailable data instead of erroring, and pull in the Fern release-documentation build.

Key highlights

Config & orchestration: YAML-native v2 config + adaptive sweep orchestrator with Bayesian Optimization and search recipes (#912).
Accuracy benchmarks: HellaSwag (#923), BigBench-Hard (#924), AIME 2024 (#925), AIME 2025 (#926).
Power metrics: Initial implementation of power metrics in aiperf (#803).
API: /api/run endpoint exposing run-identity metadata (#997); /api/results listener stays open after benchmark completes (DYN-701) (#989).
Mock server: --record-requests for per-request ISL/OSL capture (#962).
Nightly & ops: aiperf-nightly wheel published alongside aiperf (#914); runtime Docker image excludes dev dependency group (#1012); GitLab nightly trigger switched to PIPELINE_TYPE=nightly (#1011).
Security: Template path read hardened against path traversal (#977); sensitive-token redact list tightened so LLM token-count flags are no longer matched (#1006).

Features and enhancements

Config and sweep orchestration

YAML-native v2 config + adaptive sweep orchestrator with Bayesian Optimization (BO) and search recipes — #912 (@ajcasagrande).

Accuracy benchmarks

HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) — #923 (@debermudez).
BigBench-Hard DeepEval-backed benchmark (AIP-878) — #924 (@debermudez).
AIME 2024 lighteval-backed benchmark (AIP-875) — #925 (@debermudez).
AIME 2025 lighteval-backed benchmark (AIP-876) — #926 (@debermudez).
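
As a rough illustration of what an exact-match grader does, the sketch below compares a model answer with a reference answer after light normalization and reports accuracy over a batch. The function names and the normalization rules (trimming, case folding, collapsing whitespace) are assumptions for illustration only; the ExactMatch grader shipped in #923 may normalize differently.

```python
from typing import Iterable, Tuple


def _normalize(text: str) -> str:
    # Assumed normalization: trim, case-fold, and collapse runs of whitespace.
    return " ".join(text.strip().casefold().split())


def exact_match(prediction: str, reference: str) -> bool:
    """Return True when the normalized strings are identical."""
    return _normalize(prediction) == _normalize(reference)


def accuracy(pairs: Iterable[Tuple[str, str]]) -> float:
    """Fraction of (prediction, reference) pairs that match exactly."""
    results = [exact_match(pred, ref) for pred, ref in pairs]
    # An empty batch is reported as 0.0 rather than raising.
    return sum(results) / len(results) if results else 0.0
```

For a multiple-choice task such as HellaSwag, the prediction and reference would typically be option labels, so `accuracy([("B", "b"), ("A", "C")])` evaluates to 0.5 under these rules.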

Metrics

Initial implementation of power metrics in aiperf — #803 (@FrankD412).

API

/api/run endpoint exposing run-identity metadata — #997 (@matthewkotila).
Keep /api/results listener open after benchmark completes (DYN-701) — #989 (@FrankD412).
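
A client could read the run-identity metadata with a plain HTTP GET. The sketch below is a minimal example under several assumptions that these notes do not confirm: that the endpoint answers GET requests, that the response body is JSON, and that the API is reachable at a base URL such as the one used in the usage note. The helper names are hypothetical.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def run_endpoint_url(base_url: str) -> str:
    """Build the /api/run URL from a base URL, with or without a trailing slash."""
    return urljoin(base_url.rstrip("/") + "/", "api/run")


def fetch_run_metadata(base_url: str, timeout: float = 5.0) -> dict:
    """GET /api/run and decode the body, assuming it is a JSON object."""
    with urlopen(run_endpoint_url(base_url), timeout=timeout) as response:
        return json.load(response)
```

Usage would look like `fetch_run_metadata("http://localhost:8000")`, where the host and port are placeholders for wherever the AIPerf API is actually exposed.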

Mock server

--record-requests for per-request ISL/OSL capture — #962 (@FrankD412).

Build, runtime, and nightly

Publish aiperf-nightly wheel alongside aiperf — #914 (@saturley-hall).
Docker: exclude dev dependency group from runtime image — #1012 (@saturley-hall).
CI (nightly): switch GitLab trigger to PIPELINE_TYPE=nightly — #1011 (@saturley-hall).
Raise PROFILE_CONFIGURE_TIMEOUT default to 600s — #936 (@matthewkotila).

Security

Harden template path read against path traversal — #977 (@FrankD412).
Redact: tighten sensitive-token list to stop matching LLM token-count flags — #1006 (@matthewkotila).
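
The two security changes above follow well-known patterns, sketched below in generic form. This is not the code from #977 or #1006. The first function rejects any template path that resolves outside a base directory; the second redacts only flags whose names match an explicit sensitive list, so a flag that merely contains the word "tokens" is left alone. All function names, the suffix list, and the example flags (`--api-key`, `--max-tokens`) are hypothetical.

```python
from pathlib import Path
from typing import List

# Hypothetical list of sensitive flag-name suffixes (not AIPerf's actual list).
SENSITIVE_SUFFIXES = ("api-key", "password", "secret", "auth-token")


def resolve_template_path(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, refusing paths that escape it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    # is_relative_to requires Python 3.9+.
    if not candidate.is_relative_to(base):
        raise ValueError(f"template path escapes base directory: {user_path!r}")
    return candidate


def is_sensitive_flag(flag: str) -> bool:
    """Match whole flag names or '-<suffix>' endings, never bare substrings."""
    name = flag.lstrip("-").lower().replace("_", "-")
    return any(name == s or name.endswith("-" + s) for s in SENSITIVE_SUFFIXES)


def redact_args(argv: List[str]) -> List[str]:
    """Replace values of sensitive flags, handling '--f v' and '--f=v' forms."""
    redacted: List[str] = []
    redact_next = False
    for arg in argv:
        if redact_next:
            redacted.append("<redacted>")
            redact_next = False
            continue
        flag, sep, _value = arg.partition("=")
        if flag.startswith("-") and is_sensitive_flag(flag):
            if sep:
                redacted.append(f"{flag}=<redacted>")
            else:
                redacted.append(flag)
                redact_next = True
        else:
            redacted.append(arg)
    return redacted
```

Matching on whole names or suffixes rather than on a substring such as "token" is what keeps a token-count flag like the hypothetical `--max-tokens` out of the redaction set.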

Bug fixes and robustness

Change	PR
Auth: preserve auth for CLI concurrency sweeps	#972
Config: restore profile image source CLI flag	#975
Config: auto-detect BurstGPT CSV in DatasetResolver and pin fixed_schedule regression	#984
Dataset: ShareGPT multi-turn handling	#828
Tokenizer: hint transformers upgrade on missing class (#960)	#971
UI: add NVIDIA global theming	#995
UI: model name for plot	#998
UI: plot data loader (round 2)	#1004
Tests: restore dcgm_fakers after lifespan tests	#999
Tests: fix pytest warning filters	#990
Skill: fix Codex parsing of skill	#948

Cherry-picks to release/0.10.0

Endpoints: handle unrecognized OpenAI response object type without crashing — #1030 (@lkomali).
UI: skip plots with unavailable data instead of erroring in dashboard and PNG export — #1032 (@lkomali).

Documentation

Update server metrics reference for Dynamo / vLLM / SGLang / TRT-LLM / Triton — #974 (@ajcasagrande).
Cherry-pick Fern release documentation build — #1033 (@FrankD412).

Dependencies, chore, and tooling

Bump aiperf version to 0.10.0 — #953 (@saturley-hall).
CI: fix Fern docs version publishing — #949 (@nealvaidya).
CI: export MALLOC_ARENA_MAX=2 before pytest for component_integration — #950 (@ajcasagrande).
CI: fix test selection for adversarial and recipe suites — #957 (@ajcasagrande).
Nightly: repair the nightly run by fixing the test_docs end-to-end test suite and staging locations — #965 (@saturley-hall).
Nightly: unblank CONTAINER_IMAGE in GitLab trigger + forward ARTIFACTORY_REPO_NAME — #973 (@saturley-hall).
Rename ARTIFACTORY_REPO_NAME → ARTIFACTORY_PYPI_REPO_NAME — #978 (@saturley-hall).

Full changelog

Full changelog: v0.9.0…v0.10.0
