AIPerf — Release 0.10.0
Summary
AIPerf 0.10.0 focuses on adaptive sweep orchestration, accuracy benchmarks, and power metrics. The release introduces a YAML-native v2 config and an adaptive sweep orchestrator with Bayesian Optimization and search recipes, lands four new accuracy benchmarks (DeepEval-backed HellaSwag and BigBench-Hard, lighteval-backed AIME 2024 and AIME 2025), and ships an initial power-metrics implementation. The API gains a new /api/run endpoint exposing run-identity metadata, plus a fix that keeps the /api/results listener open after benchmark completion. The mock server gains --record-requests for per-request ISL/OSL capture.
On the ops side, an aiperf-nightly wheel is now published alongside aiperf, and the runtime Docker image is trimmed by excluding the dev dependency group. Security improvements harden template path reads against path traversal and tighten the sensitive-token redact list so that LLM token-count flags are no longer matched.
Fixes cover auth preservation on CLI concurrency sweeps, BurstGPT CSV auto-detection, ShareGPT multi-turn handling, the plot data loader and model-name handling, restoration of the profile image-source CLI flag, tokenizer upgrade hints, and a handful of CI, nightly, and docs fixes. Late cherry-picks to release/0.10.0 harden handling of unrecognized OpenAI response object types, make the dashboard and PNG export skip plots with unavailable data instead of erroring, and pull in the Fern release-documentation build.
Key highlights
- Config & orchestration: YAML-native v2 config + adaptive sweep orchestrator with Bayesian Optimization and search recipes (#912).
- Accuracy benchmarks: HellaSwag (#923), BigBench-Hard (#924), AIME 2024 (#925), AIME 2025 (#926).
- Power metrics: Initial implementation of power metrics in aiperf (#803).
- API: /api/run endpoint exposing run-identity metadata (#997); /api/results listener stays open after benchmark completes (DYN-701) (#989).
- Mock server: --record-requests for per-request ISL/OSL capture (#962).
- Nightly & ops: aiperf-nightly wheel published alongside aiperf (#914); runtime Docker image excludes dev dependency group (#1012); GitLab nightly trigger switched to PIPELINE_TYPE=nightly (#1011).
- Security: Template path read hardened against path traversal (#977); sensitive-token redact list tightened so LLM token-count flags are no longer matched (#1006).
Features and enhancements
Config and sweep orchestration
- YAML-native v2 config + adaptive sweep orchestrator with Bayesian Optimization (BO) and search recipes — #912 (@ajcasagrande).
Accuracy benchmarks
- HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) — #923 (@debermudez).
- BigBench-Hard DeepEval-backed benchmark (AIP-878) — #924 (@debermudez).
- AIME 2024 lighteval-backed benchmark (AIP-875) — #925 (@debermudez).
- AIME 2025 lighteval-backed benchmark (AIP-876) — #926 (@debermudez).
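For readers unfamiliar with exact-match grading, the idea behind a grader like the one listed above can be sketched as follows. This is a minimal illustration, not AIPerf's ExactMatch grader: the function names (`normalize`, `exact_match`, `accuracy`) and the normalization rules (lowercasing, punctuation stripping, whitespace collapsing) are assumptions chosen for the example.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse runs of whitespace.

    Assumed normalization for illustration only; a real grader may differ.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()


def exact_match(prediction: str, reference: str) -> bool:
    """Return True when the normalized strings are identical."""
    return normalize(prediction) == normalize(reference)


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that exactly match their reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    hits = sum(exact_match(p, r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

With this sketch, a multiple-choice answer such as `" (B). "` would be graded as matching the reference `"b"`, while any answer that differs after normalization counts as a miss.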
Metrics
- Initial implementation of power metrics in aiperf — #803 (@FrankD412).
API
- /api/run endpoint exposing run-identity metadata — #997 (@matthewkotila).
- Keep /api/results listener open after benchmark completes (DYN-701) — #989 (@FrankD412).
Mock server
- --record-requests for per-request ISL/OSL capture — #962 (@FrankD412).
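ISL and OSL are the input and output sequence lengths of a request, measured in tokens. The sketch below shows one way per-request capture of these two values could be modeled. It is a hypothetical illustration only: `RequestRecord`, `RequestRecorder`, and the JSON Lines output are assumptions and do not describe the mock server's actual implementation or file format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RequestRecord:
    """One captured request (hypothetical schema)."""
    request_id: str
    isl: int  # input sequence length, in tokens
    osl: int  # output sequence length, in tokens


class RequestRecorder:
    """Accumulates per-request ISL/OSL records in memory."""

    def __init__(self) -> None:
        self._records: list[RequestRecord] = []

    def record(self, request_id: str, isl: int, osl: int) -> None:
        if isl < 0 or osl < 0:
            raise ValueError("sequence lengths must be non-negative")
        self._records.append(RequestRecord(request_id, isl, osl))

    def to_jsonl(self) -> str:
        """Serialize the records as JSON Lines, one request per line."""
        return "\n".join(json.dumps(asdict(r)) for r in self._records)

    def summary(self) -> dict:
        """Return the request count and mean ISL/OSL."""
        n = len(self._records)
        if n == 0:
            return {"count": 0, "mean_isl": 0.0, "mean_osl": 0.0}
        return {
            "count": n,
            "mean_isl": sum(r.isl for r in self._records) / n,
            "mean_osl": sum(r.osl for r in self._records) / n,
        }
```

Capturing the lengths per request, rather than only as aggregates, makes it possible to check afterwards whether the traffic a benchmark actually sent matches the intended ISL/OSL distribution.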
Build, runtime, and nightly
- Publish aiperf-nightly wheel alongside aiperf — #914 (@saturley-hall).
- Docker: exclude dev dependency group from runtime image — #1012 (@saturley-hall).
- CI (nightly): switch GitLab trigger to PIPELINE_TYPE=nightly — #1011 (@saturley-hall).
- Raise PROFILE_CONFIGURE_TIMEOUT default to 600s — #936 (@matthewkotila).
Security
- Harden template path read against path traversal — #977 (@FrankD412).
- Redact: tighten sensitive-token list to stop matching LLM token-count flags — #1006 (@matthewkotila).
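Both security items above correspond to well-known patterns, sketched here in generic form. This is not the code from #977 or #1006: the function names, the key list, and the example keys (`api_key`, `auth-token`, `max_tokens`) are hypothetical. The first function rejects a template path that resolves outside its base directory; the second matches sensitive keys by exact name or suffix, so that a naive substring test on "token" does not also redact token-count settings.

```python
from pathlib import Path

# Hypothetical key names, for illustration only.
SENSITIVE_KEYS = frozenset({"api_key", "token", "password", "authorization"})
SENSITIVE_SUFFIXES = ("_api_key", "_token", "_password")


def resolve_template(base_dir: Path, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting paths that escape it.

    Requires Python 3.9+ for Path.is_relative_to.
    """
    base = base_dir.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    if not candidate.is_relative_to(base):
        raise ValueError(f"template path escapes {base}: {user_path!r}")
    return candidate


def is_sensitive(key: str) -> bool:
    """Match whole key names or suffixes, not the substring 'token'."""
    normalized = key.lstrip("-").replace("-", "_").lower()
    return normalized in SENSITIVE_KEYS or normalized.endswith(SENSITIVE_SUFFIXES)


def redact(settings: dict) -> dict:
    """Return a copy of settings with sensitive values masked."""
    return {k: "***" if is_sensitive(k) else v for k, v in settings.items()}
```

In this sketch, `max_tokens` is left untouched because it ends in `_tokens` rather than `_token`, whereas a check such as `"token" in key` would have redacted it.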
Bug fixes and robustness
| Change | PR |
|---|---|
| Auth: preserve auth for CLI concurrency sweeps | #972 |
| Config: restore profile image source CLI flag | #975 |
| Config: auto-detect BurstGPT CSV in DatasetResolver and pin fixed_schedule regression | #984 |
| Dataset: fix ShareGPT multi-turn handling | #828 |
| Tokenizer: hint transformers upgrade on missing class (#960) | #971 |
| UI: add NVIDIA global theming | #995 |
| UI: fix model name handling for plots | #998 |
| UI: plot data loader (round 2) | #1004 |
| Tests: restore dcgm_fakers after lifespan tests | #999 |
| Tests: fix pytest warning filters | #990 |
| Skill: fix Codex parsing of skill | #948 |
Cherry-picks to release/0.10.0
- Endpoints: handle unrecognized OpenAI response object type without crashing — #1030 (@lkomali).
- UI: skip plots with unavailable data instead of erroring in dashboard and PNG export — #1032 (@lkomali).
Documentation
- Update server metrics reference for Dynamo / vLLM / SGLang / TRT-LLM / Triton — #974 (@ajcasagrande).
- Cherry-pick Fern release documentation build — #1033 (@FrankD412).
Dependencies, chore, and tooling
- Bump aiperf version to 0.10.0 — #953 (@saturley-hall).
- CI: fix Fern docs version publishing — #949 (@nealvaidya).
- CI: export MALLOC_ARENA_MAX=2 before pytest for component_integration — #950 (@ajcasagrande).
- CI: fix test selection for adversarial and recipe suites — #957 (@ajcasagrande).
- Nightly: fix the nightly job by repairing the test_docs end-to-end test suite and staging locations — #965 (@saturley-hall).
- Nightly: unblank CONTAINER_IMAGE in GitLab trigger + forward ARTIFACTORY_REPO_NAME — #973 (@saturley-hall).
- Rename ARTIFACTORY_REPO_NAME → ARTIFACTORY_PYPI_REPO_NAME — #978 (@saturley-hall).
Full changelog
Full changelog: v0.9.0…v0.10.0