AIPerf v0.10.0 Release

@saturley-hall released this 09 Jun 14:35 (commit bf70bd9)

AIPerf — Release 0.10.0

Summary

AIPerf 0.10.0 is centered on adaptive sweep orchestration, accuracy benchmarks, and power metrics. The release introduces a YAML-native v2 config and an adaptive sweep orchestrator with Bayesian Optimization and search recipes, lands four new DeepEval/lighteval-backed accuracy benchmarks (HellaSwag, BigBench-Hard, AIME 2024, AIME 2025), ships an initial power-metrics implementation, and adds API surface: a new /api/run endpoint exposing run-identity metadata and a fix that keeps /api/results open after benchmark completion. The mock server gains --record-requests for per-request ISL/OSL capture.

Ops adds a published aiperf-nightly wheel alongside aiperf and trims the runtime Docker image by excluding the dev dependency group. Security improvements include hardening template path reads against path traversal and tightening the sensitive-token redact list so LLM token-count flags are no longer matched.

Fixes cover preserving auth on CLI concurrency sweeps, BurstGPT CSV auto-detection, ShareGPT multi-turn handling, the plot data loader and model-name handling, profile image-source CLI restoration, tokenizer upgrade hints, and a handful of CI / nightly / docs fixes. Late cherry-picks to release/0.10.0 harden handling of unrecognized OpenAI response object types, make the dashboard and PNG export skip plots with unavailable data instead of erroring, and pull in the Fern release-documentation build.

Key highlights

  • Config & orchestration: YAML-native v2 config + adaptive sweep orchestrator with Bayesian Optimization and search recipes (#912).
  • Accuracy benchmarks: HellaSwag (#923), BigBench-Hard (#924), AIME 2024 (#925), AIME 2025 (#926).
  • Power metrics: Initial implementation of power metrics in aiperf (#803).
  • API: /api/run endpoint exposing run-identity metadata (#997); /api/results listener stays open after benchmark completes (DYN-701) (#989).
  • Mock server: --record-requests for per-request ISL/OSL capture (#962).
  • Nightly & ops: aiperf-nightly wheel published alongside aiperf (#914); runtime Docker image excludes dev dependency group (#1012); GitLab nightly trigger switched to PIPELINE_TYPE=nightly (#1011).
  • Security: Template path read hardened against path traversal (#977); sensitive-token redact list tightened so LLM token-count flags are no longer matched (#1006).

Features and enhancements

Config and sweep orchestration

  • YAML-native v2 config + adaptive sweep orchestrator with Bayesian Optimization (BO) and search recipes — #912 (@ajcasagrande).

Accuracy benchmarks

  • HellaSwag DeepEval-backed benchmark + ExactMatch grader (AIP-877) — #923 (@debermudez).
  • BigBench-Hard DeepEval-backed benchmark (AIP-878) — #924 (@debermudez).
  • AIME 2024 lighteval-backed benchmark (AIP-875) — #925 (@debermudez).
  • AIME 2025 lighteval-backed benchmark (AIP-876) — #926 (@debermudez).
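
These notes name an ExactMatch grader but do not describe how it scores answers. As a rough illustration of what exact-match grading usually means, here is a minimal sketch. The function names `normalize` and `exact_match_accuracy` and the normalization rules (trim, lowercase, collapse whitespace) are assumptions made for this example, not AIPerf's or DeepEval's actual implementation.

```python
def normalize(text: str) -> str:
    """Trim, lowercase, and collapse internal whitespace (assumed rules)."""
    return " ".join(text.strip().lower().split())


def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions equal to their reference after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        return 0.0
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Multiple-choice tasks such as HellaSwag suit this kind of scoring because the expected answer is a short label; free-form answers generally need a more forgiving comparison.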

Metrics

  • Initial implementation of power metrics in aiperf — #803 (@FrankD412).

API

  • /api/run endpoint exposing run-identity metadata — #997 (@matthewkotila).
  • Keep /api/results listener open after benchmark completes (DYN-701) — #989 (@FrankD412).
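
A client might read the run-identity endpoint along the following lines. This is only a sketch: the host and port in `BASE_URL` are hypothetical, the helper names are invented for the example, and the response is assumed to be JSON and treated as opaque because these notes do not document its schema.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

BASE_URL = "http://localhost:8080"  # hypothetical host and port, not from the release notes


def endpoint_url(base_url: str, path: str) -> str:
    """Join a base URL and an endpoint path without doubling or dropping slashes."""
    return urljoin(base_url.rstrip("/") + "/", path.lstrip("/"))


def fetch_json(url: str, timeout: float = 5.0):
    """GET the URL and decode the body as JSON (assumes a JSON response)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # The payload is printed as-is because its fields are not specified here.
    print(fetch_json(endpoint_url(BASE_URL, "/api/run")))
```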

Mock server

  • --record-requests for per-request ISL/OSL capture — #962 (@FrankD412).

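Build, runtime, and nightly

  • aiperf-nightly wheel published alongside aiperf — #914.
  • Runtime Docker image excludes the dev dependency group — #1012.
  • GitLab nightly trigger switched to PIPELINE_TYPE=nightly — #1011.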

Security

  • Harden template path read against path traversal — #977 (@FrankD412).
  • Redact: tighten sensitive-token list to stop matching LLM token-count flags — #1006 (@matthewkotila).
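
For readers unfamiliar with path-traversal hardening, the usual pattern is to resolve the requested path and confirm it still lies inside an allowed base directory. The sketch below shows that general technique only; `resolve_inside` is a hypothetical helper, and nothing here is taken from the AIPerf change itself.

```python
from pathlib import Path


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    # resolve() collapses ".." segments and follows symlinks before the check.
    candidate = (base / user_path).resolve()
    # Path.is_relative_to requires Python 3.9 or newer.
    if not candidate.is_relative_to(base):
        raise ValueError(f"path escapes base directory: {user_path!r}")
    return candidate
```

Checking the resolved path, not the raw string, is what catches inputs such as `../secret` or an absolute path that replaces the base entirely.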

Bug fixes and robustness

| Change | PR |
|---|---|
| Auth: preserve auth for CLI concurrency sweeps | #972 |
| Config: restore profile image source CLI flag | #975 |
| Config: auto-detect BurstGPT CSV in DatasetResolver and pin fixed_schedule regression | #984 |
| Dataset: ShareGPT multi-turn handling | #828 |
| Tokenizer: hint transformers upgrade on missing class (#960) | #971 |
| UI: add NVIDIA global theming | #995 |
| UI: model name for plot | #998 |
| UI: plot data loader (round 2) | #1004 |
| Tests: restore dcgm_fakers after lifespan tests | #999 |
| Tests: fix pytest warning filters | #990 |
| Skill: fix Codex parsing of skill | #948 |

Cherry-picks to release/0.10.0

  • Endpoints: handle unrecognized OpenAI response object type without crashing — #1030 (@lkomali).
  • UI: skip plots with unavailable data instead of erroring in dashboard and PNG export — #1032 (@lkomali).
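
The first cherry-pick above concerns tolerating response payloads whose `object` field is not one the parser knows. A minimal sketch of that idea is a dispatch table with a logged fallback in place of an exception. The parser functions and the `PARSERS` table are hypothetical and only loosely modeled on the OpenAI response shapes; they are not the AIPerf code.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def _from_chat_completion(payload: dict[str, Any]) -> Optional[str]:
    choices = payload.get("choices") or []
    if not choices:
        return None
    return (choices[0].get("message") or {}).get("content")


def _from_chat_chunk(payload: dict[str, Any]) -> Optional[str]:
    choices = payload.get("choices") or []
    if not choices:
        return None
    return (choices[0].get("delta") or {}).get("content")


def _from_text_completion(payload: dict[str, Any]) -> Optional[str]:
    choices = payload.get("choices") or []
    if not choices:
        return None
    return choices[0].get("text")


# Known values of the "object" field mapped to their parsers.
PARSERS: dict[str, Callable[[dict[str, Any]], Optional[str]]] = {
    "chat.completion": _from_chat_completion,
    "chat.completion.chunk": _from_chat_chunk,
    "text_completion": _from_text_completion,
}


def extract_text(payload: dict[str, Any]) -> Optional[str]:
    """Return the generated text, or None when the object type is unknown."""
    object_type = payload.get("object")
    parser = PARSERS.get(object_type)
    if parser is None:
        # Log and skip so one unexpected payload does not abort the run.
        logger.warning("Skipping unrecognized response object type: %r", object_type)
        return None
    return parser(payload)
```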

Documentation

  • Update server metrics reference for Dynamo / vLLM / SGLang / TRT-LLM / Triton — #974 (@ajcasagrande).
  • Cherry-pick Fern release documentation build — #1033 (@FrankD412).

Dependencies, chore, and tooling

  • Bump aiperf version to 0.10.0 — #953 (@saturley-hall).
  • CI: fix Fern docs version publishing — #949 (@nealvaidya).
  • CI: export MALLOC_ARENA_MAX=2 before pytest for component_integration — #950 (@ajcasagrande).
  • CI: fix test selection for adversarial and recipe suites — #957 (@ajcasagrande).
  • Nightly: fix nightly by fixing test_docs end-to-end test suite and staging locations — #965 (@saturley-hall).
  • Nightly: unblank CONTAINER_IMAGE in GitLab trigger + forward ARTIFACTORY_REPO_NAME — #973 (@saturley-hall).
  • Rename ARTIFACTORY_REPO_NAME → ARTIFACTORY_PYPI_REPO_NAME — #978 (@saturley-hall).
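
MALLOC_ARENA_MAX, mentioned in the CI item above, is a glibc setting that caps the number of malloc arenas, which tends to lower the memory footprint of heavily threaded processes. The CI change exports it in the shell before pytest; a local equivalent could look like the sketch below. The helper name `run_with_arena_cap` and the test path in the usage block are hypothetical.

```python
import os
import subprocess
import sys


def run_with_arena_cap(command: list[str], arena_max: int = 2) -> subprocess.CompletedProcess:
    """Run command with MALLOC_ARENA_MAX set in the child's environment."""
    env = dict(os.environ, MALLOC_ARENA_MAX=str(arena_max))
    return subprocess.run(command, env=env, check=False, capture_output=True, text=True)


if __name__ == "__main__":
    # Hypothetical test path; substitute the suite you actually want to run.
    result = run_with_arena_cap([sys.executable, "-m", "pytest", "tests/component_integration"])
    print(result.stdout)
```

The variable only affects processes linked against glibc, so it has no effect on macOS or Windows.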

Full changelog

Full changelog: v0.9.0…v0.10.0