feat: YAML-native v2 config + adaptive sweep orchestrator with BO & search recipes#912
Merged
Try out this PR
Quick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c357b7c48b1ec429317782209820491f88ecf396
Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c357b7c48b1ec429317782209820491f88ecf396
…g, search recipes
Major reorganization that ports the parameter-sweep orchestrator into a
first-class subsystem, introduces a full YAML-driven AIPerf configuration
language alongside the existing CLI, and ships a library of reusable
"search recipes" plus a Bayesian/adaptive search-planner stack.
Orchestrator
- New executor abstraction (`orchestrator/executor.py`, `local_executor.py`)
decouples per-cell run execution from the orchestrator loop.
- New `search_planner/` package with pluggable planners:
- `bayesian.py` + `_botorch_kernel.py` (BoTorch GP / DSP kernel)
- `optuna_planner.py` + `_optuna_helpers.py` (multi-objective via Optuna)
- `monotonic.py` + `_monotonic_boundary.py`
- `smooth_isotonic.py` + `_smooth_isotonic_{fit,boundary,phases}.py`
- Shared helpers: cliff detection, margin normalization, replicate
budget, pooled percentile, SLA helpers, outcome constraints.
- Sweep aggregation grows `sweep_sla_filter.py` and multi-objective
search-history export (`exporters/search_history.py`).
- New convergence strategy hooks, JSONL loader, subprocess runner
refactors, and per-cell callbacks.
YAML configuration (`src/aiperf/config/`)
- Replaces the old `src/aiperf/common/config/` package with a layered
design: typed config models, loader (Jinja2 templating, env-var
interpolation, dotted-path overrides, duration parsing, normalizers,
strict-undefined plan), flags converter (CLI <-> YAML), resolution
layer (predicates + resolvers + plan), sweep DSL (grid, QMC/Sobol,
adaptive, multi-run, sampling, distributions), and a public JSON
schema (`config/schema/aiperf-config.schema.json`).
- Bundled template library (`config/templates/`): 20+ ready-to-run
YAMLs covering minimal, latency, goodput SLO, long context, ramping,
multi-turn, multimodal vision/audio, embeddings, fixed schedule,
trace replay, KV cache test, multi-URL load balancing, sweep with
plot, sweep distributions, warmup profiling, request cancellation,
Jinja2 variables, env-var production, inline dataset, scenario
workload profiles, GPU telemetry, HTTP trace metrics, user files,
speed bench sweep, plus reference trace JSONL data.
- Communications config split into `comm/` (TCP, IPC, dual-bind, build).
- Dataset config split into `dataset/` (content, resolver, trace,
video) with inline-record support.
Search recipes (`src/aiperf/search_recipes/`)
- New recipe registry with built-ins: `max_concurrency_under_sla`,
`max_goodput_under_slo`, `sla_breach_knee`, `itl_surface_fit`,
`ttft_curve_fit`, and Pareto sweep (axes, dominance, export, parser).
- Recipe post-process hooks with shared infrastructure.
CLI / runner
- `cli_runner.py` factored into `_cli_runner_helpers.py`,
`_cli_runner_sweep_helpers.py`, `_cli_runner_post_process.py`, plus
`_sweep_table_logger.py` for live progress rendering.
- New `aiperf config` command for template discovery and validation.
- Profile/plot/service commands updated for the new config layer.
Auto-plot envelope
- `plot/auto_plot.py` materializes a resolved plot envelope into the
artifact dir so `aiperf plot <dir>` reproduces the chart pipeline.
- `PlotEnvelopeConfig` allows a single AIPerf YAML to own its
visualization.
Finite / numeric invariants
- New `common/finite.py` with `FiniteFloat`, `scrub_non_finite`,
`nan_safe_mean`, `nan_safe_std`, `is_finite_value`.
- Property test corpus (`tests/unit/property/`) with field/bounds
baselines, finite invariants, Pydantic field fuzz, and config-dump
round-trip checks. CI ratchets these to zero.
Metrics
- New `good_request_fraction_metric` for goodput SLO recipes.
- Records, exporters, and post-processors threaded with redaction and
finite-value scrubbing on export boundaries.
Mock server (`tests/aiperf_mock_server/`)
- New deterministic `scheduler.py` for repeatable latency simulation.
- Robustness and scheduler test suites for CI coverage.
Plugins
- New orchestrator plugin categories and schema
(`plugin/schema/_orchestrator_schemas.py`,
`plugin/categories.yaml` updates) for executors, planners, recipes.
Testing
- Extensive new unit/component/integration suites under
`tests/unit/{config,orchestrator,search_recipes,search_planner,
cli_runner,property}/` and adversarial chaos scripts under
`tests/scripts/chaos/`.
- New component-integration smoke tests for Sobol sweeps,
multi-objective E2E, process-title, and recipe collapse-knee.
Docs
- New tutorials: sweeps, adaptive search, auto-plot, inline datasets,
YAML config, YAML distributions.
- New developer docs: sweep orchestrator design, global invariants,
YAML config roadmap and future goals.
- New sweeping reference: bayesian optimization, search recipes,
space-filling designs; new troubleshooting/sweeps guide.
- New API reference: search history.
- Regenerated `docs/cli-options.md` and `docs/environment-variables.md`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
20e5ca9 to
d2da6f2
Compare
Three issues caused 'Validate (and publish on main) synced docs' to fail:
1. `environment-variables.md:148` had a literal `{"tight": 20000}` JSON
example in prose. MDX parses `{...}` as a JSX expression and `"tight":
20000` is not a valid JS expression. Fixed by wrapping the override
example in backticks (inline code) in the source description string in
`common/environment.py` (which auto-generates the doc).
2. `dev/sweep-orchestrator.md:443` had a literal `<= max_iter` in a
table cell. MDX sees `<` as the start of a JSX element and chokes
on `=`. Replaced with the Unicode `≤` (the same doc already uses
Unicode symbols such as `×`).
3. Three relative `../../src/...` links to source files broke under
`--strict-broken-links` because Fern publishes from `fern/pages-dev/`
which has no `../../src/` parent. Converted to absolute
`github.com/ai-dynamo/aiperf/blob/main/...` URLs, matching the
pattern already used in `docs/api/search-history.md` and
`docs/reproducibility.md`.
Verified locally by cloning the `docs-website` branch, syncing the PR's
`docs/` into `fern/pages-dev/`, running `md_to_mdx.py`, and getting
`fern check --warnings --strict-broken-links` to 0 errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
CI pre-commit `test-imports` hook was failing with:
aiperf.orchestrator.search_planner._botorch_kernel:
ModuleNotFoundError("No module named 'torch'")
torch/gpytorch live behind the `[optuna]` extra and are not installed
in the lint/pre-commit env. The module was importing them at top level
even though only `make_dsp_kernel` uses them. Moved the imports inside
the function and kept a TYPE_CHECKING import for the `ScaleKernel`
return annotation. Module now imports cleanly without the extra;
calling `make_dsp_kernel` without it still raises a clear
ModuleNotFoundError, matching the existing optuna-gated UX.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
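The deferred-import pattern described in the commit above can be sketched roughly as follows. This is a minimal illustration, not the actual `_botorch_kernel.py`: `make_kernel_sketch` is a hypothetical stand-in for `make_dsp_kernel`, and the Matern base kernel is an assumption chosen only so the example has something to construct.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; never executed at runtime, so the
    # module imports cleanly when gpytorch is not installed.
    from gpytorch.kernels import ScaleKernel


def make_kernel_sketch() -> ScaleKernel:
    """Hypothetical factory standing in for make_dsp_kernel."""
    # Deferred import: a missing optional dependency surfaces here as
    # ModuleNotFoundError, at call time rather than at import time.
    from gpytorch.kernels import MaternKernel, ScaleKernel

    # Assumption: any base kernel would do for the illustration.
    return ScaleKernel(MaternKernel(nu=2.5))
```

Because of `from __future__ import annotations`, the return annotation is stored as a string and never evaluated, which is what lets the `TYPE_CHECKING`-only import work.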
…tests
`tests/unit/config/test_v1_file_dataset_rejections.py` hardcoded
``input_file="/tmp/mc.jsonl"`` with a comment claiming "path doesn't have
to exist for converter". That's false: ``CLIConfig.input_file`` runs the
``parse_file`` validator in ``src/aiperf/config/loader/parsing.py`` which
requires the path to be an existing file or directory. The tests only
passed on dev machines where ``/tmp/mc.jsonl`` happened to exist from
prior runs; CI fails all 7 cases with ``ValidationError: '/tmp/mc.jsonl'
is not a valid file or directory``.
Switched to a per-test ``mc_jsonl`` fixture that creates an empty JSONL
under pytest's ``tmp_path``. The converter only reads the *path*, not the
file contents, so an empty file is sufficient.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Keep the four agent instruction files synchronized while moving detailed finite-value guidance to the canonical docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
debermudez reviewed May 14, 2026
debermudez (Contributor) left a comment
This is a thorough, well-structured architectural overhaul. The new MultiRunOrchestrator, v2 config system, and adaptive search planner are solid. The issues below are mostly low-severity cleanup items, with one medium-priority data-loss risk in the adaptive cancel path.
13 findings: 0 critical, 0 high, 1 medium, 9 low, 3 nit.
Merge the latest mainline changes into the sweep orchestrator branch while keeping the branch's config-v2 model authoritative. OTel and MLflow now live as first-class benchmark config groups with runtime call sites using native nested access only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Drop the BenchmarkConfig flat-forwarding properties so call sites read cfg.artifacts, cfg.mlflow, cfg.otel, and run.benchmark_id directly. Tests build real BenchmarkConfig / BenchmarkRun instead of CLIConfig DTO mocks. The OTel fanout subprocess now consumes the native MLflowConfig. Split src/aiperf/config/artifacts.py into one section per file matching the benchmark fields: mlflow, otel, server_metrics, gpu_telemetry. Same for TokenizerConfig, LoggingConfig, and SLOsConfig out of models.py and runtime.py. External imports through aiperf.config are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ugin metadata
Replace the three hard-coded dicts in user_config.py
(_LOCAL_COLLECTOR_KEYWORDS, _LOCAL_ONLY_COLLECTORS,
_LOCAL_COLLECTOR_INSTALL_HINTS) with a typed metadata schema on the
gpu_telemetry_collector plugin category. Adding a new local collector
now only requires editing plugins.yaml: `is_local: true` plus an
`install_hint`. Module name defaults to the plugin name when
`import_module` is unset.
- New `GPUTelemetryCollectorMetadata` Pydantic class (is_local,
import_module, install_hint) wired via `metadata_class` in
categories.yaml.
- `get_gpu_telemetry_collector_metadata` helper +
`_CATEGORY_METADATA_CLASSES` registration in plugin/plugins.py,
matching the existing get_endpoint/plot/service_metadata pattern.
- Metadata populated for pynvml and amdsmi (dcgm relies on
`is_local: false` default).
- user_config helpers `_local_collector_keywords`,
`_is_local_collector`, and `_ensure_local_collector_importable` consult
plugin metadata. The "Invalid GPU telemetry item" error message
self-derives from the keywords dict, so a new local collector flows
through with no edits to error text.
- New test
`test_local_collector_discovered_dynamically_from_plugin_metadata`
registers a fake collector via `mock_plugin` (and a matching enum
extension) to prove selection, conflict detection, local-vs-URL
guardrail, and error message derivation all generalize beyond
pynvml/amdsmi.
Behavior preserved: every existing test passes because
`str(GPUTelemetryCollectorType.PYNVML) == "pynvml"`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
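The metadata-driven check described above could be sketched as follows. The field names (`is_local`, `import_module`, `install_hint`) and the "module name defaults to the plugin name" rule come from the commit message; everything else is an assumption. In particular, this uses a stdlib dataclass instead of the Pydantic class the commit mentions, and `ensure_importable` is a hypothetical name.

```python
from __future__ import annotations

import importlib.util
from dataclasses import dataclass


@dataclass(frozen=True)
class CollectorMetadataSketch:
    """Stdlib stand-in for a Pydantic metadata model (assumption)."""

    is_local: bool = False
    import_module: str | None = None
    install_hint: str | None = None


def module_name_for(plugin_name: str, meta: CollectorMetadataSketch) -> str:
    # Module name falls back to the plugin name when import_module is unset.
    return meta.import_module or plugin_name


def ensure_importable(plugin_name: str, meta: CollectorMetadataSketch) -> None:
    """Raise ImportError with the install hint if a local collector is missing."""
    if not meta.is_local:
        return  # remote collectors need no local Python module
    module = module_name_for(plugin_name, meta)
    if importlib.util.find_spec(module) is None:
        hint = meta.install_hint or f"install the '{module}' package"
        raise ImportError(f"collector '{plugin_name}' needs '{module}': {hint}")
```

With this shape, adding a collector means adding one metadata entry; no lookup dicts or error strings need editing.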
Move each *Defaults dataclass next to the config section it belongs to: EndpointDefaults, OutputDefaults, MLflowDefaults, TokenizerDefaults, ServiceDefaults into endpoint/artifacts/mlflow/tokenizer/runtime modules. Dataset and prompt modality defaults move to dataset/defaults.py. The aggregator src/aiperf/config/defaults.py is deleted. External imports through aiperf.config are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Drive local-collector classification from plugin metadata (GPUTelemetryCollectorMetadata) so adding a new local collector only requires editing plugins.yaml. The original PR refactored the v1 aiperf.common.config.user_config module that no longer exists on this branch; the equivalent metadata-driven validation now lives on GpuTelemetryConfig (local-vs-URL guardrail and install-hint surfacing) and the CLI converter derives local keywords from plugin metadata. The AMD ROCm amdsmi collector ships with this merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Medium:
- execute_adaptive_search: cancel path now writes search_history.json
before returning (previously left stale history after cancellation).
Factored a `_flush_history` closure to share the write across the
cancel, converged, and per-iteration paths.
Low / nit cleanup:
- Add `SearchPlanner.iter_count` property; orchestrator no longer reads
the private `planner._iter`.
- Remove dead `_build_convergence_criterion` copy in orchestrator.py
(canonical version lives in `_cli_runner_helpers`).
- Drop the duplicate `_plan_iteration_order` in orchestrator.py; import
the canonical one from `_cli_runner_sweep_helpers`.
- Remove dead `LocalSubprocessExecutor._write_redacted_config`
(`EndpointConfig.api_key` field_serializer already redacts).
- Replace `# noqa: ANN202` placeholders with concrete return types on
`_build_convergence_criterion`, `_build_search_planner`,
`_maybe_compute_detailed`, and `_setup_ui_queues`.
- `_log_failed_sweep_variations`: extract a single `_format_key` helper
and stop double-formatting the key string between the warning summary
and the per-run loop.
- `_summarize_and_export`: run per-variation and sweep-aggregate exports
concurrently under one `asyncio.run(gather(...))` instead of two
sequential `asyncio.run` calls.
- Distinguish "1 successful run" vs "1 variation succeeded" wording in
the aggregate-summary warning so sweep users see accurate language.
- Collapse `FixedTrialsStrategy._sanitize_label` into the module-level
helper it duplicated verbatim.
- Remove the unused `_scrub_non_finite` shim in
`_cli_runner_post_process`; the one test caller now imports
`scrub_non_finite` from `aiperf.common.finite`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
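The shared-flush idea behind the medium-severity fix can be illustrated with a small sketch. Only the closure name `_flush_history` and the file name `search_history.json` come from the commit message; the loop, the `status` field, and the callback parameters are assumptions made for the example.

```python
import json
from collections.abc import Callable, Iterable
from pathlib import Path


def run_search_sketch(
    points: Iterable[float],
    out_dir: Path,
    cancelled: Callable[[], bool],
    converged: Callable[[list], bool],
) -> str:
    """Hypothetical search loop that never leaves a stale history file."""
    state = {"status": "running", "iterations": []}
    history_path = out_dir / "search_history.json"

    def _flush_history(status: str = "running") -> None:
        # Single write path shared by every exit and by each iteration.
        state["status"] = status
        history_path.write_text(json.dumps(state, indent=2))

    for index, point in enumerate(points):
        if cancelled():
            _flush_history("cancelled")  # cancel path writes before returning
            return "cancelled"
        state["iterations"].append({"iteration": index, "point": point})
        _flush_history()  # per-iteration path
        if converged(state["iterations"]):
            _flush_history("converged")  # converged path
            return "converged"
    _flush_history("exhausted")
    return "exhausted"
```

Routing all three paths through one closure means a new exit path cannot forget the write without it being obvious in review.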
…kage
Reviewer feedback flagged the leading-underscore-at-package-root layout
as unusual. Convert `cli_runner.py` into a package and move the three
`_cli_runner_*` helpers plus `_sweep_table_logger.py` in as private
submodules. No behavior change; pure relocation + import rewrites.
Mapping:
- src/aiperf/cli_runner.py -> src/aiperf/cli_runner/__init__.py
- src/aiperf/_cli_runner_helpers.py -> src/aiperf/cli_runner/_helpers.py
- src/aiperf/_cli_runner_post_process.py -> src/aiperf/cli_runner/_post_process.py
- src/aiperf/_cli_runner_sweep_helpers.py -> src/aiperf/cli_runner/_sweep_helpers.py
- src/aiperf/_sweep_table_logger.py -> src/aiperf/cli_runner/_sweep_table_logger.py
All imports in src/, tests/, docs/, and the ruff/ergonomics baselines
updated to match. Pre-commit and the full 11,736-test unit suite pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
`cli_runner/__init__.py` was 790 lines (grandfathered above the 500-line
ergonomics cap). Split the obvious independent concerns into siblings;
__init__.py keeps the orchestrating entry points + the helpers that need
to share the module namespace with test mock targets.
New layout:
- _callbacks.py CompletedRun, OnComplete, _invoke_callbacks
- _preflight.py _preflight_artifact_dir/_fd_limit/_endpoint_ready
- _process_setup.py mp start method, log queue, FD_CLOEXEC, tokenizer preload
- _single_run.py _run_single_benchmark
- _failure_summary.py _log_failed_sweep_variations
- __init__.py run_benchmark + multi-run orchestration + helpers
__init__.py drops 790 -> 409 lines, falling out of the ergonomics
baseline. All eleven `patch("aiperf.cli_runner.<name>")` targets remain
reachable through __init__.py imports. test_cli_runner_macos.py imports
the _process_setup helpers from their new path directly.
11,736 unit tests pass; all pre-commit hooks pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Previous layout was organized by 'what kind of helper' (preflight,
callbacks, process-setup) with two grab-bag files (_helpers.py,
_sweep_helpers.py) and an __init__.py that mixed dispatch with
multi-run orchestration. This pass organizes by the package's three
real domains: execution, aggregation, display.
Deleted:
_helpers.py distributed across _strategy/_aggregate/_banner
_failure_summary.py folded into _multi_run.py (its only caller)
Renamed:
_sweep_helpers.py -> _sweep_aggregate.py
_sweep_table_logger.py -> _sweep_table.py
Created:
_strategy.py build_strategy, _build_convergence_criterion,
_build_search_planner, validate_convergence_config
_aggregate.py aggregate_and_export, print_aggregate_summary,
_maybe_compute_detailed, priority-metric block printers
_banner.py log_multi_run_banner, _log_search_planner_active
_multi_run.py _run_multi_benchmark + _execute_multi_benchmark +
_summarize_and_export + _estimate_and_log_duration +
_validate_multi_benchmark_plan +
_reject_in_process_sweep_under_operator +
_log_failed_sweep_variations (was _failure_summary)
__init__.py drops 415 -> 120 lines and now contains only the public
surface (run_benchmark, CompletedRun, OnComplete, _make_benchmark_run)
plus re-imports of the run_benchmark-layer patch targets (_preflight_*,
_run_single_benchmark, _run_multi_benchmark) so existing dispatch
tests keep working without touching their patches.
Test patches for multi-run internals (aggregate_and_export,
_estimate_and_log_duration, _summarize_and_export, build_strategy,
_build_search_planner, _log_search_planner_active) now correctly target
the call site (`aiperf.cli_runner._multi_run.<name>`) following the
standard 'patch where it's looked up' rule.
ruff_baseline.json and ergonomics_baseline.json updated for the new
paths; docs/troubleshooting/sweeps.md and docs/dev/sweep-orchestrator.md
updated likewise.
11,736 unit tests pass; all pre-commit hooks pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
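The 'patch where it's looked up' rule mentioned in the commit above can be demonstrated with two throwaway in-memory modules. The module names `sketch_strategy` and `sketch_multi_run` are hypothetical stand-ins for a defining module and a calling module; only `unittest.mock.patch` is real API here.

```python
import sys
import types
from unittest.mock import patch


def _make_module(name: str, source: str) -> types.ModuleType:
    """Build a module from source text and register it in sys.modules."""
    module = types.ModuleType(name)
    sys.modules[name] = module
    exec(source, module.__dict__)
    return module


# Module that defines the function.
_make_module("sketch_strategy", "def build_strategy():\n    return 'real'\n")

# Module that imports the name into its own namespace and calls it.
caller = _make_module(
    "sketch_multi_run",
    "from sketch_strategy import build_strategy\n\n"
    "def run():\n    return build_strategy()\n",
)


def demo() -> tuple:
    # Patching the defining module leaves the caller's own reference intact.
    with patch("sketch_strategy.build_strategy", return_value="fake"):
        patched_at_definition = caller.run()
    # Patching the caller's namespace is what the call site actually sees.
    with patch("sketch_multi_run.build_strategy", return_value="fake"):
        patched_at_lookup = caller.run()
    return patched_at_definition, patched_at_lookup
```

Here `demo()` returns `("real", "fake")`: only the second patch affects `run()`, which is why moving a call site between modules also requires moving the patch target.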
Move pareto-axes resolution and per-cell pareto projection out of
_sweep_aggregate.py into a dedicated _pareto.py:
_resolve_pareto_axes plugin-registry recipe lookup for pareto_axes
_extract_axis_value pull axis value from per-cell stats with fallbacks
_aggregate_one_cell project one variation's runs into a Pareto cell
_sweep_aggregate.py drops 775 -> 643 lines and now contains only the
per-variation + sweep-wide aggregation pipelines (no per-cell pareto
math). Lazy import inside _aggregate_one_cell avoids the cycle with
_sweep_aggregate's top-level _resolve_pareto_axes import.
External callers updated:
- aiperf.orchestrator.orchestrator._fire_cell_callback
- aiperf.cli_runner._sweep_table.SweepTableLogger
- tests/unit/test_aggregate_one_cell.py
- ruff_baseline.json (BLE001 entry path)
Also strip refactor-provenance docstrings and comments across the
codebase ("Lifted out of X", "Extracted from Y to keep that module
under the 500-line cap", "Factored out of Z so the helper exists",
etc.). The git history is the right place for that information; in
the code it's noise. Touched ~20 files in cli_runner/, search_recipes/,
config/, orchestrator/, plugin/.
11,736 unit tests pass; all pre-commit hooks pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
`_invoke_callbacks` and `_print_aggregate_summary` were re-exported at the package surface solely so tests could import them at `aiperf.cli_runner.<name>`. They have no production callers outside the package. Move both off the public surface; update the two test files that imported them to read from `_callbacks` and `_aggregate` directly. `_print_aggregate_summary` was also an unnecessary rename of `print_aggregate_summary` from `_aggregate` (the underscore prefix was left over from when everything lived in a single `cli_runner.py`); tests now use the real name. 11,736 unit tests pass; all pre-commit hooks pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Drive GPU telemetry collector setup through a shared candidate loop so DCGM and local collectors follow the same probe, baseline, and status flow while plugin metadata only handles config-time local classification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Three concrete fixes:
1. `__init__.py` `__all__` listed `_run_single_benchmark` and
`_run_multi_benchmark` alongside public names. Underscore-prefix says
"private" and __all__ says "public"; pick one. Both stay importable
for tests; they're just no longer advertised as public API.
2. `_multi_run.py` had a `_ = (CompletedRun, OnComplete)` line that
claimed to suppress an unused-import warning. Both names are actually
used (CompletedRun constructor at line 104, OnComplete in two function
signatures). The line was a leftover; remove. The trailing
`__all__ = ["_run_multi_benchmark"]` also went: Python's default
"names without leading underscore are public" makes it redundant here.
3. `__init__.py` docstring listed every helper submodule by name. Each
module has its own docstring; this index was just maintenance burden.
Trim to the actual public surface.
No behavior change; no test patches updated. 11,736 unit tests pass;
all pre-commit hooks pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Three private helpers had Any-typed parameters where the call-site type
is statically known:
- _log_search_planner_active(search_planner: Any, logger: Any) in
_banner.py becomes (SearchPlanner | None, AIPerfLogger). The caller in
_multi_run._execute_multi_benchmark already passes those types.
- _print_metric_block(metric, ...) in _aggregate.py becomes
(metric: Any, ...) so the function has a complete signature (the
upstream AggregateResult.metrics dict is dict[str, Any], so Any is the
honest type here).
- _aggregate_one_cell(cell_results: list[Any], plan: Any, variation: Any)
in _pareto.py becomes (list[RunResult], BenchmarkPlan, SweepVariation).
Both callers (orchestrator._fire_cell_callback and
aggregate_sweep_and_export) already pass those types.
No behavior change; pure annotation tightening.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Drop the underscore prefix on a function that is genuinely shared across packages: cli_runner.run_benchmark, cli_commands.service, and orchestrator.orchestrator (its own caller) all use it. Reaching into another module for a leading-underscore "private" function is a smell; the public name matches its actual usage. Touches every call site (3 prod paths) plus 3 docstring/comment references in tests and config/loader/plan.py. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
debermudez approved these changes May 15, 2026
debermudez (Contributor) left a comment
Good to go once it's rebased on main. Very nice work!
…ONL routing
- OTel: `--stream` now accepts a list; add `--otel-resource-attributes`
with key=value parsing
- MLflow: rename to singular `--mlflow-tag` / `--mlflow-artifact-glob`;
parse tags into a dict; surface schema-1.1 `count`/`sum` size fields in
the exporter and the sweep aggregator
- Accuracy: `--accuracy-n-shots` becomes `int | None` (cap 32, defers to
benchmark default); `--accuracy-enable-cot` becomes tri-state; tasks
accept comma-separated lists
- Endpoint: prepend `http://` to schemeless URLs via AfterValidator;
timeout now reads from `EndpointDefaults.TIMEOUT`
- Dataset: wire `DAG_JSONL` through the resolver and custom composer
format map
- Server metrics: drop `parquet` from default formats
- Records pipeline: `OutputsJsonRecordProcessor` takes `run`
(BenchmarkRun) instead of bare `cfg`; `RecordsManager` reads
`self.run.cfg.otel`
- Schema: regenerate `aiperf-config.schema.json` with otel/mlflow
sections and `image_edit` endpoint enum
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
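The key=value parsing mentioned for `--otel-resource-attributes` and the MLflow tags could look something like the sketch below. `parse_key_value_pairs` is a hypothetical helper; splitting on the first `=` and rejecting empty keys are assumptions, not confirmed behavior of the real converter.

```python
def parse_key_value_pairs(items: list[str]) -> dict[str, str]:
    """Turn ["k=v", ...] into {"k": "v", ...} (hypothetical helper)."""
    result: dict[str, str] = {}
    for item in items:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        result[key] = value.strip()
    return result
```

For example, `parse_key_value_pairs(["service.name=aiperf", "team=perf"])` yields a two-entry dict, while an item without `=` raises `ValueError` rather than being silently dropped.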
The bulk of the v1 user_config.py changes from #941 (plugin-driven
local-collector keyword detection, validate_environment on collector
classes, generic conflict-detection wording) are already pre-migrated
into this branch's new src/aiperf/config/ package; see
GpuTelemetryConfig.validate_collector_compatibility() in
src/aiperf/config/gpu_telemetry.py and the warning hook in
src/aiperf/config/flags/_converter_telemetry.py.
Manual resolutions:
- src/aiperf/common/config/user_config.py: deleted (modify side from
main is already represented in the new config/ package).
- src/aiperf/gpu_telemetry/manager.py: adopt main's
_collector_candidates / _configure_reachable_collectors /
_capture_collector_baseline plugin-dispatch design; keep branch's
BenchmarkRun-driven constructor and gpu_telemetry_cfg.* access pattern;
drop the legacy _configure_pynvml_collector /
_configure_amdsmi_collector / _configure_dcgm_collectors helpers.
- tests/unit/common/config/test_user_config.py: take branch (legacy v1
CLIConfig smoke tests only; UserConfig-validator tests from main no
longer apply since the v1 module is gone).
- tests/unit/gpu_telemetry/test_telemetry_manager.py: keep both imports
(make_run_from_cli from branch, mock_plugin from main).
- src/aiperf/plugin/schema/{schemas.py,plugins.schema.json,plugins.py}:
keep branch's orchestrator metadata + main's
GPUTelemetryCollectorMetadata side-by-side.
- tools/ergonomics_baseline.json: union of both file-size entries.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`normalize_http_url` checked only for ``://`` to decide whether to
prepend ``http://``, so a bare ``scheme:opaque`` form like
``javascript:alert(1)`` or ``data:text/plain;base64,xyz`` was
silently rewritten to ``http://javascript:alert(1)`` and then either:
- rejected with the wrong message ("invalid port" instead of
"missing scheme or host"), or
- SILENTLY ACCEPTED when the opaque part happened to be all digits —
``javascript:1234`` became ``http://javascript:1234`` with
host=``javascript``, port=1234, a real validation bypass.
Leave the URL alone when the colon prefix is a recognized foreign URI
scheme (javascript, data, file, ftp, ftps, sftp, ssh, gopher, ldap[s],
mailto, tel, vbscript, ws, wss) so the downstream EndpointConfig
validator can reject it as "missing scheme or host". ``localhost:8000``
and ``host:port`` shorthand still work because they don't match a
known foreign scheme.
Fixes the canary test
``tests/unit/transports/test_build_url_adversarial.py::
test_endpoint_validator_rejects_garbage[javascript-scheme]``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
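The normalization rule described in the commit above can be sketched as follows. This is an illustration under stated assumptions, not the actual `normalize_http_url`: the function name carries a `_sketch` suffix, the scheme list is copied from the commit message, and the case-insensitive comparison is an assumption.

```python
# Foreign URI schemes listed in the commit message.
FOREIGN_SCHEMES = frozenset({
    "javascript", "data", "file", "ftp", "ftps", "sftp", "ssh", "gopher",
    "ldap", "ldaps", "mailto", "tel", "vbscript", "ws", "wss",
})


def normalize_http_url_sketch(url: str) -> str:
    """Prepend http:// only when the input looks like host or host:port."""
    if "://" in url:
        return url  # already has an explicit scheme
    prefix, sep, _rest = url.partition(":")
    if sep and prefix.lower() in FOREIGN_SCHEMES:
        # Leave "scheme:opaque" untouched so a downstream validator can
        # reject it, instead of rewriting it into http://scheme:opaque.
        return url
    return f"http://{url}"
```

Under this sketch `localhost:8000` becomes `http://localhost:8000`, while `javascript:1234` passes through unchanged and is left for the endpoint validator to reject.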
The new DatasetResolver had two bugs that wedged sagemaker_data_capture
runs with --fixed-schedule before the loader ever ran:
- _check_timing_data only checked top-level `timestamp`/`delay` keys, so
sagemaker records (timing under `eventMetadata.inferenceTime`) were
flagged as having no timing data. Add a per-type branch and pass
`dataset_type` through.
- _resolve_one skipped structural auto-detection whenever `ds.format`
was truthy, but Pydantic defaults `format` to SINGLE_TURN. Result: the
resolver pre-validated against SINGLE_TURN when the user relied on
auto-detect. Use `model_fields_set` so detection runs unless the user
explicitly set `format`, matching how the composer infers type at load
time.
Fixes the 5 sagemaker integration tests in
tests/integration/test_sagemaker_data_capture.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
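The per-type timing check from the first bullet might be sketched like this. The key names `timestamp`, `delay`, and `eventMetadata.inferenceTime` come from the commit message; the function name, the string used for the dataset type, and the exact dispatch are assumptions.

```python
from typing import Any

# Assumption: the dataset type is identified by this string.
SAGEMAKER_TYPE = "sagemaker_data_capture"


def has_timing_data(record: dict[str, Any], dataset_type: str) -> bool:
    """Return True if the record carries usable timing information."""
    if dataset_type == SAGEMAKER_TYPE:
        # SageMaker capture records nest timing under eventMetadata.
        meta = record.get("eventMetadata")
        return isinstance(meta, dict) and "inferenceTime" in meta
    # Default: timing lives in top-level keys.
    return "timestamp" in record or "delay" in record
```

Without the branch, a record such as `{"eventMetadata": {"inferenceTime": "..."}}` would be reported as having no timing data, which is the failure the commit describes.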
Ports tools/generate_config_schema.py from ajc/k8s-rework so the
hand-maintained aiperf-config.schema.json can be regenerated from the
Pydantic models. Adds `make generate-config-schema` /
`make check-config-schema` targets and a `generate-config-schema`
pre-commit hook that regenerates on AIPerfConfig changes.
Also folds in stray cleanups picked up while touching this area:
- CustomDatasetComposer._format_to_loader_type: replace hand-maintained
dict with direct CustomDatasetType(fmt.value); both enums mirror the
custom_dataset_loader plugin registry and share string values.
- test_dag_timing_pathology: fix British→US spellings flagged by
codespell.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
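The enum-by-value conversion in the first cleanup bullet can be illustrated with two small enums. Both enum classes and their members are hypothetical stand-ins; the only point being shown is that `Enum(value)` lookup replaces a hand-maintained mapping when two enums share string values.

```python
from enum import Enum


class DatasetFormatSketch(str, Enum):
    """Hypothetical config-side enum."""

    SINGLE_TURN = "single_turn"
    MULTI_TURN = "multi_turn"


class CustomDatasetTypeSketch(str, Enum):
    """Hypothetical loader-side enum sharing the same string values."""

    SINGLE_TURN = "single_turn"
    MULTI_TURN = "multi_turn"


def format_to_loader_type(fmt: DatasetFormatSketch) -> CustomDatasetTypeSketch:
    # Lookup by value; raises ValueError if the two enums ever drift apart,
    # which a silently stale dict would not.
    return CustomDatasetTypeSketch(fmt.value)
```

The trade-off is that the two enums must stay in sync by value; a test that compares their value sets would make that invariant explicit.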
Brings in 6 main PRs: #710 (--image-source flag, multimodal validator
hardening), #945 (--session-header), #884 (Agentic Code dataset docs),
#826 (UTF-8 for text/JSON file reads), #942 (per-request `extra`
payload), #943 (agentic coding dataset files).
The v1 ``src/aiperf/common/config/`` modules were deleted on this branch
as part of the config v2 restructure, so the corresponding modify/delete
conflicts keep the deletes and the relevant main-side features are
ported into the v2 surface:
- ``--image-source`` (PR #710): adds an ``ImageSource | Path`` field with
``BeforeValidator`` coercion to
``aiperf.config.dataset.content.ImageConfig`` plus an
``images_enabled()`` helper. ``ImageGenerator`` dispatches per source
mode (ASSETS, NOISE, custom Path) matching main's behavior.
- ``--session-header`` (PR #945): adds the field to ``EndpointConfig``,
routes it through ``_converter_endpoint`` / ``_section_fields`` /
``CLIConfig`` so the flag round-trips into
``EndpointInfo.session_header`` (already wired through
``base_transports``).
UTF-8 fix from PR #826 is applied to
``BaseFileLoader._iter_record_dicts`` so the encoding fix flows through
every loader that uses the helper.
Test conflicts:
- ``tests/unit/common/config/test_user_config.py``: kept the v2-flavor
smoke tests (the v1 alternate body tested deleted v1 modules and was
not portable).
- ``tests/unit/dataset/generator/test_image_generator.py``: kept the v2
``make_image_config`` helper, added a ``source`` knob, aliased the v1
``ImageWidthConfig`` / ``ImageHeightConfig`` to ``NormalDistribution``
so PR #710's new test classes (Noise mode, custom directory, disabled)
read naturally, and added ``batch_size=1`` to the ImageConfig sites
those tests construct directly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d --image-source Doc drift from prior commits on the branch; regenerated by the generate-cli-docs pre-commit hook. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
… env-var/Jinja in FixedDistribution shorthand
`${VAR}` whole-string substitutions now coerce to bool/int/float
using the same rules as Jinja, so `isl: ${AIPERF_TEST_ISL}` resolves
to a numeric distribution scalar. The schema generator emits matching
string-pattern branches under `FixedDistribution` so YAMLs using these
forms validate cleanly.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
The strict-xfail sanity test for Responses-endpoint `max_tokens=0` emission was a tripwire that no longer adds signal — the None-check semantics are exercised by the surrounding positive tests. Removing the inverted assertion to keep the suite focused on direct behavioural assertions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
for more information, see https://pre-commit.ci
`tests/unit/config/test_config_schema_generator_integration.py` imports `jsonschema.Draft202012Validator` but the dep was never declared. It was only resolved transitively in local envs, so CI's test-imports check fails with `ModuleNotFoundError: No module named 'jsonschema'`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
saturley-hall added a commit that referenced this pull request May 19, 2026
PR #912 (commit 94a9102) rewrote tokenizer_validator.py and introduced a ProcessPoolExecutor-based HF cache prefetch in validate_tokenizer_early. The skip-prefetch gate ANDed HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE, but the component_integration conftest only set HF_HUB_OFFLINE, so the gate never engaged and the prefetch subprocesses ran.

Those subprocesses bypass the in-process Tokenizer.from_pretrained patch (subprocess re-imports HF cleanly) and try to write to the real HF cache. In restricted environments (Linux CI containers, sandboxes) or under concurrent test execution that races on the cache directory, the write fails with EPERM and aiperf aborts with "Configuration resolution failed: [Errno 1] Operation not permitted" -- which surfaced as `request_count == 0` in every component_integration test that runs `aiperf profile ...` on Linux CI (ubuntu-latest and the new builder pool). This has been the cause of run-unit-tests failures on main since 2026-05-15.

- src/aiperf/common/tokenizer_validator.py: change the skip-prefetch gate from AND to OR. Either env var being set is enough -- both mean "I have a warm cache, do not touch the network/disk." Requiring both was overly conservative and is what masked the test-harness bug.
- tests/component_integration/conftest.py: hf_offline_mode fixture now sets both HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE so the prefetch skip-gate engages even on older aiperf builds where the gate is still ANDed. Restoring prior values for both on teardown.

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
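The AND-to-OR change can be illustrated with a short Python sketch. The function names and the set of values treated as "truthy" are assumptions made for this example, not the contents of `tokenizer_validator.py`.

```python
import os

_OFFLINE_VARS = ("HF_HUB_OFFLINE", "TRANSFORMERS_OFFLINE")
# Assumed truthy spellings; the real parser may accept a different set.
_TRUTHY = {"1", "true", "yes", "on"}


def _is_set(env, name: str) -> bool:
    """True when the named variable holds one of the assumed truthy values."""
    return env.get(name, "").strip().lower() in _TRUTHY


def skip_prefetch_and(env=None) -> bool:
    """Old gate: skips the prefetch only when BOTH variables are set."""
    env = os.environ if env is None else env
    return all(_is_set(env, name) for name in _OFFLINE_VARS)


def skip_prefetch_or(env=None) -> bool:
    """New gate: skips the prefetch when EITHER variable is set."""
    env = os.environ if env is None else env
    return any(_is_set(env, name) for name in _OFFLINE_VARS)
```

With only `HF_HUB_OFFLINE=1` in the environment (the state the old conftest produced), the AND gate returns `False` and the prefetch runs, while the OR gate returns `True` and the prefetch is skipped.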
debermudez added a commit that referenced this pull request May 19, 2026
…ked (AIP-877)

Adds HellaSwag commonsense reasoning benchmark + the strict ExactMatchGrader that DeepEval's ``Scorer.exact_match_score`` uses. Both align byte-for-byte with the trt-llm benchmark recipe's DeepEval-backed configuration.

**HellaSwagBenchmark** (`src/aiperf/accuracy/benchmarks/hellaswag.py`)

- Loads ``Rowan/hellaswag``: validation split filtered per task by ``activity_label``, train split feeds the "one few-shot per unique activity_label" rule (mirrors ``deepeval.benchmarks.HellaSwag``'s ``categories_seen`` dedupe loop).
- Prompt rendering delegates to ``deepeval.benchmarks.hellaswag.template.HellaSwagTemplate.generate_output``; output is byte-equal to what the trt-llm recipe ships.
- Defaults: ``n_shots=10`` (DeepEval cap is 15), ``generation_size=5``, ``default_grader=exact_match``.
- Ground truth: bare ``A``/``B``/``C``/``D`` letter (DeepEval's convention for ``Scorer.exact_match_score``).
- ``_resolve_tasks`` matches activity labels case-insensitively via a lowercased-value map; falls back to upper-snake-case enum name (``HellaSwagTask.APPLYING_SUNSCREEN`` form) for the recipe's ``getattr(HellaSwagTask, name.upper(), None)`` parity.

**ExactMatchGrader** (`src/aiperf/accuracy/graders/exact_match.py`)

- Strict ``pred.strip() == gold.strip()`` semantics matching DeepEval's ``Scorer.exact_match_score``: case-sensitive, no normalization, empty response → ``unparsed=True``.
- Used by HellaSwag and (in AIP-878) BigBench-Hard for reference parity.

**Plugin registration** (`src/aiperf/plugin/plugins.yaml`)

- ``hellaswag`` → ``default_grader: exact_match``, ``default_n_shots: 10`` with the DeepEval-backed description.
- ``exact_match`` → strict-equality description; drops the ``is_implemented: false`` flag.

**Dependencies** (`pyproject.toml`)

- Adds ``deepeval>=2.9.0`` to the ``[accuracy]`` optional-dependency group. Aiperf calls DeepEval's bundled prompt template directly so the dep is required for HellaSwag (and BigBench-Hard in AIP-878).

**Tests** (`tests/unit/accuracy/`)

- ``test_hellaswag_benchmark.py``: ~22 tests covering DeepEval prompt byte-equality, the unique-activity-label shots set rule, validation filtering, task resolution (exact, lower, upper, mixed case), and pathological dataset rows (empty validation, unlabeled rows).
- ``test_exact_match_grader.py``: strict-equality semantics including the empty-response → ``unparsed=True`` path and case-sensitivity.
- ``test_accuracy_config.py``: drops ``hellaswag`` from ``STUB_BENCHMARKS`` and ``exact_match`` from ``STUB_GRADERS``; the uppercase-stub test now uses ``BIGBENCH`` (a still-stub name).

**Docs** (`docs/accuracy/`)

- ``accuracy-benchmarking.md``: add HellaSwag row to the benchmarks table.
- ``accuracy_stubs.md``: status summary + move HellaSwag from "Still Stubbed" to "Implemented"; move ExactMatchGrader to "Implemented".

**Constructor signature**

Loader + grader use the v2 ``BenchmarkRun`` API (post-#912 refactor on main) rather than the legacy ``UserConfig`` shape; this matches how ``MMLUBenchmark`` and ``AIMEBenchmark`` are wired on current main.

Validation:

- 70/70 accuracy tests pass (HellaSwag + ExactMatch + AccuracyConfig).
- Ruff format + ruff check clean on all modified Python files.
- Codespell clean (v2.4.2, matches CI).
- HellaSwag prompts verified byte-equal against ``HellaSwagTemplate.generate_output`` on synthetic fixtures.

Reference:

- ``deepeval/benchmarks/hellaswag/hellaswag.py``
- ``deepeval/benchmarks/hellaswag/template.py``
- ``deepeval/scorer/scorer.py:Scorer.exact_match_score``
- ``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:319-336``

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
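A rough Python sketch of the two behaviours this commit message spells out: strict exact-match grading and case-insensitive task resolution with an enum-name fallback. `GradeResult`, `grade_exact_match`, `DemoTask`, and `resolve_task` are hypothetical stand-ins rather than the aiperf or DeepEval types, `WASHING_HANDS` is an invented member used only for illustration, and treating a whitespace-only response as empty is an assumption.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


@dataclass(frozen=True)
class GradeResult:  # hypothetical result type
    correct: bool
    unparsed: bool


def grade_exact_match(response: str, ground_truth: str) -> GradeResult:
    """Strict pred.strip() == gold.strip(): case-sensitive, no normalization."""
    pred = response.strip()
    if not pred:
        # Empty response is reported as unparsed (whitespace-only is assumed empty).
        return GradeResult(correct=False, unparsed=True)
    return GradeResult(correct=pred == ground_truth.strip(), unparsed=False)


class DemoTask(Enum):  # hypothetical stand-in for HellaSwagTask
    APPLYING_SUNSCREEN = "Applying sunscreen"
    WASHING_HANDS = "Washing hands"


# Lowercased-value map for case-insensitive activity-label lookup.
_BY_LOWER_VALUE = {task.value.lower(): task for task in DemoTask}


def resolve_task(name: str) -> Optional[DemoTask]:
    """Match by label ignoring case, then fall back to the enum member name."""
    task = _BY_LOWER_VALUE.get(name.lower())
    if task is not None:
        return task
    return getattr(DemoTask, name.upper(), None)
```

In this sketch a response of `" A\n"` grades as correct against `"A"`, `"a"` does not, and `"applying_sunscreen"` resolves through the enum-name fallback.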
debermudez added a commit that referenced this pull request May 22, 2026
Implements the BigBench-Hard accuracy benchmark by delegating prompt rendering to ``deepeval.benchmarks.BigBenchHard``'s ``BigBenchHardTemplate.generate_output``. Output is byte-equal to the trt-llm benchmark recipe's DeepEval-backed configuration so reference parity is preserved end-to-end. Pairs with the existing ``ExactMatchGrader`` (landed via AIP-877) for the recipe's strict ``Scorer.exact_match_score`` semantics.

Loader uses the new ``BenchmarkRun`` constructor signature introduced by PR #912 (no ``UserConfig``), and the test fixture wires through the ``make_benchmark_run`` conftest helper. ``deepeval`` is already pinned in the ``[accuracy]`` extras via AIP-877; the test guards on ``pytest.importorskip("deepeval")`` so the suite still runs without the optional install.

Drops ``bigbench`` from ``STUB_BENCHMARKS``, removes ``is_implemented: false`` from the ``plugins.yaml`` entry, and updates the accuracy docs to reflect the new implemented status. The uppercase-stub validator test now exercises ``LCB_CODEGENERATION`` since ``BIGBENCH`` is no longer stubbed.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
debermudez added a commit that referenced this pull request May 27, 2026
Implement ``AIME24Benchmark`` to mirror the trt-llm benchmark recipe's
``acc_bench_lighteval.py`` configuration for AIME 2024:

    aime24 = LightevalTaskConfig(
        name="aime24",
        prompt_function=aime_prompt_fn,
        hf_repo="HuggingFaceH4/aime_2024",
        evaluation_splits=["train"],
        few_shots_split=None,
        few_shots_select=None,
        generation_size=32768,
        metric=[expr_gold_metric],
    )

The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is
the bare problem text — lighteval's prompt manager wraps it as a
single user message with no instruction prefix and no few-shot
priming. The loader emits prompts the same way: one
``BenchmarkProblem`` per dataset row, ``prompt`` = the bare
``problem`` field, ``ground_truth`` = ``str(answer)``,
``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` /
``enable_cot`` arguments are accepted for protocol uniformity but
ignored (any of them changing the prompt would diverge from the
reference). Pair with ``LightevalExprGrader`` for the recipe's
``expr_gold_metric`` extraction.
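
The row-to-problem mapping described above might look roughly like the following Python sketch. `BenchmarkProblem` here is a simplified stand-in dataclass and `rows_to_problems` is a hypothetical helper; the real loader class and its signature are not shown in this message.

```python
from dataclasses import dataclass, field

GENERATION_SIZE = 32768  # taken from the recipe's LightevalTaskConfig above


@dataclass
class BenchmarkProblem:  # simplified stand-in for aiperf's type
    prompt: str
    ground_truth: str
    metadata: dict = field(default_factory=dict)


def rows_to_problems(rows, tasks=None, n_shots=0, enable_cot=False):
    """Emit one problem per dataset row with the bare problem text as prompt.

    tasks / n_shots / enable_cot are accepted for protocol uniformity but
    ignored, so the prompt never diverges from the reference.
    """
    return [
        BenchmarkProblem(
            prompt=row["problem"],
            ground_truth=str(row["answer"]),
            metadata={"generation_size": GENERATION_SIZE},
        )
        for row in rows
    ]
```

Passing `n_shots=5` or `enable_cot=True` to this sketch yields the same problems as the defaults, which mirrors the "accepted but ignored" behaviour.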
Built on the v2 ``BenchmarkRun`` API (post-PR-#912) and on the AIP-878
test harness conventions: ``make_benchmark_run`` for fixtures,
``BenchmarkProblem``-driven assertions, ``patch`` on
``aime24.load_dataset`` for deterministic rows. The loader has no
heavy optional dependency (``datasets`` is a core dep), so no
fake-harness is needed; CI gets 100% line + branch coverage out of
the box.
Updates the stub registry: drop ``aime24`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop the ``is_implemented:
false`` flag from the ``aime24`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_expr``, add an ``aime24`` row to
``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still
Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the
Status Summary, Method Count Summary, and Suggested Implementation
Order sections accordingly).
Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
Summary

- Orchestrator: new executor abstraction (`orchestrator/executor.py`, `local_executor.py`) and a `search_planner/` package: Bayesian (BoTorch GP/DSP kernel), Optuna (multi-objective), Monotonic, and Smooth-Isotonic planners, with shared helpers for cliff detection, margin normalization, replicate budget, pooled percentile, and SLA constraints. Aggregation gains SLA filtering and multi-objective search-history export.
- YAML configuration (`src/aiperf/config/`) replaces the old `common/config/` package. Layered design: typed models, loader (Jinja2 templating, env-var interpolation, dotted-path overrides, duration parsing, strict-undefined plan), flags converter (CLI ↔ YAML), resolution layer, sweep DSL (grid, QMC/Sobol, adaptive, multi-run, distributions), public JSON schema, plus a bundled library of 20+ ready-to-run templates and reference trace data.
- Search recipes (`src/aiperf/search_recipes/`) with built-ins: `max_concurrency_under_sla`, `max_goodput_under_slo`, `sla_breach_knee`, `itl_surface_fit`, `ttft_curve_fit`, and Pareto sweep (axes, dominance, export, parser) plus post-process hooks.
- …`_cli_runner_{helpers,sweep_helpers,post_process}.py` + `_sweep_table_logger.py`; new `aiperf config` command for template discovery and validation.
- Auto-plot (`plot/auto_plot.py`) materializes the resolved plot config into the artifact dir so `aiperf plot <dir>` reproduces it. `PlotEnvelopeConfig` lets one YAML own its visualization.
- …`common/finite.py` (`FiniteFloat`, `scrub_non_finite`, `nan_safe_mean/std`, `is_finite_value`); property-test corpus with ratcheted baselines in `tests/unit/property/`.
- …`tests/unit/{config,orchestrator,search_recipes,search_planner,cli_runner,property}/` and adversarial chaos scripts under `tests/scripts/chaos/`.

766 files changed (+111246, -24748). Full design write-up at `docs/dev/sweep-orchestrator.md`.

Architecture
One pipeline at every cardinality
A single benchmark, a multi-run for confidence intervals, a grid/scenarios sweep, a Sobol/LHS characterization, an adaptive BO search, and (coming soon) a cluster-distributed BO search are seven cardinalities of one pipeline. `BenchmarkPlan` describes what to run, `MultiRunOrchestrator` decides when and in what order, an optional `SearchPlanner` decides what to try next, and a `RunExecutor` decides how to actually run one cell.

    %%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'htmlLabels': true}, 'themeVariables': {'fontSize': '14px'}}}%%
    flowchart LR
        cfg["**config**<br/>AIPerfConfig"]
        exp["**expand**<br/>into N variations<br/>(BenchmarkPlan)"]
        run["**run**<br/>each variation<br/>M trials<br/>(via RunExecutor)"]
        agg["**aggregate**<br/>SweepAnalyzer<br/>-> sweep_aggregate/"]
        cfg --> exp --> run --> agg
        subgraph BACKENDS["RunExecutor backend (swap point)"]
            local["LocalSubprocessExecutor<br/>(today)"]
            k8s["K8sChildJobExecutor<br/><i>coming soon</i>"]
        end
        run -. selects .-> local
        run -. selects .-> k8s
        classDef stage fill:#e3f2fd,stroke:#1565c0,stroke-width:1.5px,color:#0d47a1
        classDef shipping fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px,color:#1b5e20
        classDef coming fill:#fafafa,stroke:#9e9e9e,stroke-width:1.5px,stroke-dasharray:5 3,color:#616161
        class cfg,exp,run,agg stage
        class local shipping
        class k8s coming
        style BACKENDS fill:transparent,stroke:#78909c,stroke-width:2px,stroke-dasharray:2 2

Search recipes → AdaptiveSearchSweep → planner
A user can author an `AdaptiveSearchSweep` directly under `sweep:` (low level) or pick a `search_recipe` plugin (high level) that builds one from a recipe + the user's existing benchmark config.

    %%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'htmlLabels': true}, 'themeVariables': {'fontSize': '14px'}}}%%
    flowchart TB
        subgraph IN["user inputs"]
            cli["**--search-recipe NAME --param k=v**<br/>or<br/>**sweep: { type: adaptive_search, … }** in YAML<br/>or<br/>--search-space PATH:LO,HI[:KIND]<br/>--search-metric METRIC<br/>--search-stat STAT<br/>--search-direction DIRECTION<br/>--search-sla metric:stat:op:threshold (×N)"]
            uc["**AIPerfConfig.benchmark**<br/><i>(models, endpoint, phases, …)</i>"]
        end
        subgraph RECIPE["recipe layer (optional)"]
            ctx["**SearchRecipeContext**<br/><i>(benchmark_config, sla_targets,<br/>sweep_overrides)</i>"]
            rc["**SearchRecipe** plugin (Protocol)<br/><i>built-ins:</i><br/>max-throughput-ttft-sla<br/>max-throughput-itl-sla<br/>concurrency-ramp<br/>prefill-ttft-curve / decode-itl-curve<br/>max-goodput-under-slo<br/>max-concurrency-under-sla<br/>pareto-sweep"]
            out["**SearchRecipeOutput**<br/><i>(exactly one of:<br/>adaptive_search | sweep_parameters | scenarios)</i><br/>+ sla_filters, slos, post_process"]
        end
        subgraph CFG["adaptive sweep variant"]
            asc["**AdaptiveSearchSweep**<br/><i>(SweepConfig variant,<br/>type=adaptive_search)</i><br/>search_space, objectives,<br/>max_iterations, sla_filters,<br/>post_process, planner, …"]
        end
        subgraph DRIVE["runtime drivers"]
            plan["**AIPerfConfig.sweep**<br/>= AdaptiveSearchSweep"]
            plan2["**BenchmarkPlan.sweep**<br/><i>(is_adaptive_search is true)</i>"]
            sp["**SearchPlanner** plugin<br/><i>(BayesianSearchPlanner |<br/>MonotonicSLASearchPlanner |<br/>SmoothIsotonicSLAPlanner |<br/>OptunaSearchPlanner)</i>"]
            pph["**search_recipe_post_process** plugin<br/><i>(degradation_knee_detect, ttft_curve_fit,<br/>itl_surface_fit, sla_breach_knee,<br/>pareto_sweep_export)</i>"]
        end
        cli --> rc
        uc --> ctx --> rc --> out --> asc
        cli -.direct path.-> asc
        asc --> plan
        plan --> plan2
        plan2 --> sp
        plan2 --> pph
        classDef data fill:#e3f2fd,stroke:#1565c0,stroke-width:1.5px,color:#0d47a1
        classDef proc fill:#fff3e0,stroke:#e65100,stroke-width:1.5px,color:#bf360c
        classDef decision fill:#f3e5f5,stroke:#6a1b9a,stroke-width:1.5px,color:#4a148c
        classDef art fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px,color:#1b5e20
        class cli,uc data
        class ctx,rc proc
        class out,asc art
        class plan,plan2,sp,pph decision

Adaptive search: `propose → execute → record` loop

The BO outer loop is a `propose -> execute -> record` cycle inside `MultiRunOrchestrator.execute_adaptive_search`. `BenchmarkRun` and `RunExecutor` are unchanged from the grid path; the difference is that `BenchmarkPlan.configs` starts with one seed config and grows by one per iteration as the planner proposes the next point.

    %%{init: {'sequence': {'mirrorActors': false}, 'themeVariables': {'fontSize': '14px'}}}%%
    sequenceDiagram
        autonumber
        participant Plan as BenchmarkPlan<br/>(sweep is AdaptiveSearchSweep)
        participant Orch as MultiRunOrchestrator
        participant Pl as SearchPlanner<br/>(Bayesian / Monotonic / Optuna)
        participant Run as BenchmarkRun
        participant Exec as RunExecutor
        participant Res as RunResult
        participant PP as PostProcessHandler
        participant Out as search_history.json /<br/>sweep_aggregate
        Orch->>Pl: planner instantiated upstream<br/>(via _build_search_planner)<br/>and passed into execute
        loop until converged or max_iterations
            Orch->>Pl: ask
            Pl-->>Orch: (BenchmarkConfig_k, SweepVariation_k)<br/>or None (converged -> convergence_reason)
            alt got proposal
                Orch->>Orch: _run_independent_cell<br/>(fresh ExecutionStrategy per cell)
                loop trials inner (until strategy says stop)
                    Orch->>Run: BenchmarkRun for cfg_k, variation_k, trial t, …
                    Orch->>Exec: run the cell
                    Exec-->>Res: RunResult
                end
                Orch->>Pl: tell with variation_k, cell_results
                Pl->>Pl: filter by SLAFilter,<br/>compute objective scalar,<br/>plateau / patience / max-iter check
                Orch-->>Out: write_search_history<br/>(incremental, includes<br/>boundary_summary if planner has it)
            end
        end
        Orch->>PP: process the sweep_aggregate with params<br/>(per PostProcessSpec on sweep)
        PP-->>Out: knees, curve fits, …
        Orch-->>Out: profile_export_aiperf_sweep.{json,csv}

Test plan
- `uv run pytest -n auto tests/unit/`
- `uv run pytest -n auto tests/unit/property/` (finite/numeric invariants)
- `uv run pytest -n auto -m component_integration`
- `uv run pytest -n auto -m integration`
- `make validate-plugin-schemas`
- `make generate-all-docs` is idempotent (pre-commit confirms)
- `aiperf run -f src/aiperf/config/templates/minimal.yaml`, `latency_test.yaml`, `sweep_with_plot.yaml`
- `aiperf plot <artifact_dir>` reproduces the auto-plot envelope for a sweep run
- …`max_concurrency_under_sla`) and verify `search_history.jsonl` export

🤖 Generated with Claude Code
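
The `propose -> execute -> record` loop from the Architecture section can be sketched in a few lines of Python. Everything below is illustrative: the `SearchPlanner` protocol shape, the toy `DoublingPlanner`, and `run_adaptive_search` are assumptions made for this example and do not reproduce `MultiRunOrchestrator.execute_adaptive_search` or any shipped planner.

```python
from typing import Callable, Optional, Protocol


class SearchPlanner(Protocol):
    """Assumed ask/tell shape: ask() returns None once the search has converged."""

    def ask(self) -> Optional[dict]: ...

    def tell(self, variation: dict, results: list[float]) -> None: ...


class DoublingPlanner:
    """Toy planner: doubles concurrency until mean latency breaches the SLA."""

    def __init__(self, start: int, sla_ms: float) -> None:
        self._next: Optional[int] = start
        self._sla_ms = sla_ms
        self.best: Optional[int] = None

    def ask(self) -> Optional[dict]:
        return None if self._next is None else {"concurrency": self._next}

    def tell(self, variation: dict, results: list[float]) -> None:
        mean = sum(results) / len(results)
        if mean <= self._sla_ms:
            self.best = variation["concurrency"]
            self._next = variation["concurrency"] * 2
        else:
            self._next = None  # SLA breached: stop proposing


def run_adaptive_search(
    planner: SearchPlanner,
    run_cell: Callable[[dict], list[float]],
    max_iterations: int,
) -> list[dict]:
    """Loop propose -> execute -> record until convergence or max_iterations."""
    history: list[dict] = []
    for _ in range(max_iterations):
        variation = planner.ask()          # propose
        if variation is None:
            break                          # planner reports convergence
        results = run_cell(variation)      # execute all trials for one cell
        planner.tell(variation, results)   # record
        history.append({"variation": variation, "results": results})
    return history
```

In this sketch `run_cell` stands in for the executor: any callable that takes a variation and returns per-trial measurements can be plugged in, which is the same swap point the `RunExecutor` abstraction provides.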