feat: YAML-native v2 config + adaptive sweep orchestrator with BO & search recipes #912

Merged: ajcasagrande merged 53 commits into main from ajc/sweep-orchestrator-port on May 15, 2026

@ajcasagrande ajcasagrande commented May 11, 2026

Contributor

Summary

  • Sweep orchestrator port with a new pluggable executor (orchestrator/executor.py, local_executor.py) and a search_planner/ package: Bayesian (BoTorch GP/DSP kernel), Optuna (multi-objective), Monotonic, and Smooth-Isotonic planners, with shared helpers for cliff detection, margin normalization, replicate budget, pooled percentile, and SLA constraints. Aggregation gains SLA filtering and multi-objective search-history export.
  • YAML-driven AIPerf config (src/aiperf/config/) replaces the old common/config/ package. Layered design: typed models, loader (Jinja2 templating, env-var interpolation, dotted-path overrides, duration parsing, strict-undefined plan), flags converter (CLI ↔ YAML), resolution layer, sweep DSL (grid, QMC/Sobol, adaptive, multi-run, distributions), public JSON schema, plus a bundled library of 20+ ready-to-run templates and reference trace data.
  • Search recipes (src/aiperf/search_recipes/) with built-ins: max_concurrency_under_sla, max_goodput_under_slo, sla_breach_knee, itl_surface_fit, ttft_curve_fit, and Pareto sweep (axes, dominance, export, parser) plus post-process hooks.
  • CLI runner refactored into _cli_runner_{helpers,sweep_helpers,post_process}.py + _sweep_table_logger.py; new aiperf config command for template discovery and validation.
  • Auto-plot envelope (plot/auto_plot.py) materializes the resolved plot config into the artifact dir so aiperf plot <dir> reproduces. PlotEnvelopeConfig lets one YAML own its visualization.
  • Finite/NaN invariants: new common/finite.py (FiniteFloat, scrub_non_finite, nan_safe_mean/std, is_finite_value); property-test corpus with ratcheted baselines in tests/unit/property/.
  • Mock-server scheduler for deterministic latency simulation plus robustness/scheduler test suites.
  • Plugin schema extended with orchestrator categories (executor, planner, recipe).
  • Docs: new tutorials (sweeps, adaptive search, auto-plot, inline datasets, YAML config, YAML distributions), dev docs (sweep orchestrator design, global invariants, YAML config roadmap), sweeping reference (Bayesian optimization, search recipes, space-filling), troubleshooting/sweeps, and API reference for search history. Regenerated CLI and env-vars docs.
  • Tests: extensive new unit/component/integration coverage under tests/unit/{config,orchestrator,search_recipes,search_planner,cli_runner,property}/ and adversarial chaos scripts under tests/scripts/chaos/.

766 files changed (+111246, -24748). Full design write-up at docs/dev/sweep-orchestrator.md.
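
Two of the loader features listed in the summary, dotted-path overrides and duration parsing, can be pictured with a small sketch. This is an illustration only: `parse_duration`, `apply_override`, the unit suffixes, and the `phases.profiling.duration` path are assumptions made for this example, not the API in `src/aiperf/config/`.

```python
import re

# Assumed unit suffixes; the real loader may accept a different set.
_UNIT_MS = {"ms": 1, "s": 1_000, "m": 60_000, "h": 3_600_000}
_DURATION = re.compile(r"^(\d+(?:\.\d+)?)(ms|s|m|h)$")


def parse_duration(text: str) -> float:
    """Convert a string such as '250ms' or '5m' to seconds."""
    match = _DURATION.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    value, unit = match.groups()
    return float(value) * _UNIT_MS[unit] / 1000.0


def apply_override(config: dict, dotted_path: str, value) -> dict:
    """Set config['a']['b']['c'] = value for 'a.b.c', creating parents as needed."""
    *parents, leaf = dotted_path.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = value
    return config


# Hypothetical config keys, used only to exercise the two helpers.
config = {"endpoint": {"url": "http://localhost:8000"}}
apply_override(config, "phases.profiling.duration", parse_duration("5m"))
```

After the call, `config["phases"]["profiling"]["duration"]` holds `300.0` while the existing `endpoint` block is untouched, which is the property a CLI override layer needs.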

Architecture

One pipeline at every cardinality

A single benchmark, a multi-run for confidence intervals, a grid/scenarios sweep, a Sobol/LHS characterization, an adaptive BO search, and (coming soon) a cluster-distributed BO search are seven cardinalities of one pipeline. BenchmarkPlan describes what to run, MultiRunOrchestrator decides when and in what order, an optional SearchPlanner decides what to try next, and a RunExecutor decides how to actually run one cell.

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'htmlLabels': true}, 'themeVariables': {'fontSize': '14px'}}}%%
flowchart LR
    cfg["**config**<br/>AIPerfConfig"]
    exp["**expand**<br/>into N variations<br/>(BenchmarkPlan)"]
    run["**run**<br/>each variation<br/>M trials<br/>(via RunExecutor)"]
    agg["**aggregate**<br/>SweepAnalyzer<br/>-> sweep_aggregate/"]

    cfg --> exp --> run --> agg

    subgraph BACKENDS["RunExecutor backend (swap point)"]
        local["LocalSubprocessExecutor<br/>(today)"]
        k8s["K8sChildJobExecutor<br/><i>coming soon</i>"]
    end
    run -. selects .-> local
    run -. selects .-> k8s

    classDef stage fill:#e3f2fd,stroke:#1565c0,stroke-width:1.5px,color:#0d47a1
    classDef shipping fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px,color:#1b5e20
    classDef coming fill:#fafafa,stroke:#9e9e9e,stroke-width:1.5px,stroke-dasharray:5 3,color:#616161

    class cfg,exp,run,agg stage
    class local shipping
    class k8s coming

    style BACKENDS fill:transparent,stroke:#78909c,stroke-width:2px,stroke-dasharray:2 2
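
The four stages in the diagram (config, expand, run, aggregate) and the executor swap point can be sketched in a few lines. Everything here is a toy under stated assumptions: `Variation`, `Executor`, `FakeLocalExecutor`, `expand`, and `run_all` are hypothetical names, and the latency model is invented so the example is deterministic.

```python
from dataclasses import dataclass
from itertools import product
from statistics import mean
from typing import Protocol


@dataclass
class Variation:
    """One point in the sweep (hypothetical stand-in for a plan entry)."""
    params: dict


class Executor(Protocol):
    """Swap point: anything that can run one cell and return a metric."""
    def run(self, variation: Variation, trial: int) -> float: ...


class FakeLocalExecutor:
    """Toy backend: latency grows linearly with concurrency."""
    def run(self, variation: Variation, trial: int) -> float:
        return 10.0 * variation.params["concurrency"] + trial


def expand(grid: dict) -> list[Variation]:
    """Expand a grid of parameter lists into N variations."""
    keys = sorted(grid)
    combos = product(*(grid[k] for k in keys))
    return [Variation(dict(zip(keys, combo))) for combo in combos]


def run_all(variations: list[Variation], executor: Executor, trials: int) -> dict:
    """Run each variation `trials` times and aggregate to a mean."""
    return {
        tuple(sorted(v.params.items())): mean(executor.run(v, t) for t in range(trials))
        for v in variations
    }


# A sweep: N=3 variations, M=3 trials each.
sweep = run_all(expand({"concurrency": [1, 2, 4]}), FakeLocalExecutor(), trials=3)
# A single benchmark is the same pipeline with N=1, M=1.
single = run_all(expand({"concurrency": [8]}), FakeLocalExecutor(), trials=1)
```

Because `run_all` only depends on the `Executor` protocol, a different backend can be substituted without touching the expand or aggregate steps.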

Search recipes → AdaptiveSearchSweep → planner

A user can author an AdaptiveSearchSweep directly under sweep: (low level) or pick a search_recipe plugin (high level) that builds one from a recipe + the user's existing benchmark config.

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'htmlLabels': true}, 'themeVariables': {'fontSize': '14px'}}}%%
flowchart TB
    subgraph IN["user inputs"]
        cli["**--search-recipe NAME --param k=v**<br/>or<br/>**sweep: { type: adaptive_search, … }** in YAML<br/>or<br/>--search-space PATH:LO,HI[:KIND]<br/>--search-metric METRIC<br/>--search-stat STAT<br/>--search-direction DIRECTION<br/>--search-sla metric:stat:op:threshold (×N)"]
        uc["**AIPerfConfig.benchmark**<br/><i>(models, endpoint, phases, …)</i>"]
    end

    subgraph RECIPE["recipe layer (optional)"]
        ctx["**SearchRecipeContext**<br/><i>(benchmark_config, sla_targets,<br/>sweep_overrides)</i>"]
        rc["**SearchRecipe** plugin (Protocol)<br/><i>built-ins:</i><br/>max-throughput-ttft-sla<br/>max-throughput-itl-sla<br/>concurrency-ramp<br/>prefill-ttft-curve / decode-itl-curve<br/>max-goodput-under-slo<br/>max-concurrency-under-sla<br/>pareto-sweep"]
        out["**SearchRecipeOutput**<br/><i>(exactly one of:<br/>adaptive_search | sweep_parameters | scenarios)</i><br/>+ sla_filters, slos, post_process"]
    end

    subgraph CFG["adaptive sweep variant"]
        asc["**AdaptiveSearchSweep**<br/><i>(SweepConfig variant,<br/>type=adaptive_search)</i><br/>search_space, objectives,<br/>max_iterations, sla_filters,<br/>post_process, planner, …"]
    end

    subgraph DRIVE["runtime drivers"]
        plan["**AIPerfConfig.sweep**<br/>= AdaptiveSearchSweep"]
        plan2["**BenchmarkPlan.sweep**<br/><i>(is_adaptive_search is true)</i>"]
        sp["**SearchPlanner** plugin<br/><i>(BayesianSearchPlanner |<br/>MonotonicSLASearchPlanner |<br/>SmoothIsotonicSLAPlanner |<br/>OptunaSearchPlanner)</i>"]
        pph["**search_recipe_post_process** plugin<br/><i>(degradation_knee_detect, ttft_curve_fit,<br/>itl_surface_fit, sla_breach_knee,<br/>pareto_sweep_export)</i>"]
    end

    cli --> rc
    uc --> ctx --> rc --> out --> asc
    cli -.direct path.-> asc

    asc --> plan
    plan --> plan2
    plan2 --> sp
    plan2 --> pph

    classDef data fill:#e3f2fd,stroke:#1565c0,stroke-width:1.5px,color:#0d47a1
    classDef proc fill:#fff3e0,stroke:#e65100,stroke-width:1.5px,color:#bf360c
    classDef decision fill:#f3e5f5,stroke:#6a1b9a,stroke-width:1.5px,color:#4a148c
    classDef art fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px,color:#1b5e20

    class cli,uc data
    class ctx,rc proc
    class out,asc art
    class plan,plan2,sp,pph decision
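
Two details from the diagram lend themselves to a short sketch: the `metric:stat:op:threshold` shape of `--search-sla`, and the "exactly one of" rule on the recipe output. The code below is a minimal illustration, not the plugin's implementation. `RecipeOutputSketch`, `parse_sla`, `sla_passes`, the operator tokens (`lt`, `le`, `gt`, `ge`), and the metric name are all assumptions.

```python
import operator
from dataclasses import dataclass, field
from typing import Optional

# Assumed operator tokens; the real flag may spell these differently.
_OPS = {"lt": operator.lt, "le": operator.le, "gt": operator.gt, "ge": operator.ge}


def parse_sla(spec: str) -> tuple:
    """Split 'metric:stat:op:threshold' into typed parts."""
    metric, stat, op, threshold = spec.split(":")
    if op not in _OPS:
        raise ValueError(f"unknown operator {op!r}")
    return metric, stat, op, float(threshold)


def sla_passes(spec: str, observed: float) -> bool:
    """Check one observed statistic against one SLA spec."""
    _, _, op, threshold = parse_sla(spec)
    return _OPS[op](observed, threshold)


@dataclass
class RecipeOutputSketch:
    """Hypothetical stand-in: exactly one sweep shape may be populated."""
    adaptive_search: Optional[dict] = None
    sweep_parameters: Optional[dict] = None
    scenarios: Optional[list] = None
    sla_filters: list = field(default_factory=list)

    def __post_init__(self) -> None:
        shapes = ("adaptive_search", "sweep_parameters", "scenarios")
        populated = [name for name in shapes if getattr(self, name) is not None]
        if len(populated) != 1:
            raise ValueError(f"expected exactly one of {shapes}, got {populated}")
        self.kind = populated[0]


output = RecipeOutputSketch(
    adaptive_search={"search_space": {"concurrency": (1, 256)}},
    sla_filters=["time_to_first_token:p99:le:500"],
)
```

Validating the one-of rule at construction time means a recipe that returns two shapes fails immediately instead of producing an ambiguous sweep later.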

Adaptive search — propose → execute → record loop

The BO outer loop is a propose -> execute -> record cycle inside MultiRunOrchestrator.execute_adaptive_search. BenchmarkRun and RunExecutor are unchanged from the grid path; the difference is that BenchmarkPlan.configs starts with one seed config and grows by one per iteration as the planner asks for the next point.

%%{init: {'sequence': {'mirrorActors': false}, 'themeVariables': {'fontSize': '14px'}}}%%
sequenceDiagram
    autonumber
    participant Plan as BenchmarkPlan<br/>(sweep is AdaptiveSearchSweep)
    participant Orch as MultiRunOrchestrator
    participant Pl as SearchPlanner<br/>(Bayesian / Monotonic / Optuna)
    participant Run as BenchmarkRun
    participant Exec as RunExecutor
    participant Res as RunResult
    participant PP as PostProcessHandler
    participant Out as search_history.json /<br/>sweep_aggregate

    Orch->>Pl: planner instantiated upstream<br/>(via _build_search_planner)<br/>and passed into execute

    loop until converged or max_iterations
        Orch->>Pl: ask
        Pl-->>Orch: (BenchmarkConfig_k, SweepVariation_k)<br/>or None (converged -> convergence_reason)
        alt got proposal
            Orch->>Orch: _run_independent_cell<br/>(fresh ExecutionStrategy per cell)
            loop trials inner (until strategy says stop)
                Orch->>Run: BenchmarkRun for cfg_k, variation_k, trial t, …
                Orch->>Exec: run the cell
                Exec-->>Res: RunResult
            end
            Orch->>Pl: tell with variation_k, cell_results
            Pl->>Pl: filter by SLAFilter,<br/>compute objective scalar,<br/>plateau / patience / max-iter check
            Orch-->>Out: write_search_history<br/>(incremental, includes<br/>boundary_summary if planner has it)
        end
    end

    Orch->>PP: process the sweep_aggregate with params<br/>(per PostProcessSpec on sweep)
    PP-->>Out: knees, curve fits, …
    Orch-->>Out: profile_export_aiperf_sweep.{json,csv}
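
The propose, execute, record cycle in the sequence diagram can be sketched with a deliberately simple planner. Note the substitution: this example uses plain integer bisection rather than any of the Bayesian, Monotonic, Smooth-Isotonic, or Optuna planners in the PR, and `BisectionPlanner`, `run_search`, and the linear latency model are hypothetical. Only the `ask` / `tell` shape and the `convergence_reason` idea come from the diagram.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Optional


@dataclass
class BisectionPlanner:
    lo: int          # highest concurrency assumed to meet the SLA
    hi: int          # lowest concurrency assumed to breach the SLA
    sla_ms: float
    convergence_reason: Optional[str] = None

    def ask(self) -> Optional[int]:
        """Propose the next point, or None once the bracket is closed."""
        if self.hi - self.lo <= 1:
            self.convergence_reason = "bracket_closed"
            return None
        return (self.lo + self.hi) // 2

    def tell(self, point: int, latencies_ms: list) -> None:
        """Record the cell result by shrinking the bracket."""
        if mean(latencies_ms) <= self.sla_ms:
            self.lo = point
        else:
            self.hi = point


def run_search(planner, run_cell: Callable[[int, int], float],
               max_iterations: int, trials: int = 3) -> list:
    history = []
    for _ in range(max_iterations):
        point = planner.ask()
        if point is None:            # converged
            break
        results = [run_cell(point, t) for t in range(trials)]
        planner.tell(point, results)
        history.append({"point": point, "mean_ms": mean(results)})
    return history


# Toy workload: 20 ms of latency per unit of concurrency, SLA of 500 ms.
planner = BisectionPlanner(lo=1, hi=65, sla_ms=500.0)
history = run_search(planner, lambda c, t: 20.0 * c, max_iterations=20)
```

With this toy workload the search visits 33, 17, 25, 29, 27, and 26 before the bracket closes at `lo=25`, so six cells are run instead of the 64 a full grid would need.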

Test plan

  • uv run pytest -n auto tests/unit/
  • uv run pytest -n auto tests/unit/property/ (finite/numeric invariants)
  • uv run pytest -n auto -m component_integration
  • uv run pytest -n auto -m integration
  • make validate-plugin-schemas
  • make generate-all-docs is idempotent (pre-commit confirms)
  • Smoke a few bundled templates: aiperf run -f src/aiperf/config/templates/minimal.yaml, latency_test.yaml, sweep_with_plot.yaml
  • Confirm aiperf plot <artifact_dir> reproduces the auto-plot envelope for a sweep run
  • Run a search recipe end-to-end (e.g. max_concurrency_under_sla) and verify search_history.json export
  • Verify Bayesian and Smooth-Isotonic planners against a known SLA-cliff workload on the mock server

🤖 Generated with Claude Code

github-actions Bot commented May 11, 2026


Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c357b7c48b1ec429317782209820491f88ecf396

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c357b7c48b1ec429317782209820491f88ecf396

Last updated for commit: c357b7c

@ajcasagrande ajcasagrande changed the title feat(orchestrator,config): port sweep orchestrator, YAML config, search recipes May 11, 2026
…g, search recipes

Major reorganization that ports the parameter-sweep orchestrator into a
first-class subsystem, introduces a full YAML-driven AIPerf configuration
language alongside the existing CLI, and ships a library of reusable
"search recipes" plus a Bayesian/adaptive search-planner stack.

Orchestrator
- New executor abstraction (`orchestrator/executor.py`, `local_executor.py`)
  decouples per-cell run execution from the orchestrator loop.
- New `search_planner/` package with pluggable planners:
  - `bayesian.py` + `_botorch_kernel.py` (BoTorch GP / DSP kernel)
  - `optuna_planner.py` + `_optuna_helpers.py` (multi-objective via Optuna)
  - `monotonic.py` + `_monotonic_boundary.py`
  - `smooth_isotonic.py` + `_smooth_isotonic_{fit,boundary,phases}.py`
  - Shared helpers: cliff detection, margin normalization, replicate
    budget, pooled percentile, SLA helpers, outcome constraints.
- Sweep aggregation grows `sweep_sla_filter.py` and multi-objective
  search-history export (`exporters/search_history.py`).
- New convergence strategy hooks, JSONL loader, subprocess runner
  refactors, and per-cell callbacks.

YAML configuration (`src/aiperf/config/`)
- Replaces the old `src/aiperf/common/config/` package with a layered
  design: typed config models, loader (Jinja2 templating, env-var
  interpolation, dotted-path overrides, duration parsing, normalizers,
  strict-undefined plan), flags converter (CLI <-> YAML), resolution
  layer (predicates + resolvers + plan), sweep DSL (grid, QMC/Sobol,
  adaptive, multi-run, sampling, distributions), and a public JSON
  schema (`config/schema/aiperf-config.schema.json`).
- Bundled template library (`config/templates/`): 20+ ready-to-run
  YAMLs covering minimal, latency, goodput SLO, long context, ramping,
  multi-turn, multimodal vision/audio, embeddings, fixed schedule,
  trace replay, KV cache test, multi-URL load balancing, sweep with
  plot, sweep distributions, warmup profiling, request cancellation,
  Jinja2 variables, env-var production, inline dataset, scenario
  workload profiles, GPU telemetry, HTTP trace metrics, user files,
  speed bench sweep, plus reference trace JSONL data.
- Communications config split into `comm/` (TCP, IPC, dual-bind, build).
- Dataset config split into `dataset/` (content, resolver, trace,
  video) with inline-record support.

Search recipes (`src/aiperf/search_recipes/`)
- New recipe registry with built-ins: `max_concurrency_under_sla`,
  `max_goodput_under_slo`, `sla_breach_knee`, `itl_surface_fit`,
  `ttft_curve_fit`, and Pareto sweep (axes, dominance, export, parser).
- Recipe post-process hooks with shared infrastructure.
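
The "dominance" part of the Pareto sweep refers to the standard Pareto-dominance test. The sketch below is a generic version of that test. `dominates`, `pareto_front`, and the `"max"` / `"min"` direction labels are names chosen for this example and are not taken from `src/aiperf/search_recipes/`.

```python
def dominates(a: tuple, b: tuple, directions: tuple) -> bool:
    """True if `a` is no worse than `b` on every axis and strictly better on one."""
    no_worse = all(
        (x >= y) if d == "max" else (x <= y)
        for x, y, d in zip(a, b, directions)
    )
    strictly_better = any(
        (x > y) if d == "max" else (x < y)
        for x, y, d in zip(a, b, directions)
    )
    return no_worse and strictly_better


def pareto_front(points: list, directions: tuple) -> list:
    """Keep the points that no other point dominates."""
    return [
        p for p in points
        if not any(dominates(q, p, directions) for q in points)
    ]


# Axes: (throughput, latency); maximize the first, minimize the second.
cells = [(100, 50), (120, 80), (90, 60), (120, 70)]
front = pareto_front(cells, ("max", "min"))
```

Here `(90, 60)` loses to `(100, 50)` on both axes and `(120, 80)` loses to `(120, 70)` on latency, so the front is `[(100, 50), (120, 70)]`.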

CLI / runner
- `cli_runner.py` factored into `_cli_runner_helpers.py`,
  `_cli_runner_sweep_helpers.py`, `_cli_runner_post_process.py`, plus
  `_sweep_table_logger.py` for live progress rendering.
- New `aiperf config` command for template discovery and validation.
- Profile/plot/service commands updated for the new config layer.

Auto-plot envelope
- `plot/auto_plot.py` materializes a resolved plot envelope into the
  artifact dir so `aiperf plot <dir>` reproduces the chart pipeline.
- `PlotEnvelopeConfig` allows a single AIPerf YAML to own its
  visualization.

Finite / numeric invariants
- New `common/finite.py` with `FiniteFloat`, `scrub_non_finite`,
  `nan_safe_mean`, `nan_safe_std`, `is_finite_value`.
- Property test corpus (`tests/unit/property/`) with field/bounds
  baselines, finite invariants, Pydantic field fuzz, and config-dump
  round-trip checks. CI ratchets these to zero.
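
As a rough picture of what finite-value helpers of this kind do, here is a minimal sketch. The function names below (`is_finite_number`, `scrub`, `finite_mean`) are deliberately different from those in `common/finite.py`, and the behavior shown (NaN and infinity become `None`, means skip non-finite inputs) is an assumption about the intent, not a description of the actual module.

```python
import math
from statistics import fmean


def is_finite_number(value) -> bool:
    """True for ints and floats that are neither NaN nor infinite."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isfinite(value)


def scrub(value):
    """Recursively replace NaN/inf floats with None so exports stay valid JSON."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {k: scrub(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [scrub(v) for v in value]
    return value


def finite_mean(values):
    """Mean over the finite entries only; None if there are none."""
    finite = [v for v in values if is_finite_number(v)]
    return fmean(finite) if finite else None


record = {"ttft_ms": float("nan"), "samples": [1.0, float("inf"), 3.0]}
clean = scrub(record)
avg = finite_mean(record["samples"])
```

The scrubbed record can be serialized with `json.dumps(..., allow_nan=False)`, which would raise on the original.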

Metrics
- New `good_request_fraction_metric` for goodput SLO recipes.
- Records, exporters, and post-processors threaded with redaction and
  finite-value scrubbing on export boundaries.

Mock server (`tests/aiperf_mock_server/`)
- New deterministic `scheduler.py` for repeatable latency simulation.
- Robustness and scheduler test suites for CI coverage.

Plugins
- New orchestrator plugin categories and schema
  (`plugin/schema/_orchestrator_schemas.py`,
  `plugin/categories.yaml` updates) for executors, planners, recipes.

Testing
- Extensive new unit/component/integration suites under
  `tests/unit/{config,orchestrator,search_recipes,search_planner,
  cli_runner,property}/` and adversarial chaos scripts under
  `tests/scripts/chaos/`.
- New component-integration smoke tests for Sobol sweeps,
  multi-objective E2E, process-title, and recipe collapse-knee.

Docs
- New tutorials: sweeps, adaptive search, auto-plot, inline datasets,
  YAML config, YAML distributions.
- New developer docs: sweep orchestrator design, global invariants,
  YAML config roadmap and future goals.
- New sweeping reference: bayesian optimization, search recipes,
  space-filling designs; new troubleshooting/sweeps guide.
- New API reference: search history.
- Regenerated `docs/cli-options.md` and `docs/environment-variables.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@ajcasagrande ajcasagrande force-pushed the ajc/sweep-orchestrator-port branch from 20e5ca9 to d2da6f2 Compare May 11, 2026 19:18
@ajcasagrande ajcasagrande changed the title feat(orchestrator,config): Adaptive search sweep orchestrator, YAML v2 config, search recipes May 11, 2026
@github-actions github-actions Bot added the feat label May 11, 2026
@ajcasagrande ajcasagrande changed the title feat: Adaptive search sweep orchestrator, YAML v2 config, search recipes May 11, 2026
@ajcasagrande ajcasagrande changed the title feat: YAML config language + sweep orchestrator with adaptive BO & recipes May 11, 2026
ajcasagrande and others added 5 commits May 11, 2026 12:30
Three issues caused 'Validate (and publish on main) synced docs' to fail:

1. `environment-variables.md:148` had a literal `{"tight": 20000}` JSON
   example in prose. MDX parses `{...}` as a JSX expression and `"tight":
   20000` is not a valid JS expression. Fixed by wrapping the override
   example in backticks (inline code) in the source description string in
   `common/environment.py`, which auto-generates the doc.

2. `dev/sweep-orchestrator.md:443` had a literal `<= max_iter` in a
   table cell. MDX sees `<` as the start of a JSX element and chokes
   on `=`. Replaced with the Unicode `≤` (the same doc already uses
   Unicode symbols such as `×`).

3. Three relative `../../src/...` links to source files broke under
   `--strict-broken-links` because Fern publishes from `fern/pages-dev/`
   which has no `../../src/` parent. Converted to absolute
   `github.com/ai-dynamo/aiperf/blob/main/...` URLs, matching the
   pattern already used in `docs/api/search-history.md` and
   `docs/reproducibility.md`.

Verified locally by cloning the `docs-website` branch, syncing the PR's
`docs/` into `fern/pages-dev/`, running `md_to_mdx.py`, and getting
`fern check --warnings --strict-broken-links` to 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
CI pre-commit `test-imports` hook was failing with:

    aiperf.orchestrator.search_planner._botorch_kernel:
      ModuleNotFoundError("No module named 'torch'")

torch/gpytorch live behind the `[optuna]` extra and are not installed
in the lint/pre-commit env. The module was importing them at top level
even though only `make_dsp_kernel` uses them. Moved the imports inside
the function and kept a TYPE_CHECKING import for the `ScaleKernel`
return annotation. Module now imports cleanly without the extra;
calling `make_dsp_kernel` without it still raises a clear
ModuleNotFoundError, matching the existing optuna-gated UX.
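
The deferred-import pattern described here looks roughly like the sketch below. The module `hypothetical_heavy_dep` and its `Kernel` class are placeholders invented for the example (they stand in for an optional dependency that is not installed); only the overall shape, a `TYPE_CHECKING` import for the annotation plus a function-local runtime import, follows the commit message.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen only by type checkers; never executed at runtime.
    from hypothetical_heavy_dep import Kernel  # placeholder optional dependency


def make_kernel(dim: int) -> "Kernel":
    """Build a kernel, importing the optional dependency only when called."""
    # Function-local import: importing this module stays cheap and safe
    # even when the optional extra is not installed.
    import hypothetical_heavy_dep

    return hypothetical_heavy_dep.Kernel(dim)


# Importing and defining the function succeeds without the dependency.
# Calling it surfaces the missing package as a ModuleNotFoundError.
try:
    make_kernel(3)
    error_message = None
except ModuleNotFoundError as exc:
    error_message = str(exc)
```

The string annotation `"Kernel"` keeps the return type visible to type checkers without forcing the import at definition time.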

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…tests

`tests/unit/config/test_v1_file_dataset_rejections.py` hardcoded
``input_file="/tmp/mc.jsonl"`` with a comment claiming "path doesn't
have to exist for converter". That's false: ``CLIConfig.input_file``
runs the ``parse_file`` validator in
``src/aiperf/config/loader/parsing.py`` which requires the path to be
an existing file or directory. The tests only passed on dev machines
where ``/tmp/mc.jsonl`` happened to exist from prior runs; CI fails
all 7 cases with
``ValidationError: '/tmp/mc.jsonl' is not a valid file or directory``.

Switched to a per-test ``mc_jsonl`` fixture that creates an empty
JSONL under pytest's ``tmp_path``. The converter only reads the
*path*, not the file contents, so an empty file is sufficient.
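
The failure mode and the fix can be reproduced in miniature. In the sketch below, `path_must_exist` is a hypothetical stand-in for a validator that rejects missing paths, and `make_empty_jsonl` plays the role of the fixture body. In a pytest suite the directory would come from the built-in `tmp_path` fixture; a `tempfile.TemporaryDirectory` is used here so the example runs on its own.

```python
import tempfile
from pathlib import Path


def path_must_exist(value: str) -> Path:
    """Stand-in validator: reject paths that are not on disk."""
    path = Path(value)
    if not path.exists():
        raise ValueError(f"'{value}' is not a valid file or directory")
    return path


def make_empty_jsonl(directory: Path, name: str = "mc.jsonl") -> Path:
    """Create an empty JSONL file; enough when only the path is inspected."""
    path = directory / name
    path.touch()
    return path


with tempfile.TemporaryDirectory() as tmp:
    # A per-test file always exists, regardless of what earlier runs left behind.
    validated = path_must_exist(str(make_empty_jsonl(Path(tmp))))
    size = validated.stat().st_size

    # A path that was never created fails on every machine, not just in CI.
    try:
        path_must_exist(str(Path(tmp) / "never_created.jsonl"))
        missing_rejected = False
    except ValueError:
        missing_rejected = True
```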

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Keep the four agent instruction files synchronized while moving detailed finite-value guidance to the canonical docs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@debermudez debermudez left a comment

Contributor

This is a thorough, well-structured architectural overhaul. The new MultiRunOrchestrator, v2 config system, and adaptive search planner are solid. The issues below are mostly low-severity cleanup items, with one medium-priority data-loss risk in the adaptive cancel path.

13 findings: 0 critical, 0 high, 1 medium, 9 low, 3 nit.

Comment thread src/aiperf/orchestrator/orchestrator.py
Comment thread src/aiperf/orchestrator/orchestrator.py
Comment thread src/aiperf/orchestrator/orchestrator.py
Comment thread src/aiperf/orchestrator/orchestrator.py
Comment thread src/aiperf/orchestrator/strategies.py
Comment thread src/aiperf/cli_runner.py Outdated
Comment thread src/aiperf/cli_runner.py Outdated
Comment thread pyproject.toml
Comment thread src/aiperf/_cli_runner_post_process.py Outdated
Comment thread src/aiperf/config/resolution/plan.py
ajcasagrande and others added 15 commits May 14, 2026 18:28
Merge the latest mainline changes into the sweep orchestrator branch while keeping the branch's config-v2 model authoritative. OTel and MLflow now live as first-class benchmark config groups with runtime call sites using native nested access only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Drop the BenchmarkConfig flat-forwarding properties so call sites read
cfg.artifacts, cfg.mlflow, cfg.otel, and run.benchmark_id directly. Tests
build real BenchmarkConfig / BenchmarkRun instead of CLIConfig DTO mocks.
The OTel fanout subprocess now consumes the native MLflowConfig.

Split src/aiperf/config/artifacts.py into one section per file matching
the benchmark fields: mlflow, otel, server_metrics, gpu_telemetry. Same
for TokenizerConfig, LoggingConfig, and SLOsConfig out of models.py and
runtime.py. External imports through aiperf.config are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ugin metadata

Replace the three hard-coded dicts in user_config.py
(_LOCAL_COLLECTOR_KEYWORDS, _LOCAL_ONLY_COLLECTORS, _LOCAL_COLLECTOR_INSTALL_HINTS)
with a typed metadata schema on the gpu_telemetry_collector plugin category.
Adding a new local collector now only requires editing plugins.yaml:
`is_local: true` plus an `install_hint`. Module name defaults to the plugin
name when `import_module` is unset.

- New `GPUTelemetryCollectorMetadata` Pydantic class (is_local, import_module,
  install_hint) wired via `metadata_class` in categories.yaml.
- `get_gpu_telemetry_collector_metadata` helper + `_CATEGORY_METADATA_CLASSES`
  registration in plugin/plugins.py, matching the existing get_endpoint/plot/
  service_metadata pattern.
- Metadata populated for pynvml and amdsmi (dcgm relies on `is_local: false`
  default).
- user_config helpers `_local_collector_keywords`, `_is_local_collector`, and
  `_ensure_local_collector_importable` consult plugin metadata. The "Invalid
  GPU telemetry item" error message self-derives from the keywords dict, so a
  new local collector flows through with no edits to error text.
- New test `test_local_collector_discovered_dynamically_from_plugin_metadata`
  registers a fake collector via `mock_plugin` (and a matching enum extension)
  to prove selection, conflict detection, local-vs-URL guardrail, and error
  message derivation all generalize beyond pynvml/amdsmi.
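
The metadata-driven approach can be illustrated with a small sketch. It uses a plain dataclass rather than the Pydantic class named above, and an in-memory dict in place of `plugins.yaml`. `CollectorMetadataSketch`, `REGISTRY`, `module_name`, `local_keywords`, the `fakegpu` entry, and the install-hint strings are all placeholders invented for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CollectorMetadataSketch:
    is_local: bool = False               # remote unless declared local
    import_module: Optional[str] = None  # falls back to the plugin name
    install_hint: Optional[str] = None


# Stand-in for entries that would live in plugins.yaml. Hints are placeholders.
REGISTRY = {
    "dcgm": CollectorMetadataSketch(),
    "pynvml": CollectorMetadataSketch(
        is_local=True, install_hint="<install hint for pynvml>"
    ),
    "fakegpu": CollectorMetadataSketch(
        is_local=True,
        import_module="fakegpu_bindings",
        install_hint="<install hint for fakegpu>",
    ),
}


def module_name(plugin: str) -> str:
    """Module to import for a collector; defaults to the plugin name."""
    meta = REGISTRY[plugin]
    return meta.import_module or plugin


def local_keywords() -> list[str]:
    """Derive the set of local collectors from metadata instead of a hard-coded dict."""
    return sorted(name for name, meta in REGISTRY.items() if meta.is_local)
```

Adding the `fakegpu` entry changes the result of `local_keywords()` without any edit to the helper functions, which is the property the commit is after.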

Behavior preserved: every existing test passes because `str(GPUTelemetryCollectorType.PYNVML) == "pynvml"`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Move each *Defaults dataclass next to the config section it belongs to:
EndpointDefaults, OutputDefaults, MLflowDefaults, TokenizerDefaults,
ServiceDefaults into endpoint/artifacts/mlflow/tokenizer/runtime modules.
Dataset and prompt modality defaults move to dataset/defaults.py. The
aggregator src/aiperf/config/defaults.py is deleted. External imports
through aiperf.config are unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Drive local-collector classification from plugin metadata
(GPUTelemetryCollectorMetadata) so adding a new local collector only
requires editing plugins.yaml. The original PR refactored the v1
aiperf.common.config.user_config module that no longer exists on this
branch; the equivalent metadata-driven validation now lives on
GpuTelemetryConfig (local-vs-URL guardrail and install-hint surfacing)
and the CLI converter derives local keywords from plugin metadata. The
AMD ROCm amdsmi collector ships with this merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Medium:
- execute_adaptive_search: cancel path now writes search_history.json
  before returning (previously left stale history after cancellation).
  Factored a `_flush_history` closure to share the write across the
  cancel, converged, and per-iteration paths.
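
A shared flush closure of this kind might look like the sketch below. It is a toy loop, not `execute_adaptive_search`: `run_search`, `should_cancel`, and the entry fields are hypothetical, and only the idea of one writer used by the cancel, converged, and per-iteration paths comes from the description above.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def run_search(points: list, out_dir: Path, should_cancel: Callable[[int], bool]) -> str:
    history: list = []

    def flush_history() -> None:
        # One writer shared by every exit path, so the file never goes stale.
        (out_dir / "search_history.json").write_text(json.dumps(history))

    for i, point in enumerate(points):
        entry = {"iteration": i, "point": point, "status": "running"}
        history.append(entry)
        if should_cancel(i):
            entry["status"] = "cancelled"
            flush_history()          # cancel path
            return "cancelled"
        entry["status"] = "done"
        flush_history()              # per-iteration path
    flush_history()                  # converged or exhausted path
    return "converged"


with tempfile.TemporaryDirectory() as tmp:
    outcome = run_search([4, 8, 16], Path(tmp), should_cancel=lambda i: i == 1)
    written = json.loads((Path(tmp) / "search_history.json").read_text())
```

After a cancellation on the second iteration, the file on disk contains both entries, with the last one marked `cancelled` rather than missing.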

Low / nit cleanup:
- Add `SearchPlanner.iter_count` property; orchestrator no longer reads
  the private `planner._iter`.
- Remove dead `_build_convergence_criterion` copy in orchestrator.py
  (canonical version lives in `_cli_runner_helpers`).
- Drop the duplicate `_plan_iteration_order` in orchestrator.py; import
  the canonical one from `_cli_runner_sweep_helpers`.
- Remove dead `LocalSubprocessExecutor._write_redacted_config`
  (`EndpointConfig.api_key` field_serializer already redacts).
- Replace `# noqa: ANN202` placeholders with concrete return types on
  `_build_convergence_criterion`, `_build_search_planner`,
  `_maybe_compute_detailed`, and `_setup_ui_queues`.
- `_log_failed_sweep_variations`: extract a single `_format_key` helper
  and stop double-formatting the key string between the warning summary
  and the per-run loop.
- `_summarize_and_export`: run per-variation and sweep-aggregate exports
  concurrently under one `asyncio.run(gather(...))` instead of two
  sequential `asyncio.run` calls.
- Distinguish "1 successful run" vs "1 variation succeeded" wording in
  the aggregate-summary warning so sweep users see accurate language.
- Collapse `FixedTrialsStrategy._sanitize_label` into the module-level
  helper it duplicated verbatim.
- Remove the unused `_scrub_non_finite` shim in `_cli_runner_post_process`;
  the one test caller now imports `scrub_non_finite` from `aiperf.common.finite`.
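
For the `_summarize_and_export` item in the list above, the single-event-loop pattern can be sketched as follows. The two export coroutines are placeholders that only sleep briefly. One detail worth noting: `asyncio.run` expects a coroutine, so `gather` is awaited inside a small wrapper coroutine rather than passed to `asyncio.run` directly.

```python
import asyncio


async def export_per_variation() -> str:
    await asyncio.sleep(0.01)   # placeholder for real export I/O
    return "per_variation"


async def export_sweep_aggregate() -> str:
    await asyncio.sleep(0.01)   # placeholder for real export I/O
    return "sweep_aggregate"


async def export_all() -> list:
    # Both exports share one event loop and overlap their waits.
    return await asyncio.gather(export_per_variation(), export_sweep_aggregate())


# One asyncio.run call instead of two sequential ones.
results = asyncio.run(export_all())
```

`gather` returns results in the order the awaitables were passed, so `results` is `["per_variation", "sweep_aggregate"]` regardless of which export finishes first.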

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…kage

Reviewer feedback flagged the leading-underscore-at-package-root layout
as unusual. Convert `cli_runner.py` into a package and move the three
`_cli_runner_*` helpers plus `_sweep_table_logger.py` in as private
submodules. No behavior change; pure relocation + import rewrites.

Mapping:
- src/aiperf/cli_runner.py                 -> src/aiperf/cli_runner/__init__.py
- src/aiperf/_cli_runner_helpers.py        -> src/aiperf/cli_runner/_helpers.py
- src/aiperf/_cli_runner_post_process.py   -> src/aiperf/cli_runner/_post_process.py
- src/aiperf/_cli_runner_sweep_helpers.py  -> src/aiperf/cli_runner/_sweep_helpers.py
- src/aiperf/_sweep_table_logger.py        -> src/aiperf/cli_runner/_sweep_table_logger.py

All imports in src/, tests/, docs/, and the ruff/ergonomics baselines
updated to match. Pre-commit and the full 11,736-test unit suite pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
`cli_runner/__init__.py` was 790 lines (grandfathered above the 500-line
ergonomics cap). Split the obvious independent concerns into siblings;
__init__.py keeps the orchestrating entry points + the helpers that need
to share the module namespace with test mock targets.

New layout:
- _callbacks.py        CompletedRun, OnComplete, _invoke_callbacks
- _preflight.py        _preflight_artifact_dir/_fd_limit/_endpoint_ready
- _process_setup.py    mp start method, log queue, FD_CLOEXEC, tokenizer preload
- _single_run.py       _run_single_benchmark
- _failure_summary.py  _log_failed_sweep_variations
- __init__.py          run_benchmark + multi-run orchestration + helpers

__init__.py drops 790 -> 409 lines, falling out of the ergonomics
baseline. All eleven `patch("aiperf.cli_runner.<name>")` targets remain
reachable through __init__.py imports. test_cli_runner_macos.py imports
the _process_setup helpers from their new path directly.

11,736 unit tests pass; all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Previous layout was organized by 'what kind of helper' (preflight,
callbacks, process-setup) with two grab-bag files (_helpers.py,
_sweep_helpers.py) and an __init__.py that mixed dispatch with
multi-run orchestration. This pass organizes by the package's three
real domains: execution, aggregation, display.

Deleted:
  _helpers.py            distributed across _strategy/_aggregate/_banner
  _failure_summary.py    folded into _multi_run.py (its only caller)

Renamed:
  _sweep_helpers.py      -> _sweep_aggregate.py
  _sweep_table_logger.py -> _sweep_table.py

Created:
  _strategy.py       build_strategy, _build_convergence_criterion,
                     _build_search_planner, validate_convergence_config
  _aggregate.py      aggregate_and_export, print_aggregate_summary,
                     _maybe_compute_detailed, priority-metric block printers
  _banner.py         log_multi_run_banner, _log_search_planner_active
  _multi_run.py      _run_multi_benchmark + _execute_multi_benchmark +
                     _summarize_and_export + _estimate_and_log_duration +
                     _validate_multi_benchmark_plan +
                     _reject_in_process_sweep_under_operator +
                     _log_failed_sweep_variations (was _failure_summary)

__init__.py drops 415 -> 120 lines and now contains only the public
surface (run_benchmark, CompletedRun, OnComplete, _make_benchmark_run)
plus re-imports of the run_benchmark-layer patch targets (_preflight_*,
_run_single_benchmark, _run_multi_benchmark) so existing dispatch
tests keep working without touching their patches.

Test patches for multi-run internals (aggregate_and_export,
_estimate_and_log_duration, _summarize_and_export, build_strategy,
_build_search_planner, _log_search_planner_active) now correctly target
the call site (`aiperf.cli_runner._multi_run.<name>`) following the
standard 'patch where it's looked up' rule.
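
The "patch where it's looked up" rule is easy to demonstrate in isolation. The sketch below builds two throwaway modules in memory, `demo_strategy` (where a function is defined) and `demo_multi_run` (which imports and calls it). Both module names are invented for the example and have nothing to do with the real package layout.

```python
import sys
import types
from unittest.mock import patch

# Module that defines the helper.
strategy = types.ModuleType("demo_strategy")
exec("def build_strategy():\n    return 'real'\n", strategy.__dict__)
sys.modules["demo_strategy"] = strategy

# Module that imports the helper by name and calls it.
multi_run = types.ModuleType("demo_multi_run")
sys.modules["demo_multi_run"] = multi_run
exec(
    "from demo_strategy import build_strategy\n"
    "\n"
    "def run():\n"
    "    return build_strategy()\n",
    multi_run.__dict__,
)

# Patching the definition site does not affect the caller: demo_multi_run
# holds its own reference, bound at import time.
with patch("demo_strategy.build_strategy", return_value="fake"):
    patched_at_definition = multi_run.run()

# Patching the call site replaces the name the caller actually looks up.
with patch("demo_multi_run.build_strategy", return_value="fake"):
    patched_at_call_site = multi_run.run()
```

Only the second patch changes what `run()` returns, which is why moving a helper between modules requires retargeting the test patches to the module that calls it.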

ruff_baseline.json and ergonomics_baseline.json updated for the new
paths; docs/troubleshooting/sweeps.md and docs/dev/sweep-orchestrator.md
updated likewise.

11,736 unit tests pass; all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Move pareto-axes resolution and per-cell pareto projection out of
_sweep_aggregate.py into a dedicated _pareto.py:

  _resolve_pareto_axes    plugin-registry recipe lookup for pareto_axes
  _extract_axis_value     pull axis value from per-cell stats with fallbacks
  _aggregate_one_cell     project one variation's runs into a Pareto cell

_sweep_aggregate.py drops 775 -> 643 lines and now contains only the
per-variation + sweep-wide aggregation pipelines (no per-cell pareto
math). Lazy import inside _aggregate_one_cell avoids the cycle with
_sweep_aggregate's top-level _resolve_pareto_axes import.

External callers updated:
  - aiperf.orchestrator.orchestrator._fire_cell_callback
  - aiperf.cli_runner._sweep_table.SweepTableLogger
  - tests/unit/test_aggregate_one_cell.py
  - ruff_baseline.json (BLE001 entry path)

Also strip refactor-provenance docstrings and comments across the
codebase ("Lifted out of X", "Extracted from Y to keep that module
under the 500-line cap", "Factored out of Z so the helper exists",
etc.). The git history is the right place for that information; in
the code it's noise. Touched ~20 files in cli_runner/, search_recipes/,
config/, orchestrator/, plugin/.

11,736 unit tests pass; all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
`_invoke_callbacks` and `_print_aggregate_summary` were re-exported at
the package surface solely so tests could import them at
`aiperf.cli_runner.<name>`. They have no production callers outside the
package. Move both off the public surface; update the two test files
that imported them to read from `_callbacks` and `_aggregate` directly.

`_print_aggregate_summary` was also an unnecessary rename of
`print_aggregate_summary` from `_aggregate` (the underscore prefix was
left over from when everything lived in a single `cli_runner.py`); tests
now use the real name.

11,736 unit tests pass; all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Drive GPU telemetry collector setup through a shared candidate loop so DCGM and local collectors follow the same probe, baseline, and status flow while plugin metadata only handles config-time local classification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Three concrete fixes:

1. `__init__.py` `__all__` listed `_run_single_benchmark` and
   `_run_multi_benchmark` alongside public names. Underscore-prefix says
   "private" and __all__ says "public" — pick one. Both stay importable
   for tests; they're just no longer advertised as public API.

2. `_multi_run.py` had a `_ = (CompletedRun, OnComplete)` line that
   claimed to suppress an unused-import warning. Both names are actually
   used (CompletedRun constructor at line 104, OnComplete in two
   function signatures). The line was a leftover; remove. The trailing
   `__all__ = ["_run_multi_benchmark"]` also went — Python's default
   "names without leading underscore are public" makes it redundant here.

3. `__init__.py` docstring listed every helper submodule by name. Each
   module has its own docstring; this index was just maintenance
   burden. Trim to the actual public surface.

No behavior change; no test patches updated. 11,736 unit tests pass;
all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Three private helpers had Any-typed parameters where the call-site
type is statically known:

- _log_search_planner_active(search_planner: Any, logger: Any) in
  _banner.py becomes (SearchPlanner | None, AIPerfLogger). The caller
  in _multi_run._execute_multi_benchmark already passes those types.
- _print_metric_block(metric, ...) in _aggregate.py becomes
  (metric: Any, ...) so the function has a complete signature (the
  upstream AggregateResult.metrics dict is dict[str, Any], so Any is
  the honest type here).
- _aggregate_one_cell(cell_results: list[Any], plan: Any,
  variation: Any) in _pareto.py becomes (list[RunResult],
  BenchmarkPlan, SweepVariation). Both callers
  (orchestrator._fire_cell_callback and aggregate_sweep_and_export)
  already pass those types.

No behavior change; pure annotation tightening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Drop the underscore prefix on a function that is genuinely shared
across packages: cli_runner.run_benchmark, cli_commands.service, and
orchestrator.orchestrator (its own caller) all use it. Reaching into
another module for a leading-underscore "private" function is a smell;
the public name matches its actual usage.

Touches every call site (3 prod paths) plus 3 docstring/comment
references in tests and config/loader/plan.py.

No behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

@debermudez debermudez left a comment

Good to go once it's rebased on main. Very nice work!

ajcasagrande and others added 10 commits May 15, 2026 10:36
…ONL routing

- OTel: `--stream` now accepts a list; add `--otel-resource-attributes`
  with key=value parsing
- MLflow: rename to singular `--mlflow-tag` / `--mlflow-artifact-glob`;
  parse tags into a dict; surface schema-1.1 `count`/`sum` size fields
  in the exporter and the sweep aggregator
- Accuracy: `--accuracy-n-shots` becomes `int | None` (cap 32, defers to
  benchmark default); `--accuracy-enable-cot` becomes tri-state; tasks
  accept comma-separated lists
- Endpoint: prepend `http://` to schemeless URLs via AfterValidator;
  timeout now reads from `EndpointDefaults.TIMEOUT`
- Dataset: wire `DAG_JSONL` through the resolver and custom composer
  format map
- Server metrics: drop `parquet` from default formats
- Records pipeline: `OutputsJsonRecordProcessor` takes `run` (BenchmarkRun)
  instead of bare `cfg`; `RecordsManager` reads `self.run.cfg.otel`
- Schema: regenerate `aiperf-config.schema.json` with otel/mlflow
  sections and `image_edit` endpoint enum
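
A rough sketch of the key=value parsing mentioned for `--otel-resource-attributes` and `--mlflow-tag`. The helper name `parse_key_value_pairs` and its error handling are assumptions for illustration; the actual aiperf parser may behave differently:

```python
def parse_key_value_pairs(items: list[str]) -> dict[str, str]:
    """Parse ["k=v", ...] into a dict, splitting on the first '=' only."""
    parsed: dict[str, str] = {}
    for item in items:
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            # Assumed behavior: reject entries with no '=' or an empty key.
            raise ValueError(f"expected key=value, got {item!r}")
        parsed[key] = value.strip()
    return parsed


# Values may themselves contain '=' because only the first one splits.
tags = parse_key_value_pairs(["team=perf", "note=a=b"])
```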

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
The bulk of the v1 user_config.py changes from #941 (plugin-driven
local-collector keyword detection, validate_environment on collector
classes, generic conflict-detection wording) are already pre-migrated
into this branch's new src/aiperf/config/ package — see
GpuTelemetryConfig.validate_collector_compatibility() in
src/aiperf/config/gpu_telemetry.py and the warning hook in
src/aiperf/config/flags/_converter_telemetry.py.

Manual resolutions:
- src/aiperf/common/config/user_config.py: deleted (modify side from main
  is already represented in the new config/ package).
- src/aiperf/gpu_telemetry/manager.py: adopt main's _collector_candidates /
  _configure_reachable_collectors / _capture_collector_baseline plugin-
  dispatch design; keep branch's BenchmarkRun-driven constructor and
  gpu_telemetry_cfg.* access pattern; drop the legacy
  _configure_pynvml_collector / _configure_amdsmi_collector /
  _configure_dcgm_collectors helpers.
- tests/unit/common/config/test_user_config.py: take branch (legacy v1
  CLIConfig smoke tests only; UserConfig-validator tests from main no
  longer apply since the v1 module is gone).
- tests/unit/gpu_telemetry/test_telemetry_manager.py: keep both imports
  (make_run_from_cli from branch, mock_plugin from main).
- src/aiperf/plugin/schema/{schemas.py,plugins.schema.json,plugins.py}:
  keep branch's orchestrator metadata + main's GPUTelemetryCollectorMetadata
  side-by-side.
- tools/ergonomics_baseline.json: union of both file-size entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`normalize_http_url` checked only for ``://`` to decide whether to
prepend ``http://``, so a bare ``scheme:opaque`` form like
``javascript:alert(1)`` or ``data:text/plain;base64,xyz`` was
silently rewritten to ``http://javascript:alert(1)`` and then either:

  - rejected with the wrong message ("invalid port" instead of
    "missing scheme or host"), or
  - SILENTLY ACCEPTED when the opaque part happened to be all digits —
    ``javascript:1234`` became ``http://javascript:1234`` with
    host=``javascript``, port=1234, a real validation bypass.

Leave the URL alone when the colon prefix is a recognized foreign URI
scheme (javascript, data, file, ftp, ftps, sftp, ssh, gopher, ldap[s],
mailto, tel, vbscript, ws, wss) so the downstream EndpointConfig
validator can reject it as "missing scheme or host". ``localhost:8000``
and ``host:port`` shorthand still work because they don't match a
known foreign scheme.
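
The rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual `normalize_http_url` body: the function name `normalize_http_url_sketch` and the exact matching logic are hypothetical, and only the scheme list comes from this commit message:

```python
# Foreign URI schemes listed in the commit message above.
FOREIGN_SCHEMES = frozenset({
    "javascript", "data", "file", "ftp", "ftps", "sftp", "ssh", "gopher",
    "ldap", "ldaps", "mailto", "tel", "vbscript", "ws", "wss",
})


def normalize_http_url_sketch(url: str) -> str:
    """Prepend http:// to schemeless URLs, leaving foreign schemes untouched."""
    if "://" in url:
        return url  # an explicit scheme is already present
    prefix, sep, _rest = url.partition(":")
    if sep and prefix.lower() in FOREIGN_SCHEMES:
        return url  # leave as-is so the downstream validator can reject it
    return f"http://{url}"
```

With this shape, `javascript:1234` stays unchanged instead of becoming `http://javascript:1234`, while `localhost:8000` still gains the `http://` prefix.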

Fixes the canary test
``tests/unit/transports/test_build_url_adversarial.py::
test_endpoint_validator_rejects_garbage[javascript-scheme]``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The new DatasetResolver had two bugs that wedged sagemaker_data_capture
runs with --fixed-schedule before the loader ever ran:

- _check_timing_data only checked top-level `timestamp`/`delay` keys, so
  sagemaker records (timing under `eventMetadata.inferenceTime`) were
  flagged as having no timing data. Add a per-type branch and pass
  `dataset_type` through.
- _resolve_one skipped structural auto-detection whenever `ds.format`
  was truthy, but Pydantic defaults `format` to SINGLE_TURN. Result: the
  resolver pre-validated against SINGLE_TURN when the user relied on
  auto-detect. Use `model_fields_set` so detection runs unless the user
  explicitly set `format`, matching how the composer infers type at
  load time.
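
A hedged sketch of the per-type timing check described in the first bullet. The function name `has_timing_data` and the dataset-type string are assumptions; the real `_check_timing_data` may be structured differently:

```python
from typing import Any


def has_timing_data(record: dict[str, Any], dataset_type: str) -> bool:
    """Return True if the record carries timing info usable by a fixed schedule."""
    if dataset_type == "sagemaker_data_capture":
        # SageMaker capture records nest timing under eventMetadata.inferenceTime.
        metadata = record.get("eventMetadata")
        return isinstance(metadata, dict) and "inferenceTime" in metadata
    # Other dataset types are assumed to keep timing in top-level keys.
    return "timestamp" in record or "delay" in record
```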

Fixes the 5 sagemaker integration tests in
tests/integration/test_sagemaker_data_capture.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Ports tools/generate_config_schema.py from ajc/k8s-rework so the
hand-maintained aiperf-config.schema.json can be regenerated from the
Pydantic models. Adds `make generate-config-schema` /
`make check-config-schema` targets and a `generate-config-schema`
pre-commit hook that regenerates on AIPerfConfig changes.

Also folds in stray cleanups picked up while touching this area:
- CustomDatasetComposer._format_to_loader_type: replace hand-maintained
  dict with direct CustomDatasetType(fmt.value) — both enums mirror the
  custom_dataset_loader plugin registry and share string values.
- test_dag_timing_pathology: fix British→US spellings flagged by
  codespell.
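
For illustration, the value-based enum conversion looks roughly like this. Both enums below are hypothetical stand-ins with assumed string values; only the `CustomDatasetType(fmt.value)` pattern comes from the commit:

```python
from enum import Enum


class SketchFormat(str, Enum):
    """Hypothetical stand-in for the config-side format enum."""
    SINGLE_TURN = "single_turn"
    MULTI_TURN = "multi_turn"


class SketchLoaderType(str, Enum):
    """Hypothetical stand-in for CustomDatasetType."""
    SINGLE_TURN = "single_turn"
    MULTI_TURN = "multi_turn"


def to_loader_type(fmt: SketchFormat) -> SketchLoaderType:
    """Convert by shared string value instead of a hand-maintained dict."""
    return SketchLoaderType(fmt.value)
```

The lookup raises `ValueError` if the two enums ever drift apart, which surfaces a mismatch that a stale dict could hide.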

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Brings in 6 main PRs: #710 (--image-source flag, multimodal validator
hardening), #945 (--session-header), #884 (Agentic Code dataset docs),
#826 (UTF-8 for text/JSON file reads), #942 (per-request `extra`
payload), #943 (agentic coding dataset files).

The v1 ``src/aiperf/common/config/`` modules were deleted on this
branch as part of the config v2 restructure, so the corresponding
modify/delete conflicts keep the deletes and the relevant main-side
features are ported into the v2 surface:

- ``--image-source`` (PR #710): adds an ``ImageSource | Path`` field
  with ``BeforeValidator`` coercion to ``aiperf.config.dataset.content.ImageConfig``
  plus an ``images_enabled()`` helper. ``ImageGenerator`` dispatches per
  source mode (ASSETS, NOISE, custom Path) matching main's behavior.
- ``--session-header`` (PR #945): adds the field to ``EndpointConfig``,
  routes it through ``_converter_endpoint`` / ``_section_fields`` /
  ``CLIConfig`` so the flag round-trips into ``EndpointInfo.session_header``
  (already wired through ``base_transports``).

UTF-8 fix from PR #826 is applied to ``BaseFileLoader._iter_record_dicts``
so the encoding fix flows through every loader that uses the helper.

Test conflicts:
- ``tests/unit/common/config/test_user_config.py``: kept the v2-flavor
  smoke tests (the v1 alternate body tested deleted v1 modules and was
  not portable).
- ``tests/unit/dataset/generator/test_image_generator.py``: kept the v2
  ``make_image_config`` helper, added a ``source`` knob, aliased the v1
  ``ImageWidthConfig`` / ``ImageHeightConfig`` to ``NormalDistribution``
  so PR #710's new test classes (Noise mode, custom directory, disabled)
  read naturally, and added ``batch_size=1`` to the ImageConfig sites
  those tests construct directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d --image-source

Doc drift from prior commits on the branch; regenerated by the
generate-cli-docs pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
… env-var/Jinja in FixedDistribution shorthand

`${VAR}` whole-string substitutions now coerce to bool/int/float
using the same rules as Jinja, so `isl: ${AIPERF_TEST_ISL}` resolves
to a numeric distribution scalar. The schema generator emits matching
string-pattern branches under `FixedDistribution` so YAMLs using these
forms validate cleanly.
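
A minimal sketch of whole-string `${VAR}` coercion. The coercion rules here (case-insensitive `true`/`false`, then `int`, then `float`, else the raw string) are an assumption, since the commit only says they match Jinja; the helper names are hypothetical:

```python
import re

# Matches a value that consists of exactly one ${VAR} reference.
_WHOLE_VAR = re.compile(r"^\$\{([A-Za-z_][A-Za-z0-9_]*)\}$")


def coerce_scalar(text: str) -> object:
    """Coerce a string to bool, int, or float when it parses as one."""
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            continue
    return text


def resolve_whole_string_var(value: str, env: dict[str, str]) -> object:
    """Substitute and coerce only when the whole string is one ${VAR}."""
    match = _WHOLE_VAR.match(value)
    if match is None:
        return value  # partial substitutions are left to other logic
    # A missing variable raises KeyError in this sketch.
    return coerce_scalar(env[match.group(1)])


isl = resolve_whole_string_var("${AIPERF_TEST_ISL}", {"AIPERF_TEST_ISL": "512"})
```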

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
The strict-xfail sanity test for Responses-endpoint
`max_tokens=0` emission was a tripwire that no longer adds signal —
the None-check semantics are exercised by the surrounding positive
tests. Removing the inverted assertion to keep the suite focused on
direct behavioural assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ajcasagrande ajcasagrande enabled auto-merge (squash) May 15, 2026 19:43
`tests/unit/config/test_config_schema_generator_integration.py`
imports `jsonschema.Draft202012Validator` but the dep was never
declared — it was resolved transitively in local envs, so CI's
test-imports check fails with `ModuleNotFoundError: No module named
'jsonschema'`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@ajcasagrande ajcasagrande merged commit 94a9102 into main May 15, 2026
20 of 26 checks passed
@ajcasagrande ajcasagrande deleted the ajc/sweep-orchestrator-port branch May 15, 2026 19:49
saturley-hall added a commit that referenced this pull request May 19, 2026
PR #912 (commit 94a9102) rewrote tokenizer_validator.py and introduced
a ProcessPoolExecutor-based HF cache prefetch in validate_tokenizer_early.
The skip-prefetch gate ANDed HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE, but
the component_integration conftest only set HF_HUB_OFFLINE, so the gate
never engaged and the prefetch subprocesses ran. Those subprocesses
bypass the in-process Tokenizer.from_pretrained patch (subprocess
re-imports HF cleanly) and try to write to the real HF cache. In
restricted environments (Linux CI containers, sandboxes) or under
concurrent test execution that races on the cache directory, the write
fails with EPERM and aiperf aborts with "Configuration resolution failed:
[Errno 1] Operation not permitted" -- which surfaced as `request_count
== 0` in every component_integration test that runs `aiperf profile ...`
on Linux CI (ubuntu-latest and the new builder pool).

This has been the cause of run-unit-tests failures on main since
2026-05-15.

- src/aiperf/common/tokenizer_validator.py: change the skip-prefetch
  gate from AND to OR. Either env var being set is enough -- both mean
  "I have a warm cache, do not touch the network/disk." Requiring both
  was overly conservative and is what masked the test-harness bug.

- tests/component_integration/conftest.py: hf_offline_mode fixture now
  sets both HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE so the prefetch
  skip-gate engages even on older aiperf builds where the gate is still
  ANDed. The fixture restores the prior values of both on teardown.
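
The gate change can be sketched as below. How the real code decides that an env var is "set" is not stated here, so `_flag_set` and its accepted values are assumptions, as are the function names:

```python
def _flag_set(env: dict[str, str], name: str) -> bool:
    # Assumption: "1", "true", "yes", "on" count as set; the real parser may differ.
    return env.get(name, "").strip().lower() in {"1", "true", "yes", "on"}


def should_skip_prefetch_and(env: dict[str, str]) -> bool:
    """Old gate: both offline flags had to be set."""
    return _flag_set(env, "HF_HUB_OFFLINE") and _flag_set(env, "TRANSFORMERS_OFFLINE")


def should_skip_prefetch_or(env: dict[str, str]) -> bool:
    """New gate: either offline flag is enough to skip the prefetch."""
    return _flag_set(env, "HF_HUB_OFFLINE") or _flag_set(env, "TRANSFORMERS_OFFLINE")


# The conftest previously set only HF_HUB_OFFLINE, so the AND gate never engaged.
conftest_env = {"HF_HUB_OFFLINE": "1"}
old_skips = should_skip_prefetch_and(conftest_env)
new_skips = should_skip_prefetch_or(conftest_env)
```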

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
debermudez added a commit that referenced this pull request May 19, 2026
…ked (AIP-877)

Adds HellaSwag commonsense reasoning benchmark + the strict ExactMatchGrader
that DeepEval's ``Scorer.exact_match_score`` uses. Both align byte-for-
byte with the trt-llm benchmark recipe's DeepEval-backed configuration.

**HellaSwagBenchmark** (`src/aiperf/accuracy/benchmarks/hellaswag.py`)
- Loads ``Rowan/hellaswag``: validation split filtered per task by
  ``activity_label``, train split feeds the "one few-shot per unique
  activity_label" rule (mirrors ``deepeval.benchmarks.HellaSwag``'s
  ``categories_seen`` dedupe loop).
- Prompt rendering delegates to
  ``deepeval.benchmarks.hellaswag.template.HellaSwagTemplate.generate_output``
  — output is byte-equal to what the trt-llm recipe ships.
- Defaults: ``n_shots=10`` (DeepEval cap is 15), ``generation_size=5``,
  ``default_grader=exact_match``.
- Ground truth: bare ``A``/``B``/``C``/``D`` letter (DeepEval's
  convention for ``Scorer.exact_match_score``).
- ``_resolve_tasks`` matches activity labels case-insensitively via a
  lowercased-value map; falls back to upper-snake-case enum name
  (``HellaSwagTask.APPLYING_SUNSCREEN`` form) for the recipe's
  ``getattr(HellaSwagTask, name.upper(), None)`` parity.
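
A small sketch of the two-step task lookup in the last bullet, assuming a two-member stand-in enum (`SketchTask` is hypothetical and its member values are illustrative, not the full `HellaSwagTask`):

```python
from enum import Enum
from typing import Optional


class SketchTask(Enum):
    """Hypothetical two-member stand-in for HellaSwagTask."""
    APPLYING_SUNSCREEN = "Applying sunscreen"
    TRIMMING_BRANCHES = "Trimming branches or hedges"


# Map lowercased activity labels to enum members for case-insensitive matching.
_BY_LOWER_VALUE = {task.value.lower(): task for task in SketchTask}


def resolve_task(name: str) -> Optional[SketchTask]:
    """Match by activity label first, then fall back to the enum member name."""
    task = _BY_LOWER_VALUE.get(name.strip().lower())
    if task is not None:
        return task
    # Fallback mirrors the recipe's getattr(HellaSwagTask, name.upper(), None).
    return getattr(SketchTask, name.upper(), None)
```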

**ExactMatchGrader** (`src/aiperf/accuracy/graders/exact_match.py`)
- Strict ``pred.strip() == gold.strip()`` semantics matching DeepEval's
  ``Scorer.exact_match_score``: case-sensitive, no normalization, empty
  response → ``unparsed=True``.
- Used by HellaSwag and (in AIP-878) BigBench-Hard for reference parity.
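
The strict semantics can be sketched as follows. `SketchGrade` is a hypothetical result shape; the real grader's return type is not shown in this commit message:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchGrade:
    """Hypothetical result shape for illustration."""
    correct: bool
    unparsed: bool


def grade_exact_match(pred: str, gold: str) -> SketchGrade:
    """Strict equality after stripping whitespace; case-sensitive."""
    stripped = pred.strip()
    if not stripped:
        # An empty response is reported as unparsed rather than merely wrong.
        return SketchGrade(correct=False, unparsed=True)
    return SketchGrade(correct=stripped == gold.strip(), unparsed=False)
```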

**Plugin registration** (`src/aiperf/plugin/plugins.yaml`)
- ``hellaswag`` → ``default_grader: exact_match``, ``default_n_shots: 10``
  with the DeepEval-backed description.
- ``exact_match`` → strict-equality description; drops the
  ``is_implemented: false`` flag.

**Dependencies** (`pyproject.toml`)
- Adds ``deepeval>=2.9.0`` to the ``[accuracy]`` optional-dependency
  group. Aiperf calls DeepEval's bundled prompt template directly so
  the dep is required for HellaSwag (and BigBench-Hard in AIP-878).

**Tests** (`tests/unit/accuracy/`)
- ``test_hellaswag_benchmark.py``: ~22 tests covering DeepEval prompt
  byte-equality, the unique-activity-label shots set rule, validation
  filtering, task resolution (exact, lower, upper, mixed case), and
  pathological dataset rows (empty validation, unlabeled rows).
- ``test_exact_match_grader.py``: strict-equality semantics including
  the empty-response → ``unparsed=True`` path and case-sensitivity.
- ``test_accuracy_config.py``: drops ``hellaswag`` from
  ``STUB_BENCHMARKS`` and ``exact_match`` from ``STUB_GRADERS``; the
  uppercase-stub test now uses ``BIGBENCH`` (a still-stub name).

**Docs** (`docs/accuracy/`)
- ``accuracy-benchmarking.md``: add HellaSwag row to the benchmarks
  table.
- ``accuracy_stubs.md``: status summary + move HellaSwag from "Still
  Stubbed" to "Implemented"; move ExactMatchGrader to "Implemented".

**Constructor signature**
Loader + grader use the v2 ``BenchmarkRun`` API (post-#912 refactor on
main) rather than the legacy ``UserConfig`` shape — matches how
``MMLUBenchmark`` and ``AIMEBenchmark`` are wired on current main.

Validation:
- 70/70 accuracy tests pass (HellaSwag + ExactMatch + AccuracyConfig).
- Ruff format + ruff check clean on all modified Python files.
- Codespell clean (v2.4.2, matches CI).
- HellaSwag prompts verified byte-equal against
  ``HellaSwagTemplate.generate_output`` on synthetic fixtures.

Reference:
- ``deepeval/benchmarks/hellaswag/hellaswag.py``
- ``deepeval/benchmarks/hellaswag/template.py``
- ``deepeval/scorer/scorer.py:Scorer.exact_match_score``
- ``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:319-336``

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
debermudez added a commit that referenced this pull request May 22, 2026
…ked (AIP-877)

debermudez added a commit that referenced this pull request May 22, 2026
Implements the BigBench-Hard accuracy benchmark by delegating prompt
rendering to ``deepeval.benchmarks.BigBenchHard``'s
``BigBenchHardTemplate.generate_output``. Output is byte-equal to the
trt-llm benchmark recipe's DeepEval-backed configuration so reference
parity is preserved end-to-end. Pairs with the existing
``ExactMatchGrader`` (landed via AIP-877) for the recipe's strict
``Scorer.exact_match_score`` semantics.

Loader uses the new ``BenchmarkRun`` constructor signature introduced
by PR #912 (no ``UserConfig``), and the test fixture wires through the
``make_benchmark_run`` conftest helper. ``deepeval`` is already pinned
in the ``[accuracy]`` extras via AIP-877 — the test guards on
``pytest.importorskip("deepeval")`` so the suite still runs without
the optional install.

Drops ``bigbench`` from ``STUB_BENCHMARKS``, removes ``is_implemented:
false`` from the ``plugins.yaml`` entry, and updates the accuracy docs
to reflect the new implemented status. The uppercase-stub validator
test now exercises ``LCB_CODEGENERATION`` since ``BIGBENCH`` is no
longer stubbed.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
debermudez added a commit that referenced this pull request May 27, 2026
Implement ``AIME24Benchmark`` to mirror the trt-llm benchmark recipe's
``acc_bench_lighteval.py`` configuration for AIME 2024:

    aime24 = LightevalTaskConfig(
        name="aime24",
        prompt_function=aime_prompt_fn,
        hf_repo="HuggingFaceH4/aime_2024",
        evaluation_splits=["train"],
        few_shots_split=None,
        few_shots_select=None,
        generation_size=32768,
        metric=[expr_gold_metric],
    )

The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is
the bare problem text — lighteval's prompt manager wraps it as a
single user message with no instruction prefix and no few-shot
priming. The loader emits prompts the same way: one
``BenchmarkProblem`` per dataset row, ``prompt`` = the bare
``problem`` field, ``ground_truth`` = ``str(answer)``,
``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` /
``enable_cot`` arguments are accepted for protocol uniformity but
ignored (any of them changing the prompt would diverge from the
reference). Pair with ``LightevalExprGrader`` for the recipe's
``expr_gold_metric`` extraction.
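
A rough sketch of the row-to-problem mapping described above. `SketchProblem` is a hypothetical stand-in for `BenchmarkProblem`, and `rows_to_problems` is an illustrative name; only the field mapping and the 32768 generation size come from this commit message:

```python
from dataclasses import dataclass, field
from typing import Any

GENERATION_SIZE = 32768  # from the recipe's LightevalTaskConfig


@dataclass
class SketchProblem:
    """Hypothetical stand-in for BenchmarkProblem."""
    prompt: str
    ground_truth: str
    metadata: dict[str, Any] = field(default_factory=dict)


def rows_to_problems(rows: list[dict[str, Any]]) -> list[SketchProblem]:
    """One problem per row: bare problem text, stringified answer, no few-shots."""
    return [
        SketchProblem(
            prompt=row["problem"],
            ground_truth=str(row["answer"]),
            metadata={"generation_size": GENERATION_SIZE},
        )
        for row in rows
    ]


problems = rows_to_problems([{"problem": "Find x.", "answer": 204}])
```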

Built on the v2 ``BenchmarkRun`` API (post-PR-#912) and on the AIP-878
test harness conventions: ``make_benchmark_run`` for fixtures,
``BenchmarkProblem``-driven assertions, ``patch`` on
``aime24.load_dataset`` for deterministic rows. The loader has no
heavy optional dependency (``datasets`` is a core dep), so no
fake-harness is needed; CI gets 100% line + branch coverage out of
the box.

Updates the stub registry: drop ``aime24`` from
``test_accuracy_config.STUB_BENCHMARKS``, drop the ``is_implemented:
false`` flag from the ``aime24`` plugins.yaml entry, switch
``default_grader`` to ``lighteval_expr``, add an ``aime24`` row to
``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still
Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the
Status Summary, Method Count Summary, and Suggested Implementation
Order sections accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>