feat: YAML-native v2 config + adaptive sweep orchestrator with BO & search recipes by ajcasagrande · Pull Request #912 · ai-dynamo/aiperf

ajcasagrande · 2026-05-11T19:08:47Z

Summary

Sweep orchestrator port with a new pluggable executor (orchestrator/executor.py, local_executor.py) and a search_planner/ package: Bayesian (BoTorch GP/DSP kernel), Optuna (multi-objective), Monotonic, and Smooth-Isotonic planners, with shared helpers for cliff detection, margin normalization, replicate budget, pooled percentile, and SLA constraints. Aggregation gains SLA filtering and multi-objective search-history export.
YAML-driven AIPerf config (src/aiperf/config/) replaces the old common/config/ package. Layered design: typed models, loader (Jinja2 templating, env-var interpolation, dotted-path overrides, duration parsing, strict-undefined plan), flags converter (CLI ↔ YAML), resolution layer, sweep DSL (grid, QMC/Sobol, adaptive, multi-run, distributions), public JSON schema, plus a bundled library of 20+ ready-to-run templates and reference trace data.
Search recipes (src/aiperf/search_recipes/) with built-ins: max_concurrency_under_sla, max_goodput_under_slo, sla_breach_knee, itl_surface_fit, ttft_curve_fit, and Pareto sweep (axes, dominance, export, parser) plus post-process hooks.
CLI runner refactored into _cli_runner_{helpers,sweep_helpers,post_process}.py + _sweep_table_logger.py; new aiperf config command for template discovery and validation.
Auto-plot envelope (plot/auto_plot.py) materializes the resolved plot config into the artifact dir so aiperf plot <dir> reproduces. PlotEnvelopeConfig lets one YAML own its visualization.
Finite/NaN invariants: new common/finite.py (FiniteFloat, scrub_non_finite, nan_safe_mean/std, is_finite_value); property-test corpus with ratcheted baselines in tests/unit/property/.
Mock-server scheduler for deterministic latency simulation plus robustness/scheduler test suites.
Plugin schema extended with orchestrator categories (executor, planner, recipe).
Docs: new tutorials (sweeps, adaptive search, auto-plot, inline datasets, YAML config, YAML distributions), dev docs (sweep orchestrator design, global invariants, YAML config roadmap), sweeping reference (Bayesian optimization, search recipes, space-filling), troubleshooting/sweeps, and API reference for search history. Regenerated CLI and env-vars docs.
Tests: extensive new unit/component/integration coverage under tests/unit/{config,orchestrator,search_recipes,search_planner,cli_runner,property}/ and adversarial chaos scripts under tests/scripts/chaos/.
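
The finite/NaN invariants in the list above can be sketched as follows. The function names mirror those listed for `common/finite.py`, but the bodies and exact semantics are assumptions written for illustration, not the PR's implementation.

```python
from __future__ import annotations

import math
from typing import Any


def is_finite_value(value: Any) -> bool:
    """True for real numbers that are neither NaN nor +/-inf (assumed semantics)."""
    if isinstance(value, bool):
        return False  # assumption: booleans are not treated as metric values
    if isinstance(value, int):
        return True
    return isinstance(value, float) and math.isfinite(value)


def scrub_non_finite(obj: Any) -> Any:
    """Recursively replace NaN/inf floats with None so the result is strict-JSON safe."""
    if isinstance(obj, float) and not math.isfinite(obj):
        return None
    if isinstance(obj, dict):
        return {key: scrub_non_finite(val) for key, val in obj.items()}
    if isinstance(obj, (list, tuple)):
        return [scrub_non_finite(val) for val in obj]
    return obj


def nan_safe_mean(values: list[float]) -> float | None:
    """Mean over the finite entries only; None when nothing finite remains."""
    finite = [v for v in values if is_finite_value(v)]
    return sum(finite) / len(finite) if finite else None
```

Scrubbing at the export boundary, rather than at every call site, keeps intermediate math free to produce NaN while guaranteeing that written artifacts never contain it.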

766 files changed (+111246, -24748). Full design write-up at docs/dev/sweep-orchestrator.md.

Architecture

One pipeline at every cardinality

A single benchmark, a multi-run for confidence intervals, a grid/scenarios sweep, a Sobol/LHS characterization, an adaptive BO search, and (coming soon) a cluster-distributed BO search are all cardinalities of one pipeline. BenchmarkPlan describes what to run, MultiRunOrchestrator decides when and in what order, an optional SearchPlanner decides what to try next, and a RunExecutor decides how to actually run one cell.

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'htmlLabels': true}, 'themeVariables': {'fontSize': '14px'}}}%%
flowchart LR
    cfg["**config**<br/>AIPerfConfig"]
    exp["**expand**<br/>into N variations<br/>(BenchmarkPlan)"]
    run["**run**<br/>each variation<br/>M trials<br/>(via RunExecutor)"]
    agg["**aggregate**<br/>SweepAnalyzer<br/>-> sweep_aggregate/"]

    cfg --> exp --> run --> agg

    subgraph BACKENDS["RunExecutor backend (swap point)"]
        local["LocalSubprocessExecutor<br/>(today)"]
        k8s["K8sChildJobExecutor<br/><i>coming soon</i>"]
    end
    run -. selects .-> local
    run -. selects .-> k8s

    classDef stage fill:#e3f2fd,stroke:#1565c0,stroke-width:1.5px,color:#0d47a1
    classDef shipping fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px,color:#1b5e20
    classDef coming fill:#fafafa,stroke:#9e9e9e,stroke-width:1.5px,stroke-dasharray:5 3,color:#616161

    class cfg,exp,run,agg stage
    class local shipping
    class k8s coming

    style BACKENDS fill:transparent,stroke:#78909c,stroke-width:2px,stroke-dasharray:2 2
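
The division of labor in the diagram above (plan, run, swap-point executor) can be sketched with a small protocol. `BenchmarkPlan`, `RunExecutor`, and `RunResult` are names from this PR, but the fields, the `run` signature, and `EchoExecutor` are hypothetical and only illustrate the shape.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class RunResult:
    label: str
    success: bool


@dataclass
class BenchmarkPlan:
    # Hypothetical shape: one dict per variation, plus trials per variation.
    configs: list[dict] = field(default_factory=list)
    trials: int = 1


class RunExecutor(Protocol):
    """Swap point: anything with this method can run a single cell."""

    def run(self, config: dict, trial: int) -> RunResult: ...


class EchoExecutor:
    """Stand-in backend; a real one would launch a subprocess or a cluster job."""

    def run(self, config: dict, trial: int) -> RunResult:
        return RunResult(label=f"c{config['concurrency']}-t{trial}", success=True)


def run_plan(plan: BenchmarkPlan, executor: RunExecutor) -> list[RunResult]:
    """expand -> run: every variation, M trials each, via the chosen executor."""
    return [executor.run(cfg, t) for cfg in plan.configs for t in range(plan.trials)]


plan = BenchmarkPlan(configs=[{"concurrency": 1}, {"concurrency": 8}], trials=2)
results = run_plan(plan, EchoExecutor())
```

Because the loop depends only on the protocol, moving from a local backend to a cluster backend would not change the code that walks the plan.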

Search recipes → AdaptiveSearchSweep → planner

A user can author an AdaptiveSearchSweep directly under sweep: (low level) or pick a search_recipe plugin (high level) that builds one from a recipe + the user's existing benchmark config.

%%{init: {'flowchart': {'nodeSpacing': 50, 'rankSpacing': 60, 'htmlLabels': true}, 'themeVariables': {'fontSize': '14px'}}}%%
flowchart TB
    subgraph IN["user inputs"]
        cli["**--search-recipe NAME --param k=v**<br/>or<br/>**sweep: { type: adaptive_search, … }** in YAML<br/>or<br/>--search-space PATH:LO,HI[:KIND]<br/>--search-metric METRIC<br/>--search-stat STAT<br/>--search-direction DIRECTION<br/>--search-sla metric:stat:op:threshold (×N)"]
        uc["**AIPerfConfig.benchmark**<br/><i>(models, endpoint, phases, …)</i>"]
    end

    subgraph RECIPE["recipe layer (optional)"]
        ctx["**SearchRecipeContext**<br/><i>(benchmark_config, sla_targets,<br/>sweep_overrides)</i>"]
        rc["**SearchRecipe** plugin (Protocol)<br/><i>built-ins:</i><br/>max-throughput-ttft-sla<br/>max-throughput-itl-sla<br/>concurrency-ramp<br/>prefill-ttft-curve / decode-itl-curve<br/>max-goodput-under-slo<br/>max-concurrency-under-sla<br/>pareto-sweep"]
        out["**SearchRecipeOutput**<br/><i>(exactly one of:<br/>adaptive_search | sweep_parameters | scenarios)</i><br/>+ sla_filters, slos, post_process"]
    end

    subgraph CFG["adaptive sweep variant"]
        asc["**AdaptiveSearchSweep**<br/><i>(SweepConfig variant,<br/>type=adaptive_search)</i><br/>search_space, objectives,<br/>max_iterations, sla_filters,<br/>post_process, planner, …"]
    end

    subgraph DRIVE["runtime drivers"]
        plan["**AIPerfConfig.sweep**<br/>= AdaptiveSearchSweep"]
        plan2["**BenchmarkPlan.sweep**<br/><i>(is_adaptive_search is true)</i>"]
        sp["**SearchPlanner** plugin<br/><i>(BayesianSearchPlanner |<br/>MonotonicSLASearchPlanner |<br/>SmoothIsotonicSLAPlanner |<br/>OptunaSearchPlanner)</i>"]
        pph["**search_recipe_post_process** plugin<br/><i>(degradation_knee_detect, ttft_curve_fit,<br/>itl_surface_fit, sla_breach_knee,<br/>pareto_sweep_export)</i>"]
    end

    cli --> rc
    uc --> ctx --> rc --> out --> asc
    cli -.direct path.-> asc

    asc --> plan
    plan --> plan2
    plan2 --> sp
    plan2 --> pph

    classDef data fill:#e3f2fd,stroke:#1565c0,stroke-width:1.5px,color:#0d47a1
    classDef proc fill:#fff3e0,stroke:#e65100,stroke-width:1.5px,color:#bf360c
    classDef decision fill:#f3e5f5,stroke:#6a1b9a,stroke-width:1.5px,color:#4a148c
    classDef art fill:#e8f5e9,stroke:#2e7d32,stroke-width:1.5px,color:#1b5e20

    class cli,uc data
    class ctx,rc proc
    class out,asc art
    class plan,plan2,sp,pph decision
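
As a rough illustration of the direct-path flags shown in the diagram, the sketch below parses the `PATH:LO,HI[:KIND]` and `metric:stat:op:threshold` shapes. The dataclasses, the function names, the assumption that `PATH` contains no colon, and the example values are all hypothetical; the PR's real parser may differ.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class SearchDimension:
    path: str
    low: float
    high: float
    kind: str | None  # None when the optional KIND suffix is omitted


@dataclass(frozen=True)
class SLATarget:
    metric: str
    stat: str
    op: str
    threshold: float


def parse_search_space(spec: str) -> SearchDimension:
    """Parse PATH:LO,HI[:KIND], assuming PATH itself contains no colon."""
    parts = spec.split(":")
    if len(parts) not in (2, 3):
        raise ValueError(f"expected PATH:LO,HI[:KIND], got {spec!r}")
    lo_text, sep, hi_text = parts[1].partition(",")
    if not sep:
        raise ValueError(f"expected LO,HI bounds, got {parts[1]!r}")
    low, high = float(lo_text), float(hi_text)
    if not low < high:
        raise ValueError(f"LO must be below HI, got {low} and {high}")
    kind = parts[2] if len(parts) == 3 else None
    return SearchDimension(parts[0], low, high, kind)


def parse_search_sla(spec: str) -> SLATarget:
    """Parse metric:stat:op:threshold; the operator is passed through unvalidated."""
    parts = spec.split(":")
    if len(parts) != 4:
        raise ValueError(f"expected metric:stat:op:threshold, got {spec!r}")
    metric, stat, op, threshold = parts
    return SLATarget(metric, stat, op, float(threshold))


# Made-up example values, only to show the shapes.
dimension = parse_search_space("concurrency:1,256:int")
target = parse_search_sla("time_to_first_token:p99:<=:500")
```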

Adaptive search — `propose → execute → record` loop

The BO outer loop is a propose -> execute -> record cycle inside MultiRunOrchestrator.execute_adaptive_search. BenchmarkRun and RunExecutor are unchanged from the grid path; the difference is that BenchmarkPlan.configs starts with one seed config and grows by one per iteration as the planner asks for the next point.

%%{init: {'sequence': {'mirrorActors': false}, 'themeVariables': {'fontSize': '14px'}}}%%
sequenceDiagram
    autonumber
    participant Plan as BenchmarkPlan<br/>(sweep is AdaptiveSearchSweep)
    participant Orch as MultiRunOrchestrator
    participant Pl as SearchPlanner<br/>(Bayesian / Monotonic / Optuna)
    participant Run as BenchmarkRun
    participant Exec as RunExecutor
    participant Res as RunResult
    participant PP as PostProcessHandler
    participant Out as search_history.json /<br/>sweep_aggregate

    Orch->>Pl: planner instantiated upstream<br/>(via _build_search_planner)<br/>and passed into execute

    loop until converged or max_iterations
        Orch->>Pl: ask
        Pl-->>Orch: (BenchmarkConfig_k, SweepVariation_k)<br/>or None (converged -> convergence_reason)
        alt got proposal
            Orch->>Orch: _run_independent_cell<br/>(fresh ExecutionStrategy per cell)
            loop trials inner (until strategy says stop)
                Orch->>Run: BenchmarkRun for cfg_k, variation_k, trial t, …
                Orch->>Exec: run the cell
                Exec-->>Res: RunResult
            end
            Orch->>Pl: tell with variation_k, cell_results
            Pl->>Pl: filter by SLAFilter,<br/>compute objective scalar,<br/>plateau / patience / max-iter check
            Orch-->>Out: write_search_history<br/>(incremental, includes<br/>boundary_summary if planner has it)
        end
    end

    Orch->>PP: process the sweep_aggregate with params<br/>(per PostProcessSpec on sweep)
    PP-->>Out: knees, curve fits, …
    Orch-->>Out: profile_export_aiperf_sweep.{json,csv}
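
The ask/tell cycle in the sequence diagram can be sketched with a toy planner. This is not the Bayesian planner: it is a plain bisection stand-in that assumes pass/fail is monotonic in the searched parameter. `ask`, `tell`, `iter_count`, and `convergence_reason` follow names used in this PR; everything else is hypothetical.

```python
from __future__ import annotations

from typing import Callable


class BisectionPlanner:
    """Toy planner: finds the largest integer that still passes an SLA check."""

    def __init__(self, low: int, high: int, max_iterations: int = 20) -> None:
        # Invariant (assumed by the caller): `low` passes and `high` fails.
        self.low, self.high = low, high
        self.max_iterations = max_iterations
        self.iter_count = 0
        self.convergence_reason: str | None = None

    def ask(self) -> int | None:
        """Propose the next point, or None once the search has converged."""
        if self.high - self.low <= 1:
            self.convergence_reason = "boundary_found"
            return None
        if self.iter_count >= self.max_iterations:
            self.convergence_reason = "max_iterations"
            return None
        return (self.low + self.high) // 2

    def tell(self, point: int, passed: bool) -> None:
        """Record the outcome and shrink the bracket around the boundary."""
        self.iter_count += 1
        if passed:
            self.low = point
        else:
            self.high = point


def run_adaptive_search(planner: BisectionPlanner, execute: Callable[[int], bool]) -> list[dict]:
    """propose -> execute -> record, until the planner stops proposing."""
    history: list[dict] = []
    while (point := planner.ask()) is not None:
        passed = execute(point)
        planner.tell(point, passed)
        history.append({"iteration": planner.iter_count, "point": point, "passed": passed})
    return history


planner = BisectionPlanner(low=0, high=257)
history = run_adaptive_search(planner, lambda concurrency: concurrency <= 100)
```

Appending to `history` inside the loop mirrors the incremental history write in the diagram: a cancelled search still leaves a record of every completed iteration.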

Test plan

🤖 Generated with Claude Code

github-actions · 2026-05-11T19:09:00Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c357b7c48b1ec429317782209820491f88ecf396

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@c357b7c48b1ec429317782209820491f88ecf396

Last updated for commit: c357b7c

github-actions · 2026-05-11T19:09:21Z

Fern Docs Preview: https://nvidia-preview-fa7f7392-df72-4046-97b6-fdca492f9fef.docs.buildwithfern.com/aiperf/dev

…g, search recipes

Major reorganization that ports the parameter-sweep orchestrator into a first-class subsystem, introduces a full YAML-driven AIPerf configuration language alongside the existing CLI, and ships a library of reusable "search recipes" plus a Bayesian/adaptive search-planner stack.

Orchestrator
- New executor abstraction (`orchestrator/executor.py`, `local_executor.py`) decouples per-cell run execution from the orchestrator loop.
- New `search_planner/` package with pluggable planners:
  - `bayesian.py` + `_botorch_kernel.py` (BoTorch GP / DSP kernel)
  - `optuna_planner.py` + `_optuna_helpers.py` (multi-objective via Optuna)
  - `monotonic.py` + `_monotonic_boundary.py`
  - `smooth_isotonic.py` + `_smooth_isotonic_{fit,boundary,phases}.py`
  - Shared helpers: cliff detection, margin normalization, replicate budget, pooled percentile, SLA helpers, outcome constraints.
- Sweep aggregation grows `sweep_sla_filter.py` and multi-objective search-history export (`exporters/search_history.py`).
- New convergence strategy hooks, JSONL loader, subprocess runner refactors, and per-cell callbacks.

YAML configuration (`src/aiperf/config/`)
- Replaces the old `src/aiperf/common/config/` package with a layered design: typed config models, loader (Jinja2 templating, env-var interpolation, dotted-path overrides, duration parsing, normalizers, strict-undefined plan), flags converter (CLI <-> YAML), resolution layer (predicates + resolvers + plan), sweep DSL (grid, QMC/Sobol, adaptive, multi-run, sampling, distributions), and a public JSON schema (`config/schema/aiperf-config.schema.json`).
- Bundled template library (`config/templates/`): 20+ ready-to-run YAMLs covering minimal, latency, goodput SLO, long context, ramping, multi-turn, multimodal vision/audio, embeddings, fixed schedule, trace replay, KV cache test, multi-URL load balancing, sweep with plot, sweep distributions, warmup profiling, request cancellation, Jinja2 variables, env-var production, inline dataset, scenario workload profiles, GPU telemetry, HTTP trace metrics, user files, speed bench sweep, plus reference trace JSONL data.
- Communications config split into `comm/` (TCP, IPC, dual-bind, build).
- Dataset config split into `dataset/` (content, resolver, trace, video) with inline-record support.

Search recipes (`src/aiperf/search_recipes/`)
- New recipe registry with built-ins: `max_concurrency_under_sla`, `max_goodput_under_slo`, `sla_breach_knee`, `itl_surface_fit`, `ttft_curve_fit`, and Pareto sweep (axes, dominance, export, parser).
- Recipe post-process hooks with shared infrastructure.

CLI / runner
- `cli_runner.py` factored into `_cli_runner_helpers.py`, `_cli_runner_sweep_helpers.py`, `_cli_runner_post_process.py`, plus `_sweep_table_logger.py` for live progress rendering.
- New `aiperf config` command for template discovery and validation.
- Profile/plot/service commands updated for the new config layer.

Auto-plot envelope
- `plot/auto_plot.py` materializes a resolved plot envelope into the artifact dir so `aiperf plot <dir>` reproduces the chart pipeline.
- `PlotEnvelopeConfig` allows a single AIPerf YAML to own its visualization.

Finite / numeric invariants
- New `common/finite.py` with `FiniteFloat`, `scrub_non_finite`, `nan_safe_mean`, `nan_safe_std`, `is_finite_value`.
- Property test corpus (`tests/unit/property/`) with field/bounds baselines, finite invariants, Pydantic field fuzz, and config-dump round-trip checks. CI ratchets these to zero.

Metrics
- New `good_request_fraction_metric` for goodput SLO recipes.
- Records, exporters, and post-processors threaded with redaction and finite-value scrubbing on export boundaries.

Mock server (`tests/aiperf_mock_server/`)
- New deterministic `scheduler.py` for repeatable latency simulation.
- Robustness and scheduler test suites for CI coverage.

Plugins
- New orchestrator plugin categories and schema (`plugin/schema/_orchestrator_schemas.py`, `plugin/categories.yaml` updates) for executors, planners, recipes.

Testing
- Extensive new unit/component/integration suites under `tests/unit/{config,orchestrator,search_recipes,search_planner,cli_runner,property}/` and adversarial chaos scripts under `tests/scripts/chaos/`.
- New component-integration smoke tests for Sobol sweeps, multi-objective E2E, process-title, and recipe collapse-knee.

Docs
- New tutorials: sweeps, adaptive search, auto-plot, inline datasets, YAML config, YAML distributions.
- New developer docs: sweep orchestrator design, global invariants, YAML config roadmap and future goals.
- New sweeping reference: bayesian optimization, search recipes, space-filling designs; new troubleshooting/sweeps guide.
- New API reference: search history.
- Regenerated `docs/cli-options.md` and `docs/environment-variables.md`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Three issues caused 'Validate (and publish on main) synced docs' to fail:

1. `environment-variables.md:148` had a literal `{"tight": 20000}` JSON example in prose. MDX parses `{...}` as a JSX expression and `"tight": 20000` is not a valid JS expression. Fixed by wrapping the override example in inline code (backticks) in the source description string in `common/environment.py`, which auto-generates the doc.
2. `dev/sweep-orchestrator.md:443` had a literal `<= max_iter` in a table cell. MDX sees `<` as the start of a JSX element and chokes on `=`. Replaced with the Unicode `≤` (already used elsewhere in the same doc for `×`).
3. Three relative `../../src/...` links to source files broke under `--strict-broken-links` because Fern publishes from `fern/pages-dev/`, which has no `../../src/` parent. Converted to absolute `github.com/ai-dynamo/aiperf/blob/main/...` URLs, matching the pattern already used in `docs/api/search-history.md` and `docs/reproducibility.md`.

Verified locally by cloning the `docs-website` branch, syncing the PR's `docs/` into `fern/pages-dev/`, running `md_to_mdx.py`, and getting `fern check --warnings --strict-broken-links` to 0 errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

CI pre-commit `test-imports` hook was failing with:

    aiperf.orchestrator.search_planner._botorch_kernel: ModuleNotFoundError("No module named 'torch'")

torch/gpytorch live behind the `[optuna]` extra and are not installed in the lint/pre-commit env. The module was importing them at top level even though only `make_dsp_kernel` uses them. Moved the imports inside the function and kept a TYPE_CHECKING import for the `ScaleKernel` return annotation. Module now imports cleanly without the extra; calling `make_dsp_kernel` without it still raises a clear ModuleNotFoundError, matching the existing optuna-gated UX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
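
The deferred-import pattern described in this commit looks roughly like the sketch below. The function name `make_dsp_kernel_sketch` and the choice of a Matérn base kernel are assumptions for illustration; only the import placement reflects what the commit message describes.

```python
from __future__ import annotations

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; never executed at runtime, so importing this
    # module does not require torch or gpytorch to be installed.
    from gpytorch.kernels import ScaleKernel


def make_dsp_kernel_sketch() -> ScaleKernel:
    """Build a kernel, importing the heavy optional dependency only when called."""
    # Deferred import: raises ModuleNotFoundError here, at call time, when the
    # optional extra is missing. The base kernel is a hypothetical choice.
    from gpytorch.kernels import MaternKernel, ScaleKernel

    return ScaleKernel(MaternKernel(nu=2.5))
```

With `from __future__ import annotations`, the return annotation stays a string at runtime, which is why the `TYPE_CHECKING` import is enough for the signature.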

…tests `tests/unit/config/test_v1_file_dataset_rejections.py` hardcoded ``input_file="/tmp/mc.jsonl"`` with a comment claiming "path doesn't have to exist for converter". That's false: ``CLIConfig.input_file`` runs the ``parse_file`` validator in ``src/aiperf/config/loader/parsing.py`` which requires the path to be an existing file or directory. The tests only passed on dev machines where ``/tmp/mc.jsonl`` happened to exist from prior runs; CI fails all 7 cases with ``ValidationError: '/tmp/mc.jsonl' is not a valid file or directory``. Switched to a per-test ``mc_jsonl`` fixture that creates an empty JSONL under pytest's ``tmp_path``. The converter only reads the *path*, not the file contents, so an empty file is sufficient. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
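
A minimal sketch of the failure mode and the fix, using only the standard library. `parse_file` here is a stand-in with assumed behavior (reject paths that do not exist), and `make_empty_jsonl` plays the role that a pytest fixture built on `tmp_path` plays in the real tests.

```python
import tempfile
from pathlib import Path


def parse_file(value: str) -> Path:
    """Stand-in validator: reject anything that is not an existing file or directory."""
    path = Path(value)
    if not (path.is_file() or path.is_dir()):
        raise ValueError(f"'{value}' is not a valid file or directory")
    return path


def make_empty_jsonl(directory: Path) -> Path:
    """Create an empty JSONL file; only the path matters, not the contents."""
    target = directory / "mc.jsonl"
    target.touch()
    return target


with tempfile.TemporaryDirectory() as tmp:
    # A file created per test always exists, regardless of the machine.
    validated_name = parse_file(str(make_empty_jsonl(Path(tmp)))).name
    # A hardcoded path that happens not to exist fails validation.
    try:
        parse_file(str(Path(tmp) / "missing.jsonl"))
        rejected = False
    except ValueError:
        rejected = True
```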

Keep the four agent instruction files synchronized while moving detailed finite-value guidance to the canonical docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

debermudez

This is a thorough, well-structured architectural overhaul. The new MultiRunOrchestrator, v2 config system, and adaptive search planner are solid. The issues below are mostly low-severity cleanup items, with one medium-priority data-loss risk in the adaptive cancel path.

13 findings: 0 critical, 0 high, 1 medium, 9 low, 3 nit.

Merge the latest mainline changes into the sweep orchestrator branch while keeping the branch's config-v2 model authoritative. OTel and MLflow now live as first-class benchmark config groups with runtime call sites using native nested access only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Drop the BenchmarkConfig flat-forwarding properties so call sites read cfg.artifacts, cfg.mlflow, cfg.otel, and run.benchmark_id directly. Tests build real BenchmarkConfig / BenchmarkRun instead of CLIConfig DTO mocks. The OTel fanout subprocess now consumes the native MLflowConfig. Split src/aiperf/config/artifacts.py into one section per file matching the benchmark fields: mlflow, otel, server_metrics, gpu_telemetry. Same for TokenizerConfig, LoggingConfig, and SLOsConfig out of models.py and runtime.py. External imports through aiperf.config are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

…ugin metadata

Replace the three hard-coded dicts in user_config.py (_LOCAL_COLLECTOR_KEYWORDS, _LOCAL_ONLY_COLLECTORS, _LOCAL_COLLECTOR_INSTALL_HINTS) with a typed metadata schema on the gpu_telemetry_collector plugin category. Adding a new local collector now only requires editing plugins.yaml: `is_local: true` plus an `install_hint`. Module name defaults to the plugin name when `import_module` is unset.

- New `GPUTelemetryCollectorMetadata` Pydantic class (is_local, import_module, install_hint) wired via `metadata_class` in categories.yaml.
- `get_gpu_telemetry_collector_metadata` helper + `_CATEGORY_METADATA_CLASSES` registration in plugin/plugins.py, matching the existing get_endpoint/plot/service_metadata pattern.
- Metadata populated for pynvml and amdsmi (dcgm relies on `is_local: false` default).
- user_config helpers `_local_collector_keywords`, `_is_local_collector`, and `_ensure_local_collector_importable` consult plugin metadata. The "Invalid GPU telemetry item" error message self-derives from the keywords dict, so a new local collector flows through with no edits to error text.
- New test `test_local_collector_discovered_dynamically_from_plugin_metadata` registers a fake collector via `mock_plugin` (and a matching enum extension) to prove selection, conflict detection, local-vs-URL guardrail, and error message derivation all generalize beyond pynvml/amdsmi.

Behavior preserved: every existing test passes because `str(GPUTelemetryCollectorType.PYNVML) == "pynvml"`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
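
The metadata-driven classification can be sketched as below. A frozen dataclass stands in for the Pydantic class so the example has no third-party dependency; the registry contents and install hints are hypothetical placeholders, since the real values live in plugins.yaml.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CollectorMetadata:
    """Stand-in for the metadata class; field names follow the commit message."""

    is_local: bool = False
    import_module: str | None = None
    install_hint: str | None = None


# Hypothetical registry contents, for illustration only.
REGISTRY: dict[str, CollectorMetadata] = {
    "pynvml": CollectorMetadata(is_local=True, install_hint="(hint for pynvml)"),
    "amdsmi": CollectorMetadata(is_local=True, install_hint="(hint for amdsmi)"),
    "dcgm": CollectorMetadata(),  # relies on the is_local=False default
}


def local_collector_keywords(registry: dict[str, CollectorMetadata]) -> list[str]:
    """Derive the local-collector keywords from metadata instead of a hard-coded dict."""
    return sorted(name for name, meta in registry.items() if meta.is_local)


def module_name(name: str, registry: dict[str, CollectorMetadata]) -> str:
    """The module to import defaults to the plugin name when import_module is unset."""
    return registry[name].import_module or name
```

Adding a new local collector in this sketch is a one-line registry change, which is the property the commit aims for.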

Move each *Defaults dataclass next to the config section it belongs to: EndpointDefaults, OutputDefaults, MLflowDefaults, TokenizerDefaults, ServiceDefaults into endpoint/artifacts/mlflow/tokenizer/runtime modules. Dataset and prompt modality defaults move to dataset/defaults.py. The aggregator src/aiperf/config/defaults.py is deleted. External imports through aiperf.config are unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Drive local-collector classification from plugin metadata (GPUTelemetryCollectorMetadata) so adding a new local collector only requires editing plugins.yaml. The original PR refactored the v1 aiperf.common.config.user_config module that no longer exists on this branch; the equivalent metadata-driven validation now lives on GpuTelemetryConfig (local-vs-URL guardrail and install-hint surfacing) and the CLI converter derives local keywords from plugin metadata. The AMD ROCm amdsmi collector ships with this merge. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Medium:
- execute_adaptive_search: cancel path now writes search_history.json before returning (previously left stale history after cancellation). Factored a `_flush_history` closure to share the write across the cancel, converged, and per-iteration paths.

Low / nit cleanup:
- Add `SearchPlanner.iter_count` property; orchestrator no longer reads the private `planner._iter`.
- Remove dead `_build_convergence_criterion` copy in orchestrator.py (canonical version lives in `_cli_runner_helpers`).
- Drop the duplicate `_plan_iteration_order` in orchestrator.py; import the canonical one from `_cli_runner_sweep_helpers`.
- Remove dead `LocalSubprocessExecutor._write_redacted_config` (`EndpointConfig.api_key` field_serializer already redacts).
- Replace `# noqa: ANN202` placeholders with concrete return types on `_build_convergence_criterion`, `_build_search_planner`, `_maybe_compute_detailed`, and `_setup_ui_queues`.
- `_log_failed_sweep_variations`: extract a single `_format_key` helper and stop double-formatting the key string between the warning summary and the per-run loop.
- `_summarize_and_export`: run per-variation and sweep-aggregate exports concurrently under one `asyncio.run(gather(...))` instead of two sequential `asyncio.run` calls.
- Distinguish "1 successful run" vs "1 variation succeeded" wording in the aggregate-summary warning so sweep users see accurate language.
- Collapse `FixedTrialsStrategy._sanitize_label` into the module-level helper it duplicated verbatim.
- Remove the unused `_scrub_non_finite` shim in `_cli_runner_post_process`; the one test caller now imports `scrub_non_finite` from `aiperf.common.finite`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
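
The "one `asyncio.run` around a gather" change might look like the following sketch. The coroutine names, the output paths, and the `sleep(0)` placeholders are hypothetical. Note that `asyncio.run` expects a coroutine, so the `gather` call is wrapped in a small `async def` here.

```python
import asyncio


async def export_per_variation(variations: list[str]) -> list[str]:
    await asyncio.sleep(0)  # placeholder for real file I/O
    return [f"{name}/aggregate.json" for name in variations]


async def export_sweep_aggregate() -> str:
    await asyncio.sleep(0)  # placeholder for real file I/O
    return "sweep_aggregate/summary.json"


async def _export_all(variations: list[str]) -> list:
    # Both exports are scheduled on the same event loop and awaited together.
    return await asyncio.gather(
        export_per_variation(variations),
        export_sweep_aggregate(),
    )


def summarize_and_export(variations: list[str]) -> tuple[list[str], str]:
    """One event loop for both exports instead of two sequential asyncio.run calls."""
    per_variation, sweep = asyncio.run(_export_all(variations))
    return per_variation, sweep
```

`gather` returns results in the order the awaitables were passed, so unpacking into `per_variation, sweep` is safe regardless of which export finishes first.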

…kage

Reviewer feedback flagged the leading-underscore-at-package-root layout as unusual. Convert `cli_runner.py` into a package and move the three `_cli_runner_*` helpers plus `_sweep_table_logger.py` in as private submodules. No behavior change; pure relocation + import rewrites.

Mapping:
- src/aiperf/cli_runner.py -> src/aiperf/cli_runner/__init__.py
- src/aiperf/_cli_runner_helpers.py -> src/aiperf/cli_runner/_helpers.py
- src/aiperf/_cli_runner_post_process.py -> src/aiperf/cli_runner/_post_process.py
- src/aiperf/_cli_runner_sweep_helpers.py -> src/aiperf/cli_runner/_sweep_helpers.py
- src/aiperf/_sweep_table_logger.py -> src/aiperf/cli_runner/_sweep_table_logger.py

All imports in src/, tests/, docs/, and the ruff/ergonomics baselines updated to match. Pre-commit and the full 11,736-test unit suite pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

`cli_runner/__init__.py` was 790 lines (grandfathered above the 500-line ergonomics cap). Split the obvious independent concerns into siblings; __init__.py keeps the orchestrating entry points + the helpers that need to share the module namespace with test mock targets.

New layout:
- _callbacks.py: CompletedRun, OnComplete, _invoke_callbacks
- _preflight.py: _preflight_artifact_dir/_fd_limit/_endpoint_ready
- _process_setup.py: mp start method, log queue, FD_CLOEXEC, tokenizer preload
- _single_run.py: _run_single_benchmark
- _failure_summary.py: _log_failed_sweep_variations
- __init__.py: run_benchmark + multi-run orchestration + helpers

__init__.py drops 790 -> 409 lines, falling out of the ergonomics baseline. All eleven `patch("aiperf.cli_runner.<name>")` targets remain reachable through __init__.py imports. test_cli_runner_macos.py imports the _process_setup helpers from their new path directly. 11,736 unit tests pass; all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Previous layout was organized by 'what kind of helper' (preflight, callbacks, process-setup) with two grab-bag files (_helpers.py, _sweep_helpers.py) and an __init__.py that mixed dispatch with multi-run orchestration. This pass organizes by the package's three real domains: execution, aggregation, display.

Deleted:
- _helpers.py: distributed across _strategy/_aggregate/_banner
- _failure_summary.py: folded into _multi_run.py (its only caller)

Renamed:
- _sweep_helpers.py -> _sweep_aggregate.py
- _sweep_table_logger.py -> _sweep_table.py

Created:
- _strategy.py: build_strategy, _build_convergence_criterion, _build_search_planner, validate_convergence_config
- _aggregate.py: aggregate_and_export, print_aggregate_summary, _maybe_compute_detailed, priority-metric block printers
- _banner.py: log_multi_run_banner, _log_search_planner_active
- _multi_run.py: _run_multi_benchmark + _execute_multi_benchmark + _summarize_and_export + _estimate_and_log_duration + _validate_multi_benchmark_plan + _reject_in_process_sweep_under_operator + _log_failed_sweep_variations (was _failure_summary)

__init__.py drops 415 -> 120 lines and now contains only the public surface (run_benchmark, CompletedRun, OnComplete, _make_benchmark_run) plus re-imports of the run_benchmark-layer patch targets (_preflight_*, _run_single_benchmark, _run_multi_benchmark) so existing dispatch tests keep working without touching their patches.

Test patches for multi-run internals (aggregate_and_export, _estimate_and_log_duration, _summarize_and_export, build_strategy, _build_search_planner, _log_search_planner_active) now correctly target the call site (`aiperf.cli_runner._multi_run.<name>`) following the standard 'patch where it's looked up' rule.

ruff_baseline.json and ergonomics_baseline.json updated for the new paths; docs/troubleshooting/sweeps.md and docs/dev/sweep-orchestrator.md updated likewise. 11,736 unit tests pass; all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
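
The 'patch where it's looked up' rule mentioned in this commit can be demonstrated with two throwaway modules. The module names `sketch_strategy` and `sketch_multi_run` are hypothetical stand-ins built in memory; only `unittest.mock.patch` is a real API here.

```python
import sys
import types
from unittest.mock import patch

# Module that defines the helper.
strategy = types.ModuleType("sketch_strategy")
exec("def build_strategy():\n    return 'real'\n", strategy.__dict__)
sys.modules["sketch_strategy"] = strategy

# Module that imports the helper by name and calls it.
multi_run = types.ModuleType("sketch_multi_run")
exec(
    "from sketch_strategy import build_strategy\n"
    "\n"
    "def run():\n"
    "    return build_strategy()\n",
    multi_run.__dict__,
)
sys.modules["sketch_multi_run"] = multi_run

# Patching the defining module does not affect the caller: sketch_multi_run
# already holds its own reference to the original function.
with patch("sketch_strategy.build_strategy", return_value="mock"):
    patched_at_definition = multi_run.run()

# Patching the name in the module where it is looked up does take effect.
with patch("sketch_multi_run.build_strategy", return_value="mock"):
    patched_at_call_site = multi_run.run()
```

This is why moving a function between submodules usually forces the matching `patch(...)` targets in tests to move with the caller, not with the definition.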

Move pareto-axes resolution and per-cell pareto projection out of _sweep_aggregate.py into a dedicated _pareto.py:
- _resolve_pareto_axes: plugin-registry recipe lookup for pareto_axes
- _extract_axis_value: pull axis value from per-cell stats with fallbacks
- _aggregate_one_cell: project one variation's runs into a Pareto cell

_sweep_aggregate.py drops 775 -> 643 lines and now contains only the per-variation + sweep-wide aggregation pipelines (no per-cell pareto math). Lazy import inside _aggregate_one_cell avoids the cycle with _sweep_aggregate's top-level _resolve_pareto_axes import.

External callers updated:
- aiperf.orchestrator.orchestrator._fire_cell_callback
- aiperf.cli_runner._sweep_table.SweepTableLogger
- tests/unit/test_aggregate_one_cell.py
- ruff_baseline.json (BLE001 entry path)

Also strip refactor-provenance docstrings and comments across the codebase ("Lifted out of X", "Extracted from Y to keep that module under the 500-line cap", "Factored out of Z so the helper exists", etc.). The git history is the right place for that information; in the code it's noise. Touched ~20 files in cli_runner/, search_recipes/, config/, orchestrator/, plugin/. 11,736 unit tests pass; all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

`_invoke_callbacks` and `_print_aggregate_summary` were re-exported at the package surface solely so tests could import them at `aiperf.cli_runner.<name>`. They have no production callers outside the package. Move both off the public surface; update the two test files that imported them to read from `_callbacks` and `_aggregate` directly. `_print_aggregate_summary` was also an unnecessary rename of `print_aggregate_summary` from `_aggregate` (the underscore prefix was left over from when everything lived in a single `cli_runner.py`); tests now use the real name. 11,736 unit tests pass; all pre-commit hooks pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Drive GPU telemetry collector setup through a shared candidate loop so DCGM and local collectors follow the same probe, baseline, and status flow while plugin metadata only handles config-time local classification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Three concrete fixes:

1. `__init__.py` `__all__` listed `_run_single_benchmark` and `_run_multi_benchmark` alongside public names. Underscore-prefix says "private" and __all__ says "public"; pick one. Both stay importable for tests; they're just no longer advertised as public API.
2. `_multi_run.py` had a `_ = (CompletedRun, OnComplete)` line that claimed to suppress an unused-import warning. Both names are actually used (CompletedRun constructor at line 104, OnComplete in two function signatures). The line was a leftover; remove. The trailing `__all__ = ["_run_multi_benchmark"]` also went: Python's default "names without leading underscore are public" makes it redundant here.
3. `__init__.py` docstring listed every helper submodule by name. Each module has its own docstring; this index was just maintenance burden. Trim to the actual public surface.

No behavior change; no test patches updated. 11,736 unit tests pass; all pre-commit hooks pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Three private helpers had Any-typed parameters where the call-site type is statically known:
- _log_search_planner_active(search_planner: Any, logger: Any) in _banner.py becomes (SearchPlanner | None, AIPerfLogger). The caller in _multi_run._execute_multi_benchmark already passes those types.
- _print_metric_block(metric, ...) in _aggregate.py becomes (metric: Any, ...) so the function has a complete signature (the upstream AggregateResult.metrics dict is dict[str, Any], so Any is the honest type here).
- _aggregate_one_cell(cell_results: list[Any], plan: Any, variation: Any) in _pareto.py becomes (list[RunResult], BenchmarkPlan, SweepVariation). Both callers (orchestrator._fire_cell_callback and aggregate_sweep_and_export) already pass those types.

No behavior change; pure annotation tightening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

Drop the underscore prefix on a function that is genuinely shared across packages: cli_runner.run_benchmark, cli_commands.service, and orchestrator.orchestrator (its own caller) all use it. Reaching into another module for a leading-underscore "private" function is a smell; the public name matches its actual usage. Touches every call site (3 prod paths) plus 3 docstring/comment references in tests and config/loader/plan.py. No behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

debermudez

Good to go once it's rebased on main. Very nice work!

…ONL routing

- OTel: `--stream` now accepts a list; add `--otel-resource-attributes` with key=value parsing
- MLflow: rename to singular `--mlflow-tag` / `--mlflow-artifact-glob`; parse tags into a dict; surface schema-1.1 `count`/`sum` size fields in the exporter and the sweep aggregator
- Accuracy: `--accuracy-n-shots` becomes `int | None` (cap 32, defers to benchmark default); `--accuracy-enable-cot` becomes tri-state; tasks accept comma-separated lists
- Endpoint: prepend `http://` to schemeless URLs via AfterValidator; timeout now reads from `EndpointDefaults.TIMEOUT`
- Dataset: wire `DAG_JSONL` through the resolver and custom composer format map
- Server metrics: drop `parquet` from default formats
- Records pipeline: `OutputsJsonRecordProcessor` takes `run` (BenchmarkRun) instead of bare `cfg`; `RecordsManager` reads `self.run.cfg.otel`
- Schema: regenerate `aiperf-config.schema.json` with otel/mlflow sections and `image_edit` endpoint enum

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
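The key=value parsing mentioned for `--otel-resource-attributes` and `--mlflow-tag` can be pictured with the sketch below. The function name `parse_key_value_pairs` and its error handling are assumptions made for illustration; the PR's actual parser is not shown on this page.

```python
def parse_key_value_pairs(items: list[str]) -> dict[str, str]:
    """Turn repeated "key=value" CLI arguments into a dict (illustrative only)."""
    parsed: dict[str, str] = {}
    for item in items:
        # Split on the first "=" only, so values may themselves contain "=".
        key, sep, value = item.partition("=")
        key = key.strip()
        if not sep or not key:
            raise ValueError(f"expected key=value, got {item!r}")
        parsed[key] = value.strip()
    return parsed


if __name__ == "__main__":
    print(parse_key_value_pairs(["team=perf", "env=ci"]))
```

Splitting on the first `=` is a design choice of this sketch; it keeps values such as `filter=a=b` intact.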

The bulk of the v1 user_config.py changes from #941 (plugin-driven local-collector keyword detection, validate_environment on collector classes, generic conflict-detection wording) are already pre-migrated into this branch's new src/aiperf/config/ package; see GpuTelemetryConfig.validate_collector_compatibility() in src/aiperf/config/gpu_telemetry.py and the warning hook in src/aiperf/config/flags/_converter_telemetry.py.

Manual resolutions:

- src/aiperf/common/config/user_config.py: deleted (modify side from main is already represented in the new config/ package).
- src/aiperf/gpu_telemetry/manager.py: adopt main's _collector_candidates / _configure_reachable_collectors / _capture_collector_baseline plugin-dispatch design; keep branch's BenchmarkRun-driven constructor and gpu_telemetry_cfg.* access pattern; drop the legacy _configure_pynvml_collector / _configure_amdsmi_collector / _configure_dcgm_collectors helpers.
- tests/unit/common/config/test_user_config.py: take branch (legacy v1 CLIConfig smoke tests only; UserConfig-validator tests from main no longer apply since the v1 module is gone).
- tests/unit/gpu_telemetry/test_telemetry_manager.py: keep both imports (make_run_from_cli from branch, mock_plugin from main).
- src/aiperf/plugin/schema/{schemas.py,plugins.schema.json,plugins.py}: keep branch's orchestrator metadata + main's GPUTelemetryCollectorMetadata side-by-side.
- tools/ergonomics_baseline.json: union of both file-size entries.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`normalize_http_url` checked only for ``://`` to decide whether to prepend ``http://``, so a bare ``scheme:opaque`` form like ``javascript:alert(1)`` or ``data:text/plain;base64,xyz`` was silently rewritten to ``http://javascript:alert(1)`` and then either:

- rejected with the wrong message ("invalid port" instead of "missing scheme or host"), or
- SILENTLY ACCEPTED when the opaque part happened to be all digits: ``javascript:1234`` became ``http://javascript:1234`` with host=``javascript``, port=1234, a real validation bypass.

Leave the URL alone when the colon prefix is a recognized foreign URI scheme (javascript, data, file, ftp, ftps, sftp, ssh, gopher, ldap[s], mailto, tel, vbscript, ws, wss) so the downstream EndpointConfig validator can reject it as "missing scheme or host". ``localhost:8000`` and ``host:port`` shorthand still work because they don't match a known foreign scheme.

Fixes the canary test ``tests/unit/transports/test_build_url_adversarial.py::test_endpoint_validator_rejects_garbage[javascript-scheme]``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
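A minimal sketch of the rule described in this commit, assuming the normalizer only has to decide between "prepend ``http://``" and "leave untouched". The name `normalize_http_url_sketch` and the regex are illustrative and not the PR's implementation; the scheme list is copied from the commit message.

```python
import re

# Foreign schemes listed in the commit message (ldap[s] expanded to both forms).
FOREIGN_SCHEMES = frozenset({
    "javascript", "data", "file", "ftp", "ftps", "sftp", "ssh", "gopher",
    "ldap", "ldaps", "mailto", "tel", "vbscript", "ws", "wss",
})

# Leading "name:" prefix, following the URI scheme character set (RFC 3986).
_COLON_PREFIX = re.compile(r"^([A-Za-z][A-Za-z0-9+.\-]*):")


def normalize_http_url_sketch(url: str) -> str:
    """Prepend http:// to host[:port] shorthand; leave everything else alone."""
    if "://" in url:
        return url  # an explicit scheme is already present
    match = _COLON_PREFIX.match(url)
    if match and match.group(1).lower() in FOREIGN_SCHEMES:
        return url  # untouched, so a later validator can reject it
    return f"http://{url}"


if __name__ == "__main__":
    for candidate in ("localhost:8000", "javascript:1234", "example.com"):
        print(candidate, "->", normalize_http_url_sketch(candidate))
```

With this shape, ``javascript:1234`` never acquires an ``http://`` prefix, so it cannot be parsed as host ``javascript`` with port 1234.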

The new DatasetResolver had two bugs that wedged sagemaker_data_capture runs with --fixed-schedule before the loader ever ran:

- _check_timing_data only checked top-level `timestamp`/`delay` keys, so sagemaker records (timing under `eventMetadata.inferenceTime`) were flagged as having no timing data. Add a per-type branch and pass `dataset_type` through.
- _resolve_one skipped structural auto-detection whenever `ds.format` was truthy, but Pydantic defaults `format` to SINGLE_TURN. Result: the resolver pre-validated against SINGLE_TURN when the user relied on auto-detect. Use `model_fields_set` so detection runs unless the user explicitly set `format`, matching how the composer infers type at load time.

Fixes the 5 sagemaker integration tests in tests/integration/test_sagemaker_data_capture.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
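The first fix, a per-type branch in the timing check, might look roughly like the sketch below. The function name `has_timing_data` and the `"sagemaker_data_capture"` type string are assumptions for illustration; only the key names `timestamp`, `delay`, and `eventMetadata.inferenceTime` come from the commit message.

```python
from typing import Any


def has_timing_data(record: dict[str, Any], dataset_type: str) -> bool:
    """Report whether a raw record carries timing information for its type."""
    if dataset_type == "sagemaker_data_capture":  # assumed type identifier
        # SageMaker capture records nest timing under eventMetadata.
        metadata = record.get("eventMetadata")
        return isinstance(metadata, dict) and "inferenceTime" in metadata
    # Other formats keep timing in top-level keys.
    return "timestamp" in record or "delay" in record


if __name__ == "__main__":
    sagemaker_row = {"eventMetadata": {"inferenceTime": "2026-05-11T19:08:47Z"}}
    print(has_timing_data(sagemaker_row, "sagemaker_data_capture"))
    print(has_timing_data(sagemaker_row, "single_turn"))
```

The second call shows the original failure mode: without the per-type branch, the same record is reported as having no timing data.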

Ports tools/generate_config_schema.py from ajc/k8s-rework so the hand-maintained aiperf-config.schema.json can be regenerated from the Pydantic models. Adds `make generate-config-schema` / `make check-config-schema` targets and a `generate-config-schema` pre-commit hook that regenerates on AIPerfConfig changes.

Also folds in stray cleanups picked up while touching this area:

- CustomDatasetComposer._format_to_loader_type: replace hand-maintained dict with direct CustomDatasetType(fmt.value); both enums mirror the custom_dataset_loader plugin registry and share string values.
- test_dag_timing_pathology: fix British→US spellings flagged by codespell.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
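The `CustomDatasetType(fmt.value)` cleanup relies on Python's enum lookup-by-value. A self-contained sketch follows; the enum `DatasetFormat`, the member values, and the function name are hypothetical stand-ins, with only the member names `SINGLE_TURN` and `DAG_JSONL` taken from this page.

```python
from enum import Enum


class DatasetFormat(str, Enum):  # hypothetical config-side enum
    SINGLE_TURN = "single_turn"
    DAG_JSONL = "dag_jsonl"


class CustomDatasetType(str, Enum):  # stand-in for the loader-side enum
    SINGLE_TURN = "single_turn"
    DAG_JSONL = "dag_jsonl"


def format_to_loader_type(fmt: DatasetFormat) -> CustomDatasetType:
    """Map one enum onto the other through their shared string value."""
    # Enum lookup by value raises ValueError if the two enums ever drift apart.
    return CustomDatasetType(fmt.value)


if __name__ == "__main__":
    print(format_to_loader_type(DatasetFormat.DAG_JSONL))
```

Compared with a hand-maintained dict, a missing counterpart fails loudly with `ValueError` at lookup time, and there is no mapping to forget when a new member is added to both enums.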

Brings in 6 main PRs: #710 (--image-source flag, multimodal validator hardening), #945 (--session-header), #884 (Agentic Code dataset docs), #826 (UTF-8 for text/JSON file reads), #942 (per-request `extra` payload), #943 (agentic coding dataset files).

The v1 ``src/aiperf/common/config/`` modules were deleted on this branch as part of the config v2 restructure, so the corresponding modify/delete conflicts keep the deletes and the relevant main-side features are ported into the v2 surface:

- ``--image-source`` (PR #710): adds an ``ImageSource | Path`` field with ``BeforeValidator`` coercion to ``aiperf.config.dataset.content.ImageConfig`` plus an ``images_enabled()`` helper. ``ImageGenerator`` dispatches per source mode (ASSETS, NOISE, custom Path) matching main's behavior.
- ``--session-header`` (PR #945): adds the field to ``EndpointConfig``, routes it through ``_converter_endpoint`` / ``_section_fields`` / ``CLIConfig`` so the flag round-trips into ``EndpointInfo.session_header`` (already wired through ``base_transports``).

UTF-8 fix from PR #826 is applied to ``BaseFileLoader._iter_record_dicts`` so the encoding fix flows through every loader that uses the helper.

Test conflicts:

- ``tests/unit/common/config/test_user_config.py``: kept the v2-flavor smoke tests (the v1 alternate body tested deleted v1 modules and was not portable).
- ``tests/unit/dataset/generator/test_image_generator.py``: kept the v2 ``make_image_config`` helper, added a ``source`` knob, aliased the v1 ``ImageWidthConfig`` / ``ImageHeightConfig`` to ``NormalDistribution`` so PR #710's new test classes (Noise mode, custom directory, disabled) read naturally, and added ``batch_size=1`` to the ImageConfig sites those tests construct directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…d --image-source

Doc drift from prior commits on the branch; regenerated by the generate-cli-docs pre-commit hook.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

… env-var/Jinja in FixedDistribution shorthand

`${VAR}` whole-string substitutions now coerce to bool/int/float using the same rules as Jinja, so `isl: ${AIPERF_TEST_ISL}` resolves to a numeric distribution scalar. The schema generator emits matching string-pattern branches under `FixedDistribution` so YAMLs using these forms validate cleanly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

The strict-xfail sanity test for Responses-endpoint `max_tokens=0` emission was a tripwire that no longer adds signal — the None-check semantics are exercised by the surrounding positive tests. Removing the inverted assertion to keep the suite focused on direct behavioural assertions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

for more information, see https://pre-commit.ci

copy-pr-bot · 2026-05-15T19:42:43Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

`tests/unit/config/test_config_schema_generator_integration.py` imports `jsonschema.Draft202012Validator` but the dep was never declared — it was resolved transitively in local envs, so CI's test-imports check fails with `ModuleNotFoundError: No module named 'jsonschema'`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>

PR #912 (commit 94a9102) rewrote tokenizer_validator.py and introduced a ProcessPoolExecutor-based HF cache prefetch in validate_tokenizer_early. The skip-prefetch gate ANDed HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE, but the component_integration conftest only set HF_HUB_OFFLINE, so the gate never engaged and the prefetch subprocesses ran.

Those subprocesses bypass the in-process Tokenizer.from_pretrained patch (subprocess re-imports HF cleanly) and try to write to the real HF cache. In restricted environments (Linux CI containers, sandboxes) or under concurrent test execution that races on the cache directory, the write fails with EPERM and aiperf aborts with "Configuration resolution failed: [Errno 1] Operation not permitted", which surfaced as `request_count == 0` in every component_integration test that runs `aiperf profile ...` on Linux CI (ubuntu-latest and the new builder pool). This has been the cause of run-unit-tests failures on main since 2026-05-15.

- src/aiperf/common/tokenizer_validator.py: change the skip-prefetch gate from AND to OR. Either env var being set is enough; both mean "I have a warm cache, do not touch the network/disk." Requiring both was overly conservative and is what masked the test-harness bug.
- tests/component_integration/conftest.py: hf_offline_mode fixture now sets both HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE so the prefetch skip-gate engages even on older aiperf builds where the gate is still ANDed. Restores prior values for both on teardown.

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>

…ked (AIP-877)

Adds HellaSwag commonsense reasoning benchmark + the strict ExactMatchGrader that DeepEval's ``Scorer.exact_match_score`` uses. Both align byte-for-byte with the trt-llm benchmark recipe's DeepEval-backed configuration.

**HellaSwagBenchmark** (`src/aiperf/accuracy/benchmarks/hellaswag.py`)

- Loads ``Rowan/hellaswag``: validation split filtered per task by ``activity_label``, train split feeds the "one few-shot per unique activity_label" rule (mirrors ``deepeval.benchmarks.HellaSwag``'s ``categories_seen`` dedupe loop).
- Prompt rendering delegates to ``deepeval.benchmarks.hellaswag.template.HellaSwagTemplate.generate_output``; output is byte-equal to what the trt-llm recipe ships.
- Defaults: ``n_shots=10`` (DeepEval cap is 15), ``generation_size=5``, ``default_grader=exact_match``.
- Ground truth: bare ``A``/``B``/``C``/``D`` letter (DeepEval's convention for ``Scorer.exact_match_score``).
- ``_resolve_tasks`` matches activity labels case-insensitively via a lowercased-value map; falls back to upper-snake-case enum name (``HellaSwagTask.APPLYING_SUNSCREEN`` form) for the recipe's ``getattr(HellaSwagTask, name.upper(), None)`` parity.

**ExactMatchGrader** (`src/aiperf/accuracy/graders/exact_match.py`)

- Strict ``pred.strip() == gold.strip()`` semantics matching DeepEval's ``Scorer.exact_match_score``: case-sensitive, no normalization, empty response → ``unparsed=True``.
- Used by HellaSwag and (in AIP-878) BigBench-Hard for reference parity.

**Plugin registration** (`src/aiperf/plugin/plugins.yaml`)

- ``hellaswag`` → ``default_grader: exact_match``, ``default_n_shots: 10`` with the DeepEval-backed description.
- ``exact_match`` → strict-equality description; drops the ``is_implemented: false`` flag.

**Dependencies** (`pyproject.toml`)

- Adds ``deepeval>=2.9.0`` to the ``[accuracy]`` optional-dependency group. Aiperf calls DeepEval's bundled prompt template directly so the dep is required for HellaSwag (and BigBench-Hard in AIP-878).

**Tests** (`tests/unit/accuracy/`)

- ``test_hellaswag_benchmark.py``: ~22 tests covering DeepEval prompt byte-equality, the unique-activity-label shots set rule, validation filtering, task resolution (exact, lower, upper, mixed case), and pathological dataset rows (empty validation, unlabeled rows).
- ``test_exact_match_grader.py``: strict-equality semantics including the empty-response → ``unparsed=True`` path and case-sensitivity.
- ``test_accuracy_config.py``: drops ``hellaswag`` from ``STUB_BENCHMARKS`` and ``exact_match`` from ``STUB_GRADERS``; the uppercase-stub test now uses ``BIGBENCH`` (a still-stub name).

**Docs** (`docs/accuracy/`)

- ``accuracy-benchmarking.md``: add HellaSwag row to the benchmarks table.
- ``accuracy_stubs.md``: status summary + move HellaSwag from "Still Stubbed" to "Implemented"; move ExactMatchGrader to "Implemented".

**Constructor signature**

Loader + grader use the v2 ``BenchmarkRun`` API (post-#912 refactor on main) rather than the legacy ``UserConfig`` shape; this matches how ``MMLUBenchmark`` and ``AIMEBenchmark`` are wired on current main.

Validation:

- 70/70 accuracy tests pass (HellaSwag + ExactMatch + AccuracyConfig).
- Ruff format + ruff check clean on all modified Python files.
- Codespell clean (v2.4.2, matches CI).
- HellaSwag prompts verified byte-equal against ``HellaSwagTemplate.generate_output`` on synthetic fixtures.

Reference:

- ``deepeval/benchmarks/hellaswag/hellaswag.py``
- ``deepeval/benchmarks/hellaswag/template.py``
- ``deepeval/scorer/scorer.py:Scorer.exact_match_score``
- ``trt-llm-benchmark-recipe/src/tools/acc_benchmark.py:319-336``

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
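The strict-equality semantics described for ``ExactMatchGrader`` can be sketched in a few lines. `GradeResult` and `grade_exact_match` are hypothetical names chosen for this example, and treating a whitespace-only response as empty is an assumption; the real grader's class shape is not shown on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GradeResult:  # hypothetical result container
    correct: bool
    unparsed: bool


def grade_exact_match(response: str, ground_truth: str) -> GradeResult:
    """Case-sensitive comparison after stripping surrounding whitespace."""
    pred = response.strip()
    if not pred:
        # An empty response cannot be graded, so flag it instead of scoring it.
        return GradeResult(correct=False, unparsed=True)
    return GradeResult(correct=pred == ground_truth.strip(), unparsed=False)


if __name__ == "__main__":
    print(grade_exact_match(" A\n", "A"))
    print(grade_exact_match("a", "A"))
    print(grade_exact_match("", "A"))
```

Because nothing is normalized beyond whitespace stripping, a lowercase ``a`` does not match ground truth ``A``, which is the case-sensitivity the tests above exercise.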

Implements the BigBench-Hard accuracy benchmark by delegating prompt rendering to ``deepeval.benchmarks.BigBenchHard``'s ``BigBenchHardTemplate.generate_output``. Output is byte-equal to the trt-llm benchmark recipe's DeepEval-backed configuration so reference parity is preserved end-to-end. Pairs with the existing ``ExactMatchGrader`` (landed via AIP-877) for the recipe's strict ``Scorer.exact_match_score`` semantics.

Loader uses the new ``BenchmarkRun`` constructor signature introduced by PR #912 (no ``UserConfig``), and the test fixture wires through the ``make_benchmark_run`` conftest helper. ``deepeval`` is already pinned in the ``[accuracy]`` extras via AIP-877; the test guards on ``pytest.importorskip("deepeval")`` so the suite still runs without the optional install.

Drops ``bigbench`` from ``STUB_BENCHMARKS``, removes ``is_implemented: false`` from the ``plugins.yaml`` entry, and updates the accuracy docs to reflect the new implemented status. The uppercase-stub validator test now exercises ``LCB_CODEGENERATION`` since ``BIGBENCH`` is no longer stubbed.

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>

Implement ``AIME24Benchmark`` to mirror the trt-llm benchmark recipe's ``acc_bench_lighteval.py`` configuration for AIME 2024:

    aime24 = LightevalTaskConfig(
        name="aime24",
        prompt_function=aime_prompt_fn,
        hf_repo="HuggingFaceH4/aime_2024",
        evaluation_splits=["train"],
        few_shots_split=None,
        few_shots_select=None,
        generation_size=32768,
        metric=[expr_gold_metric],
    )

The recipe's ``aime_prompt_fn`` produces a ``Doc`` whose ``query`` is the bare problem text; lighteval's prompt manager wraps it as a single user message with no instruction prefix and no few-shot priming. The loader emits prompts the same way: one ``BenchmarkProblem`` per dataset row, ``prompt`` = the bare ``problem`` field, ``ground_truth`` = ``str(answer)``, ``metadata.generation_size`` = 32768. ``tasks`` / ``n_shots`` / ``enable_cot`` arguments are accepted for protocol uniformity but ignored (any of them changing the prompt would diverge from the reference). Pair with ``LightevalExprGrader`` for the recipe's ``expr_gold_metric`` extraction.

Built on the v2 ``BenchmarkRun`` API (post-PR-#912) and on the AIP-878 test harness conventions: ``make_benchmark_run`` for fixtures, ``BenchmarkProblem``-driven assertions, ``patch`` on ``aime24.load_dataset`` for deterministic rows. The loader has no heavy optional dependency (``datasets`` is a core dep), so no fake-harness is needed; CI gets 100% line + branch coverage out of the box.

Updates the stub registry: drop ``aime24`` from ``test_accuracy_config.STUB_BENCHMARKS``, drop the ``is_implemented: false`` flag from the ``aime24`` plugins.yaml entry, switch ``default_grader`` to ``lighteval_expr``, add an ``aime24`` row to ``docs/accuracy/accuracy-benchmarking.md``, and move it from "Still Stubbed" to "Implemented" in ``accuracy_stubs.md`` (refreshing the Status Summary, Method Count Summary, and Suggested Implementation Order sections accordingly).

Signed-off-by: Elias Bermudez <dbermudez@nvidia.com>
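The row-to-problem mapping described in this commit could be sketched as below. The `BenchmarkProblem` dataclass here is a simplified stand-in whose fields are inferred from the commit message, and `rows_to_problems` is a hypothetical helper name; the real class in aiperf may carry more fields.

```python
from dataclasses import dataclass, field
from typing import Any

AIME24_GENERATION_SIZE = 32768  # value quoted in the commit message


@dataclass
class BenchmarkProblem:  # simplified stand-in for the aiperf model
    prompt: str
    ground_truth: str
    metadata: dict[str, Any] = field(default_factory=dict)


def rows_to_problems(rows: list[dict[str, Any]]) -> list[BenchmarkProblem]:
    """Emit one problem per dataset row, with no instruction prefix or few-shots."""
    return [
        BenchmarkProblem(
            prompt=row["problem"],            # bare problem text
            ground_truth=str(row["answer"]),  # answers are compared as strings
            metadata={"generation_size": AIME24_GENERATION_SIZE},
        )
        for row in rows
    ]


if __name__ == "__main__":
    sample_rows = [{"problem": "Find the remainder.", "answer": 204}]
    print(rows_to_problems(sample_rows)[0])
```

The sample row is made up for the demonstration and is not taken from the dataset.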

ajcasagrande changed the title ~~feat(orchestrator,config): port sweep orchestrator, YAML config, search recipes~~ May 11, 2026

ajcasagrande force-pushed the ajc/sweep-orchestrator-port branch from 20e5ca9 to d2da6f2 Compare May 11, 2026 19:18

ajcasagrande changed the title ~~feat(orchestrator,config): Adaptive search sweep orchestrator, YAML v2 config, search recipes~~ May 11, 2026

github-actions Bot added the feat label May 11, 2026

ajcasagrande changed the title ~~feat: Adaptive search sweep orchestrator, YAML v2 config, search recipes~~ May 11, 2026

ajcasagrande changed the title ~~feat: YAML config language + sweep orchestrator with adaptive BO & recipes~~ May 11, 2026

ajcasagrande and others added 5 commits May 11, 2026 12:30

docs: condense NaN/Inf discipline guidance

e8fdc5e

Keep the four agent instruction files synchronized while moving detailed finite-value guidance to the canonical docs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Merge origin/main into ajc/sweep-orchestrator-port

b94cff5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

debermudez reviewed May 14, 2026

View reviewed changes

ajcasagrande and others added 15 commits May 14, 2026 18:28

debermudez approved these changes May 15, 2026

View reviewed changes

ajcasagrande and others added 10 commits May 15, 2026 10:36

[pre-commit.ci] auto fixes from pre-commit.com hooks

1f99b0a

for more information, see https://pre-commit.ci

ajcasagrande enabled auto-merge (squash) May 15, 2026 19:43

ajcasagrande merged commit 94a9102 into main May 15, 2026
20 of 26 checks passed

ajcasagrande deleted the ajc/sweep-orchestrator-port branch May 15, 2026 19:49

ajcasagrande mentioned this pull request May 16, 2026

fix(ci): export MALLOC_ARENA_MAX=2 before pytest for component_integration #950

Merged

3 tasks

ajcasagrande mentioned this pull request May 19, 2026

feat: Cherry-pick YAML config, CI malloc, and test selection fixes to release/0.9.0 #958

Merged

ajcasagrande mentioned this pull request May 22, 2026

Feature request: Dynamic concurrency search based on previous benchmark results #883

Closed

JimmyWhitaker mentioned this pull request May 22, 2026

[BUG] -H/--header dropped in sweep iterations (Authorization redacted to literal "<redacted>") #981

Closed

FrankD412 mentioned this pull request May 22, 2026

fix(api): keep /api/results listener open after benchmark completes (DYN-701) #989

Merged

5 tasks

matthewkotila mentioned this pull request May 26, 2026

feat(api): add /api/run endpoint exposing run-identity metadata #997

Merged

matthewkotila mentioned this pull request May 27, 2026

fix(redact): tighten sensitive-token list to stop matching LLM token-count flags #1006

Merged

5 tasks

feat: YAML-native v2 config + adaptive sweep orchestrator with BO & search recipes#912

ajcasagrande merged 53 commits into main from ajc/sweep-orchestrator-port

2 participants

Summary

Architecture

One pipeline at every cardinality

Search recipes → AdaptiveSearchSweep → planner

Adaptive search — propose → execute → record loop

Test plan
