Skip to content

feat: Add parameter sweeping functionality#1

Open
Lokiiiiii wants to merge 140 commits into
mainfrom
param-sweep
Open

feat: Add parameter sweeping functionality#1
Lokiiiiii wants to merge 140 commits into
mainfrom
param-sweep

Conversation

@Lokiiiiii

@Lokiiiiii Lokiiiiii commented Feb 21, 2026

Copy link
Copy Markdown
Owner

Implement comprehensive parameter sweeping feature that allows users to benchmark across multiple parameter values (concurrency, request-rate, or both) with automatic aggregation and comparison.

Key features:

  • CLI support for sweeping concurrency and/or request-rate parameters
  • Automatic trial execution across parameter combinations
  • Sweep-specific aggregate metrics (min/mean/max/std/cv per parameter)
  • JSON and CSV exporters for sweep aggregate results
  • Integration with existing confidence interval runs
  • Comprehensive test coverage (unit, property-based, integration)

Changes:

  • Add sweep orchestration strategy and aggregation logic
  • Extend CLI with --sweep-concurrency and --sweep-request-rate options
  • Implement aggregate sweep JSON and CSV exporters
  • Add validation for parameter sweep configurations
  • Update trial directory naming from run_NNNN to trial_NNNN
  • Add comprehensive documentation and tutorials
  • Fix linting issues (exception chaining, loop variable binding)

Tests:

  • 9 integration tests covering all sweep scenarios
  • Property-based tests for sweep aggregation correctness
  • Unit tests for exporters, strategies, and aggregation
  • All tests passing with proper error handling

Documentation:

  • API reference for sweep aggregate formats
  • Tutorial for parameter sweeping workflows
  • Updated CLI options documentation

Summary by CodeRabbit

  • New Features

    • Parameter sweeping (repeated/independent) with per-value and sweep-wide aggregation, Pareto frontier detection, trends, and new JSON/CSV sweep exports.
  • CLI

    • Concurrency accepts lists for sequential sweeps; new sweep options: mode, cooldown_seconds, same-seed.
  • Behavior Changes

    • Dashboard UI is not supported with parameter sweeps and now raises a validation error.
  • Documentation

    • Added full parameter-sweeping guide, CLI docs, troubleshooting, and sweep-aggregates API reference.
  • Tests

    • Extensive unit tests for parsing, strategies, aggregation, exporters, and Pareto/trend logic.
  • Chores

    • Added .hypothesis/ and .kiro/ to .gitignore; hypothesis added to dev dependencies.
@coderabbitai

coderabbitai Bot commented Feb 21, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds parameter-sweep support across CLI, configs, orchestrator, aggregation (Pareto/trends), exporters (JSON/CSV), models, and extensive tests; introduces ParameterSweepStrategy, sweep-aware validators, SweepAggregation, exporters, and multi-run orchestration changes.

Changes

Cohort / File(s) Summary
Docs & Project Setup
\.gitignore, pyproject.toml, docs/api/sweep-aggregates.md, docs/cli_options.md, docs/tutorials/parameter-sweeping.md, docs/troubleshooting/parameter-sweeping-errors.md
Adds comprehensive parameter-sweep docs and troubleshooting; updates CLI docs for sweep options; adds Hypothesis to dev extras; adds .hypothesis/ and .kiro/ to .gitignore.
Config & Validation
src/aiperf/common/config/groups.py, src/aiperf/common/config/loadgen_config.py, src/aiperf/common/config/user_config.py
Adds PARAMETER_SWEEP group; makes concurrency accept int/str/list with parsing/validators; adds parameter_sweep_mode, parameter_sweep_cooldown_seconds, parameter_sweep_same_seed; introduces sweep detection and incompatibility/consistency validators.
Orchestration & CLI runner
src/aiperf/orchestrator/strategies.py, src/aiperf/orchestrator/orchestrator.py, src/aiperf/cli_runner.py
Adds ParameterSweepStrategy; sanitizes labels and switches run prefix to trial_...; refactors execute/_execute_single_run to label+artifact_path signatures; adds execute_and_export and composition helpers; adds UI compatibility checks and sweep-aware multi-run flow.
Aggregation & Analysis
src/aiperf/orchestrator/aggregation/sweep.py, src/aiperf/orchestrator/aggregation/__init__.py
New sweep aggregation module: OptimizationDirection, Objective, ParameterCombination, DEFAULT_PARETO_OBJECTIVES, identify_pareto_optimal, and SweepAggregation.compute; re-exported from aggregation package.
Exporters
src/aiperf/exporters/aggregate/__init__.py, src/aiperf/exporters/aggregate/aggregate_sweep_json_exporter.py, src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py
Adds AggregateSweepJsonExporter and AggregateSweepCsvExporter; exposes them in package __all__; implements JSON and CSV outputs for sweep aggregates (per-combination metrics, best configs, pareto points, metadata).
Models & Misc
src/aiperf/orchestrator/models.py, tests/harness/utils.py
Adds metadata: dict[str, Any] to RunResult; improves test harness failure messages to include stderr/stdout when present.
Tests
tests/unit/...
tests/unit/common/config/*, tests/unit/exporters/aggregate/*, tests/unit/orchestrator/*, tests/unit/test_cli_runner*.py, tests/unit/plot/*, tests/unit/strategies/*, tests/unit/plot/*, tests/unit/...
Adds and updates extensive unit and property tests covering loadgen/user config validators, strategies (ParameterSweepStrategy/FixedTrialsStrategy), orchestrator execution/aggregation/export flows, Pareto identification, exporters, CLI runner UI validation, and many edge cases.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Poem

🐇 I hop through metrics, nibble each sweep,
I chase Pareto carrots, tally highs and deep,
I label trials tidy — trial_001! —
I braid seeds and cooldowns, then dance with a beep,
Tiny paws on data — the aggregate’s complete.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title 'feat: Add parameter sweeping functionality' directly and concisely summarizes the main feature being added across the extensive changeset.
Docstring Coverage ✅ Passed Docstring coverage is 99.05% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

🧹 Nitpick comments (5)
docs/tutorials/parameter-sweeping.md (1)

43-43: Consider adding language specifiers to fenced code blocks.

Several fenced code blocks for output examples lack language specifiers. While these are plain text outputs rather than code, adding text or leaving them as-is is acceptable for documentation. This is a minor linting suggestion.

Also applies to: 74-74, 101-101, 146-146, 164-164, 188-188, 389-389

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/tutorials/parameter-sweeping.md` at line 43, Several fenced code blocks
that show output examples are missing language specifiers; locate the bare
triple-backtick blocks (``` ) used for output examples in parameter-sweeping.md
and change them to include a language specifier like ```text (or ```output) so
the blocks are explicitly marked as plain text; update each instance where a
fenced block is used for output (the bare ``` occurrences) to use ```text for
consistency.
src/aiperf/common/config/user_config.py (1)

337-365: Use get_sweep_parameter() for sweep detection.
Keeps incompatibility checks aligned with parameter‑agnostic sweep detection as additional sweepable parameters are introduced.

♻️ Suggested change
-        is_sweep = isinstance(self.loadgen.concurrency, list)
+        is_sweep = self.loadgen.get_sweep_parameter() is not None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/common/config/user_config.py` around lines 337 - 365, The
validate_sweep_incompatibilities method currently detects sweeps by checking
isinstance(self.loadgen.concurrency, list); replace that check with a call to
get_sweep_parameter() so sweep detection is parameter‑agnostic as more sweepable
parameters are added. Update validate_sweep_incompatibilities to call
get_sweep_parameter() (or store its result) and use its truthiness to drive the
existing incompatibility checks (including the fixed-schedule check and the
_should_use_fixed_schedule_for_mooncake_trace() branch) so all sweep-related
logic relies on the unified sweep detector.
tests/unit/orchestrator/aggregation/test_sweep.py (1)

124-139: Consider consolidating function imports.

Throughout this file, identify_pareto_optimal, analyze_trends, and SweepAggregation are imported inside each test method. While this keeps tests self-contained, it adds verbosity. Consider adding these to the top-level imports alongside the existing ones.

♻️ Suggested import consolidation
 from aiperf.orchestrator.aggregation import (
     DEFAULT_PARETO_OBJECTIVES,
     Objective,
     OptimizationDirection,
+    SweepAggregation,
+    analyze_trends,
+    identify_pareto_optimal,
 )

Then remove the per-method imports like from aiperf.orchestrator.aggregation import identify_pareto_optimal.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/orchestrator/aggregation/test_sweep.py` around lines 124 - 139,
Tests import identify_pareto_optimal, analyze_trends, and SweepAggregation
inside individual test functions which is verbose; consolidate by adding a
single top-level import statement for these symbols (identify_pareto_optimal,
analyze_trends, SweepAggregation) alongside the existing module imports at the
top of the test file and remove the per-test inline imports so each test simply
uses the functions/classes directly.
src/aiperf/orchestrator/strategies.py (2)

502-515: Duplicate _sanitize_label method.

This method is identical to FixedTrialsStrategy._sanitize_label (lines 240-253). Consider extracting to a module-level utility function or moving to the ExecutionStrategy base class to avoid duplication.

♻️ Suggested refactor to base class
 class ExecutionStrategy(ABC):
     """Base class for execution strategies.
 
     Strategies decide:
     1. What config to run next (based on results so far)
     2. Whether to continue or stop
     3. How to label runs for artifact organization
     4. Where to store artifacts (path structure)
     5. Cooldown duration between runs
     """
 
+    def _sanitize_label(self, label: str) -> str:
+        """Sanitize label to prevent path traversal attacks.
+
+        Args:
+            label: Raw label string
+
+        Returns:
+            Sanitized label safe for filesystem paths
+        """
+        # Remove any path separators and parent directory references
+        sanitized = re.sub(r"[/\\]|\.\.", "", label)
+        # Remove any other potentially dangerous characters
+        sanitized = re.sub(r'[<>:"|?*]', "", sanitized)
+        return sanitized
+
     def validate_config(self, config: UserConfig) -> None:  # noqa: B027

Then remove the duplicate _sanitize_label methods from both FixedTrialsStrategy and ParameterSweepStrategy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/orchestrator/strategies.py` around lines 502 - 515, The
_sanitize_label implementation is duplicated in FixedTrialsStrategy and
ParameterSweepStrategy; extract it into a single reusable location (either a
module-level utility function like sanitize_label(...) or as a protected method
on the ExecutionStrategy base class) and update both classes to call that
centralized function/method; then remove the duplicate _sanitize_label methods
from FixedTrialsStrategy and ParameterSweepStrategy so only the shared
implementation remains referenced by both.

458-467: Instance state mutation on first call.

The base_seed is lazily initialized on the first call to get_next_config. This is fine for single-threaded orchestration, but worth noting that this method is not thread-safe if the same strategy instance were used concurrently.

Consider documenting this assumption in the class docstring, or initializing base_seed in __init__ if the seed value is available at construction time.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/orchestrator/strategies.py` around lines 458 - 467, The lazy
initialization of self.base_seed inside get_next_config is not thread-safe;
either move the initialization logic into the class constructor (__init__) so
base_seed is set from config.input.random_seed or DEFAULT_SEED (honoring
auto_set_seed) at object creation, or explicitly document the concurrency
assumption in the class docstring of the strategy explaining that
get_next_config mutates instance state (base_seed) and is not safe for
concurrent calls; update references to DEFAULT_SEED, auto_set_seed, and
config.input.random_seed as needed to ensure the same behavior when moving the
logic to __init__.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/api/sweep-aggregates.md`:
- Around line 22-28: In docs/api/sweep-aggregates.md there are multiple
unlabeled fenced code blocks (e.g., the artifact tree starting with "artifacts/"
and the execution pattern blocks); update each triple-backtick fence to include
a language identifier (use "text" for artifact/command listings) so they become
```text; search for unlabeled ``` blocks in the file (including the other ranges
noted in the review) and replace them with ```text to satisfy markdownlint MD040
and keep formatting consistent.
- Around line 34-63: Update the documented JSON shape to match the exporter by
moving aggregation_type out of "metadata" to top-level and adding the exporter
fields aggregation_type, num_profile_runs, num_successful_runs, and failed_runs
to the Top-Level Structure and examples; keep the existing "metadata" object
(with parameter_name, parameter_values, num_values, num_trials_per_value,
sweep_mode, confidence_level, benchmark_config) but remove aggregation_type from
it so the example blocks (the Top-Level Structure and Metadata Section) reflect
the actual exporter output.

In `@docs/tutorials/parameter-sweeping.md`:
- Line 89: Update the tutorial snippet to use the valid CLI endpoint type:
replace the incorrect flag value "--endpoint-type openai_chat" with
"--endpoint-type chat" in the parameter-sweeping example (look for the CLI
invocation containing the "--endpoint-type" argument).

In `@src/aiperf/cli_runner.py`:
- Around line 578-629: The code currently calls SweepAggregation.compute with
sweep_values that may include values missing per-value aggregates
(per_value_stats), causing KeyError; filter sweep_values to only those present
in per_value_stats before calling SweepAggregation.compute and when computing
num_trials. Specifically, build a filtered list (e.g., sweep_values_filtered =
[v for v in sweep_values if v in per_value_stats]), pass that to
SweepAggregation.compute(per_value_stats, sweep_values_filtered,
sweep_param_name), and use results_by_value[sweep_values_filtered[0]] (or handle
empty case) to set metadata["num_trials_per_value"] so you don't index into a
missing value.

In `@src/aiperf/common/config/loadgen_config.py`:
- Around line 49-52: The validator currently returns a list as-is (if
isinstance(v, list): return v) and later performs numeric comparisons that
assume integers; modify this to validate list element types before numeric
checks by iterating v and ensuring every element is an int (raise ValueError if
any element is not an int), and likewise for the single-value branch add an
explicit isinstance(v, int) check before doing comparisons like v < 1 (raise
ValueError on wrong type); apply the same change to the other validator block
noted around the 666-678 region so both paths enforce integer-only lists and
clear ValueError messages.

In `@src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py`:
- Around line 59-80: The code currently uses sorted(per_value_metrics.keys(),
key=int) which assumes integer sweep values and will fail for floats; update the
CSV row ordering in the block that builds rows (where per_value_metrics and
metric_names are used and writer.writerow is called) to use the canonical order
from the sweep metadata (parameter_values) if available, otherwise perform a
numeric sort that parses keys to float (not int) with a safe fallback to string
ordering for non-numeric keys; ensure you replace both places where sorted(...,
key=int) is used (first_value selection and the loop over
per_value_metrics.keys()) so ordering works for float request-rate sweeps and
mixed/non-numeric values.

In `@src/aiperf/orchestrator/orchestrator.py`:
- Around line 252-263: The code assumes the sweep parameter is named
"concurrency" which breaks other sweep types; make sweep handling
parameter-agnostic by detecting the actual sweep key from result metadata (e.g.,
inspect r.metadata keys or a 'parameter_name' field) instead of hardcoding
"concurrency", then use that detected key when building composed-run metadata
and when calling _collect_failed_sweep_values; update the composition path that
populates sweep metadata (the independent-mode composition block and the other
occurrences noted around the _collect_failed_sweep_values usage) so both
composed runs and sweep-only logging use the same detected sweep parameter name
rather than "concurrency".

In `@tests/unit/exporters/aggregate/test_sweep_exporters.py`:
- Around line 245-255: The test constructs an AggregateResult with the full
sweep_data in the metrics field which hides exporter regressions; update the
AggregateResult instantiation (the AggregateResult class and its metrics
parameter) to pass only the per_value_metrics subset (use per_value_metrics
instead of sweep_data) so the test mirrors the production shape expected by the
exporters (locate the AggregateResult(...) where metrics=sweep_data and replace
with metrics=per_value_metrics).

In `@tests/unit/orchestrator/test_parameter_sweep_properties.py`:
- Around line 54-59: The test constructs a RunResult with a hard-coded
artifacts_path using Path(f"/tmp/{label}") which triggers Ruff S108 and fails on
non-Unix systems; update the test to use a portable temp path by replacing
Path(f"/tmp/{label}") with a path built from tempfile.gettempdir() (e.g.
Path(tempfile.gettempdir()) / label) or use the pytest tmp_path fixture to
construct artifacts_path, keeping the RunResult instantiation (RunResult(...,
artifacts_path=...)) otherwise unchanged.

In `@tests/unit/test_cli_runner_aggregation.py`:
- Around line 21-26: Replace the Unicode multiplication sign '×' with the ASCII
'x' in the docstrings that contain "Structure: 2 trials × 3 concurrency values =
6 runs" (and the similar occurrence around the later docstring block) so the
text reads "2 trials x 3 concurrency values = 6 runs" to satisfy Ruff RUF002;
update each docstring occurrence accordingly.

---

Nitpick comments:
In `@docs/tutorials/parameter-sweeping.md`:
- Line 43: Several fenced code blocks that show output examples are missing
language specifiers; locate the bare triple-backtick blocks (``` ) used for
output examples in parameter-sweeping.md and change them to include a language
specifier like ```text (or ```output) so the blocks are explicitly marked as
plain text; update each instance where a fenced block is used for output (the
bare ``` occurrences) to use ```text for consistency.

In `@src/aiperf/common/config/user_config.py`:
- Around line 337-365: The validate_sweep_incompatibilities method currently
detects sweeps by checking isinstance(self.loadgen.concurrency, list); replace
that check with a call to get_sweep_parameter() so sweep detection is
parameter‑agnostic as more sweepable parameters are added. Update
validate_sweep_incompatibilities to call get_sweep_parameter() (or store its
result) and use its truthiness to drive the existing incompatibility checks
(including the fixed-schedule check and the
_should_use_fixed_schedule_for_mooncake_trace() branch) so all sweep-related
logic relies on the unified sweep detector.

In `@src/aiperf/orchestrator/strategies.py`:
- Around line 502-515: The _sanitize_label implementation is duplicated in
FixedTrialsStrategy and ParameterSweepStrategy; extract it into a single
reusable location (either a module-level utility function like
sanitize_label(...) or as a protected method on the ExecutionStrategy base
class) and update both classes to call that centralized function/method; then
remove the duplicate _sanitize_label methods from FixedTrialsStrategy and
ParameterSweepStrategy so only the shared implementation remains referenced by
both.
- Around line 458-467: The lazy initialization of self.base_seed inside
get_next_config is not thread-safe; either move the initialization logic into
the class constructor (__init__) so base_seed is set from
config.input.random_seed or DEFAULT_SEED (honoring auto_set_seed) at object
creation, or explicitly document the concurrency assumption in the class
docstring of the strategy explaining that get_next_config mutates instance state
(base_seed) and is not safe for concurrent calls; update references to
DEFAULT_SEED, auto_set_seed, and config.input.random_seed as needed to ensure
the same behavior when moving the logic to __init__.

In `@tests/unit/orchestrator/aggregation/test_sweep.py`:
- Around line 124-139: Tests import identify_pareto_optimal, analyze_trends, and
SweepAggregation inside individual test functions which is verbose; consolidate
by adding a single top-level import statement for these symbols
(identify_pareto_optimal, analyze_trends, SweepAggregation) alongside the
existing module imports at the top of the test file and remove the per-test
inline imports so each test simply uses the functions/classes directly.
Comment thread docs/api/sweep-aggregates.md Outdated
Comment thread docs/api/sweep-aggregates.md
Comment thread docs/tutorials/parameter-sweeping.md Outdated
Comment thread src/aiperf/cli_runner.py Outdated
Comment thread src/aiperf/common/config/loadgen_config.py Outdated
Comment thread src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py Outdated
Comment thread src/aiperf/orchestrator/orchestrator.py Outdated
Comment on lines +245 to +255
aggregate = AggregateResult(
aggregation_type="sweep",
num_runs=10,
num_successful_runs=5,
failed_runs=[
{"run_index": 5, "error": "Connection timeout"},
{"run_index": 6, "error": "Connection timeout"},
],
metadata={},
metrics=sweep_data,
)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Use per_value_metrics for AggregateResult.metrics in this test.
Passing the full sweep_data dict can mask exporter regressions because the production path expects only per_value_metrics.

✅ Proposed fix
-            metrics=sweep_data,
+            metrics=sweep_data["per_value_metrics"],
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/exporters/aggregate/test_sweep_exporters.py` around lines 245 -
255, The test constructs an AggregateResult with the full sweep_data in the
metrics field which hides exporter regressions; update the AggregateResult
instantiation (the AggregateResult class and its metrics parameter) to pass only
the per_value_metrics subset (use per_value_metrics instead of sweep_data) so
the test mirrors the production shape expected by the exporters (locate the
AggregateResult(...) where metrics=sweep_data and replace with
metrics=per_value_metrics).
Comment thread tests/unit/orchestrator/test_parameter_sweep_properties.py
Comment thread tests/unit/test_cli_runner_aggregation.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (5)
tests/unit/common/config/test_loadgen_config_validators.py (4)

389-399: Misleading test method name.

Same issue: test_parameter_sweep_same_seed_with_single_concurrency_succeeds actually expects a ValidationError.

✏️ Suggested rename
-    def test_parameter_sweep_same_seed_with_single_concurrency_succeeds(self):
-        """Test that parameter_sweep_same_seed with single concurrency now raises error (requires list)."""
+    def test_parameter_sweep_same_seed_with_single_concurrency_raises_error(self):
+        """Test that parameter_sweep_same_seed with single concurrency raises error (requires list)."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/config/test_loadgen_config_validators.py` around lines 389
- 399, Rename the test function
test_parameter_sweep_same_seed_with_single_concurrency_succeeds to a name that
reflects it expects a ValidationError (e.g.,
test_parameter_sweep_same_seed_with_single_concurrency_raises), and update any
references if used elsewhere; ensure the body still asserts that constructing
LoadGeneratorConfig(concurrency=10, parameter_sweep_same_seed=True) raises
ValidationError and that the error message contains the expected text.

326-336: Misleading test method name.

Similar to the previous case: test_parameter_sweep_cooldown_with_single_concurrency_succeeds actually expects a ValidationError.

✏️ Suggested rename
-    def test_parameter_sweep_cooldown_with_single_concurrency_succeeds(self):
-        """Test that parameter_sweep_cooldown_seconds with single concurrency now raises error (requires list)."""
+    def test_parameter_sweep_cooldown_with_single_concurrency_raises_error(self):
+        """Test that parameter_sweep_cooldown_seconds with single concurrency raises error (requires list)."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/config/test_loadgen_config_validators.py` around lines 326
- 336, Rename the misnamed test method
test_parameter_sweep_cooldown_with_single_concurrency_succeeds to reflect that
it expects a ValidationError (e.g.,
test_parameter_sweep_cooldown_with_single_concurrency_raises or _fails) so the
name matches the assertion; update the test function name in the test file and
any references to it (the function is the test method in
tests/unit/common/config/test_loadgen_config_validators.py that instantiates
LoadGeneratorConfig(concurrency=10, parameter_sweep_cooldown_seconds=3.0) and
asserts pytest.raises(ValidationError)).

427-445: Test contains unreachable code and incomplete implementation.

The code after the with pytest.raises(ValidationError): block (lines 438-441) is unreachable because the exception will exit the context. The subsequent lines 444-445 then test a different scenario without clear separation. The comment "For now, test that default value works without error" suggests this is intentionally incomplete.

Consider either completing the test or splitting it into two separate test methods for clarity.

✏️ Suggested fix: split into two clear tests
     def test_parameter_sweep_mode_with_single_concurrency_raises_error(self):
-        """Test that explicitly setting parameter_sweep_mode with single concurrency raises ValueError."""
-        # Note: We need to simulate explicit setting via model_fields_set
-        # In actual CLI usage, cyclopts would set this field
-        config_dict = {
-            "concurrency": 10,
-            "parameter_sweep_mode": "repeated",
-        }
-        # Manually mark the field as set
+        """Test that explicitly setting parameter_sweep_mode with single concurrency raises ValueError."""
         with pytest.raises(ValidationError):
-            config = LoadGeneratorConfig(**config_dict)
-            # Simulate explicit setting
-            config.model_fields_set.add("parameter_sweep_mode")
-            # Trigger validation
-            config.model_validate(config.model_dump())
-
-        # For now, test that default value works without error
+            LoadGeneratorConfig(concurrency=10, parameter_sweep_mode="repeated")
+
+    def test_parameter_sweep_mode_default_with_single_concurrency_succeeds(self):
+        """Test that default parameter_sweep_mode with single concurrency succeeds."""
         config = LoadGeneratorConfig(concurrency=10)
         assert config.parameter_sweep_mode == "repeated"  # default value is fine
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/config/test_loadgen_config_validators.py` around lines 427
- 445, The test test_parameter_sweep_mode_with_single_concurrency_raises_error
mixes two scenarios and contains unreachable code after the with
pytest.raises(ValidationError) block; split it into two focused tests: one
(e.g., test_parameter_sweep_mode_explicit_with_single_concurrency_raises_error)
that constructs LoadGeneratorConfig(concurrency=10,
parameter_sweep_mode="repeated"), adds "parameter_sweep_mode" to
config.model_fields_set, and asserts a ValidationError is raised when calling
config.model_validate(config.model_dump()), and a second test (e.g.,
test_parameter_sweep_mode_default_with_single_concurrency_no_error) that
constructs LoadGeneratorConfig(concurrency=10) and asserts
config.parameter_sweep_mode == "repeated"; ensure no code remains after the
pytest.raises context in the first test so there is no unreachable code.

264-273: Misleading test method name.

The method is named test_parameter_sweep_mode_with_single_concurrency_succeeds but it actually tests that the operation raises a ValidationError. This can confuse future maintainers.

✏️ Suggested rename
-    def test_parameter_sweep_mode_with_single_concurrency_succeeds(self):
-        """Test that parameter_sweep_mode with single concurrency now raises error (requires list)."""
+    def test_parameter_sweep_mode_with_single_concurrency_raises_error(self):
+        """Test that parameter_sweep_mode with single concurrency raises error (requires list)."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/common/config/test_loadgen_config_validators.py` around lines 264
- 273, Rename the misleading test method
test_parameter_sweep_mode_with_single_concurrency_succeeds to reflect that it
expects a ValidationError (for example
test_parameter_sweep_mode_with_single_concurrency_raises_validation_error), and
update the docstring to indicate it asserts a failure; keep the body intact (the
with pytest.raises(ValidationError) block and the assertion on error_msg) and
ensure any references to the old name (e.g., in test discovery or comments) are
updated accordingly; the relevant symbols are the test method name and the
LoadGeneratorConfig(concurrency=10, parameter_sweep_mode="independent")
call/assertion.
src/aiperf/common/config/loadgen_config.py (1)

652-690: Validation assumes list elements are already integers.

The validator performs v < 1 comparisons on list elements without first confirming they are integers. This pairs with the issue in parse_concurrency_list - if non-integer values slip through, this comparison could behave unexpectedly (floats would pass, non-numeric types would raise TypeError).

This is mitigated if the type validation suggested for parse_concurrency_list is implemented.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/common/config/loadgen_config.py` around lines 652 - 690, The
validator validate_concurrency_list currently assumes list elements are integers
and does comparisons like v < 1 which can raise TypeError for non-numeric types;
update validate_concurrency_list (and ensure parity with parse_concurrency_list)
to first verify each item in self.concurrency is an int (or safely coercible to
int if you want to accept numeric strings/floats), collect positions of any
non-integer items and raise a clear ValueError pointing to those positions, or
coerce valid numeric values to int before running the existing < 1 checks;
reference self.concurrency, validate_concurrency_list, and
parse_concurrency_list when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@src/aiperf/common/config/loadgen_config.py`:
- Around line 49-51: The current branch in loadgen_config.py returns a list
early without validating element types; update the logic in the function
validate_concurrency_list (or the function that handles variable v) to iterate
the list when isinstance(v, list) and validate each element is an integer >= 1
(or coerce numeric strings/floats to int if intended) and otherwise raise a
clear ValueError; ensure the subsequent numeric comparison (v < 1) is applied
per element after validation/coercion so non-integer types (e.g., floats,
strings) are caught or converted before returning the list.

---

Nitpick comments:
In `@src/aiperf/common/config/loadgen_config.py`:
- Around line 652-690: The validator validate_concurrency_list currently assumes
list elements are integers and does comparisons like v < 1 which can raise
TypeError for non-numeric types; update validate_concurrency_list (and ensure
parity with parse_concurrency_list) to first verify each item in
self.concurrency is an int (or safely coercible to int if you want to accept
numeric strings/floats), collect positions of any non-integer items and raise a
clear ValueError pointing to those positions, or coerce valid numeric values to
int before running the existing < 1 checks; reference self.concurrency,
validate_concurrency_list, and parse_concurrency_list when making the change.

In `@tests/unit/common/config/test_loadgen_config_validators.py`:
- Around line 389-399: Rename the test function
test_parameter_sweep_same_seed_with_single_concurrency_succeeds to a name that
reflects it expects a ValidationError (e.g.,
test_parameter_sweep_same_seed_with_single_concurrency_raises), and update any
references if used elsewhere; ensure the body still asserts that constructing
LoadGeneratorConfig(concurrency=10, parameter_sweep_same_seed=True) raises
ValidationError and that the error message contains the expected text.
- Around line 326-336: Rename the misnamed test method
test_parameter_sweep_cooldown_with_single_concurrency_succeeds to reflect that
it expects a ValidationError (e.g.,
test_parameter_sweep_cooldown_with_single_concurrency_raises or _fails) so the
name matches the assertion; update the test function name in the test file and
any references to it (the function is the test method in
tests/unit/common/config/test_loadgen_config_validators.py that instantiates
LoadGeneratorConfig(concurrency=10, parameter_sweep_cooldown_seconds=3.0) and
asserts pytest.raises(ValidationError)).
- Around line 427-445: The test
test_parameter_sweep_mode_with_single_concurrency_raises_error mixes two
scenarios and contains unreachable code after the with
pytest.raises(ValidationError) block; split it into two focused tests: one
(e.g., test_parameter_sweep_mode_explicit_with_single_concurrency_raises_error)
that constructs LoadGeneratorConfig(concurrency=10,
parameter_sweep_mode="repeated"), adds "parameter_sweep_mode" to
config.model_fields_set, and asserts a ValidationError is raised when calling
config.model_validate(config.model_dump()), and a second test (e.g.,
test_parameter_sweep_mode_default_with_single_concurrency_no_error) that
constructs LoadGeneratorConfig(concurrency=10) and asserts
config.parameter_sweep_mode == "repeated"; ensure no code remains after the
pytest.raises context in the first test so there is no unreachable code.
- Around line 264-273: Rename the misleading test method
test_parameter_sweep_mode_with_single_concurrency_succeeds to reflect that it
expects a ValidationError (for example
test_parameter_sweep_mode_with_single_concurrency_raises_validation_error), and
update the docstring to indicate it asserts a failure; keep the body intact (the
with pytest.raises(ValidationError) block and the assertion on error_msg) and
ensure any references to the old name (e.g., in test discovery or comments) are
updated accordingly; the relevant symbols are the test method name and the
LoadGeneratorConfig(concurrency=10, parameter_sweep_mode="independent")
call/assertion.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py (1)

76-82: Consider unioning metric names across all values.

Using only the first value’s metrics can omit metrics present in other values. A simple union avoids silent data loss if per‑value metric sets drift.

♻️ Optional refactor
-            if value_keys:
-                first_value = value_keys[0]
-                metric_names = sorted(per_value_metrics[first_value].keys())
-            else:
-                metric_names = []
+            if value_keys:
+                metric_names = sorted(
+                    {
+                        metric_name
+                        for value in value_keys
+                        for metric_name in per_value_metrics.get(value, {}).keys()
+                    }
+                )
+            else:
+                metric_names = []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py` around lines
76 - 82, The current logic only uses the first entry in value_keys to build
metric_names, which can miss metrics present in other entries; update the code
that computes metric_names (currently referencing value_keys, first_value and
per_value_metrics) to iterate all keys in per_value_metrics, take the union of
all metric keys across every per_value_metrics[value] entry, then sort that
union into metric_names (fall back to empty list if none). Ensure any downstream
logic that expects metric_names remains unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/troubleshooting/parameter-sweeping-errors.md`:
- Around line 9-13: Update all fenced code blocks in the document that show
plain error/notice messages to include a language identifier (e.g., use ```text
instead of ```). Specifically modify the block containing "Invalid concurrency
value: 'abc'. Must be a positive integer (>= 1)." and the other similar message
blocks at the ranges noted (30-35, 52-58, 79-85, 110-115, 141-147, 169-175,
195-200, 220-225, 235-240) so each opening fence includes a language token like
text to satisfy markdownlint MD040.

In `@docs/tutorials/parameter-sweeping.md`:
- Around line 43-45: The markdown contains several non‑code‑fenced ASCII/output
blocks like the snippet "Parameter sweep mode: UI automatically set to 'simple'
(use '--ui none' to disable UI output)"; update each triple‑backtick fence to
include a language identifier (preferably text) so they read ```text ... ```,
and apply the same change to all similar output blocks in this document (the
other blocks referenced in the review). Ensure every non-code ASCII/output block
uses ```text to satisfy markdownlint MD040.

In `@src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py`:
- Around line 59-75: The current logic builds value_keys from
self._result.metadata["parameter_values"] but can yield an empty list if
stringified metadata entries don't match per_value_metrics keys; update the
block around parameter_values/per_value_metrics to normalize and compare keys
robustly (e.g., try matching both str and numeric forms) and if the filtered
value_keys is empty fall back to the numeric_sort_key-based
sorted(per_value_metrics.keys()) approach; ensure you keep numeric_sort_key and
the existing fallback behavior while only using the metadata canonical order
when a non-empty normalized match list is produced.

In `@src/aiperf/orchestrator/strategies.py`:
- Around line 520-527: Docstring examples for the composed sweep paths still use
"run_0001" but FixedTrialsStrategy now emits "trial_0001"; update the examples
in the sweep-related docstring in strategies.py to replace "run_0001" with
"trial_0001" for both repeated and independent mode examples so they reflect the
new FixedTrialsStrategy naming (search for the block containing "Composed with
FixedTrialsStrategy" and adjust the path examples accordingly).

---

Nitpick comments:
In `@src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py`:
- Around line 76-82: The current logic only uses the first entry in value_keys
to build metric_names, which can miss metrics present in other entries; update
the code that computes metric_names (currently referencing value_keys,
first_value and per_value_metrics) to iterate all keys in per_value_metrics,
take the union of all metric keys across every per_value_metrics[value] entry,
then sort that union into metric_names (fall back to empty list if none). Ensure
any downstream logic that expects metric_names remains unchanged.
Comment thread docs/troubleshooting/parameter-sweeping-errors.md
Comment thread docs/tutorials/parameter-sweeping.md Outdated
Comment thread src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py Outdated
Comment thread src/aiperf/orchestrator/strategies.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
src/aiperf/orchestrator/strategies.py (2)

489-501: Potential IndexError if run_index exceeds parameter_values length.

get_run_label is a public method that could be called with any run_index. If called with an index >= len(self.parameter_values), line 498 will raise IndexError. Consider adding bounds checking or documenting the precondition.

💡 Optional defensive fix
     def get_run_label(self, run_index: int) -> str:
         """Generate label: concurrency_10, concurrency_20, etc.

         Args:
             run_index: Zero-based index of run

         Returns:
             Sanitized label safe for filesystem paths
+
+        Raises:
+            IndexError: If run_index >= len(parameter_values)
         """
+        if run_index >= len(self.parameter_values):
+            raise IndexError(
+                f"run_index {run_index} out of bounds for {len(self.parameter_values)} parameter values"
+            )
         value = self.parameter_values[run_index]
         label = f"{self.parameter_name}_{value}"
         # Sanitize label to prevent path traversal
         return _sanitize_label(label)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/orchestrator/strategies.py` around lines 489 - 501, get_run_label
currently indexes self.parameter_values without validation which can raise
IndexError; add explicit bounds checking at the start of get_run_label to ensure
0 <= run_index < len(self.parameter_values) and raise a clear exception (e.g.,
ValueError or IndexError with a descriptive message) if out of range, or
alternatively clamp/handle invalid indices as desired; reference the method
get_run_label, the attribute self.parameter_values and self.parameter_name, and
still call _sanitize_label(label) for the returned label.

22-35: Consider handling edge case where sanitization produces empty string.

The sanitization logic is sound for preventing path traversal. However, if a malformed label somehow results in an empty string after sanitization, it could cause issues downstream (e.g., creating a directory named "").

💡 Optional defensive fix
 def _sanitize_label(label: str) -> str:
     """Sanitize label to prevent path traversal attacks.

     Args:
         label: Raw label string

     Returns:
         Sanitized label safe for filesystem paths
+
+    Raises:
+        ValueError: If sanitization produces empty string
     """
     # Remove any path separators and parent directory references
     sanitized = re.sub(r"[/\\]|\.\.", "", label)
     # Remove any other potentially dangerous characters
     sanitized = re.sub(r'[<>:"|?*]', "", sanitized)
+    if not sanitized:
+        raise ValueError(f"Label '{label}' sanitizes to empty string")
     return sanitized
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/orchestrator/strategies.py` around lines 22 - 35, The
_sanitize_label function can return an empty string for some inputs; update
_sanitize_label to detect when sanitized == "" and handle it defensively —
either raise a clear ValueError (including the original label in the message) or
return a safe fallback like "unnamed_label"; ensure callers expect/handle this
by documenting the behavior in the function docstring. Keep the sanitization
steps (re.sub calls) intact, then add the empty-check and raise/return behavior
immediately after them in the _sanitize_label function.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/tutorials/parameter-sweeping.md`:
- Around line 74-77: Several fenced code blocks in
docs/tutorials/parameter-sweeping.md are missing a language identifier; update
each output/structure block (including the one shown and the blocks at lines
101-119, 146-152, 164-168, 188-223, 389-403) to use the text identifier by
changing the opening fence from ``` to ```text so the blocks render as plain
text; scan the file for other similar output blocks and apply the same change
consistently.

In `@src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py`:
- Around line 62-68: The code currently always appends str(v) to value_keys
causing KeyError when per_value_metrics uses numeric keys; change the logic in
the loop over parameter_values to append the actual matching key (either v if v
in per_value_metrics or str(v) if str(v) in per_value_metrics) so later lookups
use the same type as stored in per_value_metrics, and when writing CSV convert
that key to a string for output (i.e., preserve the matched_key in value_keys
and use str(matched_key) only for headers/printing); apply the same fix to the
second block referenced around the 89-115 region so all lookups use the
preserved actual key.

---

Nitpick comments:
In `@src/aiperf/orchestrator/strategies.py`:
- Around line 489-501: get_run_label currently indexes self.parameter_values
without validation which can raise IndexError; add explicit bounds checking at
the start of get_run_label to ensure 0 <= run_index < len(self.parameter_values)
and raise a clear exception (e.g., ValueError or IndexError with a descriptive
message) if out of range, or alternatively clamp/handle invalid indices as
desired; reference the method get_run_label, the attribute self.parameter_values
and self.parameter_name, and still call _sanitize_label(label) for the returned
label.
- Around line 22-35: The _sanitize_label function can return an empty string for
some inputs; update _sanitize_label to detect when sanitized == "" and handle it
defensively — either raise a clear ValueError (including the original label in
the message) or return a safe fallback like "unnamed_label"; ensure callers
expect/handle this by documenting the behavior in the function docstring. Keep
the sanitization steps (re.sub calls) intact, then add the empty-check and
raise/return behavior immediately after them in the _sanitize_label function.
Comment thread docs/tutorials/parameter-sweeping.md Outdated
Comment thread src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py Outdated

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (4)
docs/tutorials/parameter-sweeping.md (2)

74-77: ⚠️ Potential issue | 🟡 Minor

Add language identifiers to remaining fenced blocks.

Several output blocks still use bare fences (e.g., Lines 74‑77). markdownlint MD040 will continue to warn. Please change them totext (and apply consistently to the other unlabeled output blocks in this file).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/tutorials/parameter-sweeping.md` around lines 74 - 77, The other fenced
output blocks are unlabeled (e.g., the block containing "ValueError: Dashboard
UI is not supported with parameter sweeps..."); update every bare ``` output
fence in this document to use the text language identifier by replacing ``` with
```text so markdownlint MD040 warnings are resolved and all unlabeled output
blocks (including that error block) are consistently labeled.

236-334: ⚠️ Potential issue | 🟡 Minor

Align sweep aggregate example with current per‑combination schema.

The example still documents the older per_value_metrics map and scalar pareto_optimal values. Current sweep outputs use a per_combination_metrics list, and best_configurations / pareto_optimal carry parameter dictionaries. This will mislead users when they parse real output.

✅ Suggested doc alignment (excerpt)
 {
   "metadata": {
-    "parameter_name": "concurrency",
-    "parameter_values": [10, 20, 30, 40],
-    "num_values": 4,
+    "sweep_parameters": [{ "name": "concurrency", "values": [10, 20, 30, 40] }],
+    "num_combinations": 4,
     "num_trials_per_value": 5,
     "sweep_mode": "repeated",
     "confidence_level": 0.95
   },
-  "per_value_metrics": {
-    "10": { ... }
-  },
+  "per_combination_metrics": [
+    { "parameters": { "concurrency": 10 }, "metrics": { ... } }
+  ],
   "best_configurations": {
     "best_throughput": {
-      "value": 40,
+      "parameters": { "concurrency": 40 },
       "metric": 255.1,
       "unit": "requests/sec"
     }
   },
-  "pareto_optimal": [10, 30, 40]
+  "pareto_optimal": [{ "concurrency": 10 }, { "concurrency": 30 }, { "concurrency": 40 }]
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/tutorials/parameter-sweeping.md` around lines 236 - 334, The example
JSON uses the old per_value_metrics map and scalar
pareto_optimal/best_configuration values; update the example to use the current
schema: replace "per_value_metrics" with a "per_combination_metrics" array where
each element includes a "parameters" object (e.g., {"concurrency": 10}) plus the
same metric objects (request_throughput_avg, ttft_p99_ms), change
"pareto_optimal" from [10,30,40] to a list of parameter dictionaries (e.g.,
[{"concurrency":10}, {"concurrency":30}, {"concurrency":40}]), and change
entries under "best_configurations" (best_throughput, best_latency_p99) so their
"value" fields are parameter dictionaries instead of scalars; ensure metric
fields (mean, std, ci_low/ci_high) remain intact and example units stay the
same.
tests/unit/orchestrator/test_parameter_sweep_properties.py (1)

9-58: ⚠️ Potential issue | 🟡 Minor

Avoid hard-coded /tmp in tests (portable temp dir).

This is non-portable and still trips Ruff S108. Prefer tempfile.gettempdir() or tmp_path.

🛠️ Suggested fix
+import tempfile
@@
-        artifacts_path=Path(f"/tmp/{label}"),
+        artifacts_path=Path(tempfile.gettempdir()) / label,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/orchestrator/test_parameter_sweep_properties.py` around lines 9 -
58, The test helper make_run_result currently hardcodes artifacts_path to
Path(f"/tmp/{label}"), which is non-portable and triggers Ruff S108; update
make_run_result to accept an optional artifacts_path parameter (or a tmp_path
fixture) and use that when provided, otherwise derive a portable temp directory
via tempfile.gettempdir() (or Path(tmp_path) when caller supplies tmp_path),
then set RunResult(..., artifacts_path=that_path) so tests use a portable temp
directory instead of '/tmp'.
src/aiperf/orchestrator/orchestrator.py (1)

474-475: ⚠️ Potential issue | 🟠 Major

Sweep metadata is still hardcoded to concurrency, breaking request_rate sweeps.

Sweep-only and composed paths set/inspect only concurrency, so request‑rate sweeps won’t be detected or aggregated correctly (and failure reporting won’t trigger). Use the actual sweep parameter name/value everywhere sweep metadata is written/read.

🛠️ Suggested fix (parameter‑agnostic metadata)
@@
-            result = self._execute_single_run(run_config, label, artifact_path)
+            result = self._execute_single_run(run_config, label, artifact_path)
+            if isinstance(strategy, ParameterSweepStrategy):
+                param_name = strategy.parameter_name
+                param_value = getattr(run_config.loadgen, param_name)
+                result.metadata.update(
+                    {
+                        "parameter_name": param_name,
+                        "parameter_value": param_value,
+                        param_name: param_value,
+                    }
+                )
@@
-        if any("concurrency" in r.metadata for r in results):
+        if any("parameter_name" in r.metadata for r in results):
             failed_values = self._collect_failed_sweep_values(results)
@@
-                result.metadata.update(
-                    {
-                        "trial_index": trial_index,
-                        "value_index": value_index,
-                        "concurrency": run_config.loadgen.concurrency,
-                        "sweep_mode": "repeated",
-                    }
-                )
+                param_name = sweep_strategy.parameter_name
+                param_value = getattr(run_config.loadgen, param_name)
+                result.metadata.update(
+                    {
+                        "trial_index": trial_index,
+                        "value_index": value_index,
+                        "parameter_name": param_name,
+                        "parameter_value": param_value,
+                        param_name: param_value,
+                        "sweep_mode": "repeated",
+                    }
+                )
@@
-                result.metadata.update(
-                    {
-                        "trial_index": trial_index,
-                        "value_index": value_index,
-                        "concurrency": run_config.loadgen.concurrency,
-                        "sweep_mode": "independent",
-                    }
-                )
+                param_name = sweep_strategy.parameter_name
+                param_value = getattr(run_config.loadgen, param_name)
+                result.metadata.update(
+                    {
+                        "trial_index": trial_index,
+                        "value_index": value_index,
+                        "parameter_name": param_name,
+                        "parameter_value": param_value,
+                        param_name: param_value,
+                        "sweep_mode": "independent",
+                    }
+                )
@@
-            if not result.success and "concurrency" in result.metadata:
-                param_value = result.metadata.get("concurrency")
-                param_name = "concurrency"  # Currently only concurrency is supported
+            if not result.success and "parameter_name" in result.metadata:
+                param_name = result.metadata.get("parameter_name")
+                param_value = result.metadata.get("parameter_value")

Also applies to: 497-503, 585-593, 711-718, 973-977

🧹 Nitpick comments (1)
src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py (1)

51-56: Remove duplicate assignment of per_combination_metrics.

This is harmless but adds noise in a critical formatter.

♻️ Suggested cleanup
-        # Section 1: Per-combination metrics table
-        per_combination_metrics = self._result.metrics
-
         # Section 1: Per-combination metrics table
         per_combination_metrics = self._result.metrics
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py` around lines
51 - 56, There are two identical assignments to per_combination_metrics in
aggregate_sweep_csv_exporter.py; remove the duplicate so only a single
assignment remains (keep the line per_combination_metrics = self._result.metrics
and delete the repeated one) to avoid redundant code in the method that builds
the CSV export.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/api/sweep-aggregates.md`:
- Around line 66-90: The metadata example in docs incorrectly includes
best_configurations and pareto_optimal inside the metadata object; remove the
best_configurations and pareto_optimal entries from the metadata JSON example
and from the Fields table (leave them documented as top-level fields if needed),
so the example matches the exporter which filters these keys out of metadata;
update references to metadata, sweep_parameters, and num_combinations
accordingly (ensure only sweep_parameters and num_combinations remain in the
metadata example and Fields table).

In `@src/aiperf/exporters/aggregate/aggregate_sweep_json_exporter.py`:
- Around line 68-85: The _generate_content() currently filters out "trends" from
self._result.metadata and never writes it back, causing trend data to be
dropped; update the logic in aggregate_sweep_json_exporter (inside
_generate_content) to preserve trends by either not removing "trends" when
building metadata or by explicitly adding output["trends"] =
self._result.metadata.get("trends", {}) (reference symbols: _generate_content,
self._result.metadata, output, "trends") so the serialized JSON includes trend
analysis.

In `@tests/unit/exporters/aggregate/test_sweep_exporters.py`:
- Around line 108-112: The comment inside the AggregateResult construction uses
the Unicode multiplication sign (`×`) which raises RUF003; update the comment in
test_sweep_exporters.py (near the AggregateResult instance in
test_sweep_exporters.py where result is defined) to use an ASCII character such
as "x" or an expression like "3 * 5" (e.g., change "3 values × 5 trials" to "3
values x 5 trials" or "3 * 5 trials") so the comment contains only ASCII
characters.

---

Duplicate comments:
In `@docs/tutorials/parameter-sweeping.md`:
- Around line 74-77: The other fenced output blocks are unlabeled (e.g., the
block containing "ValueError: Dashboard UI is not supported with parameter
sweeps..."); update every bare ``` output fence in this document to use the text
language identifier by replacing ``` with ```text so markdownlint MD040 warnings
are resolved and all unlabeled output blocks (including that error block) are
consistently labeled.
- Around line 236-334: The example JSON uses the old per_value_metrics map and
scalar pareto_optimal/best_configuration values; update the example to use the
current schema: replace "per_value_metrics" with a "per_combination_metrics"
array where each element includes a "parameters" object (e.g., {"concurrency":
10}) plus the same metric objects (request_throughput_avg, ttft_p99_ms), change
"pareto_optimal" from [10,30,40] to a list of parameter dictionaries (e.g.,
[{"concurrency":10}, {"concurrency":30}, {"concurrency":40}]), and change
entries under "best_configurations" (best_throughput, best_latency_p99) so their
"value" fields are parameter dictionaries instead of scalars; ensure metric
fields (mean, std, ci_low/ci_high) remain intact and example units stay the
same.

In `@tests/unit/orchestrator/test_parameter_sweep_properties.py`:
- Around line 9-58: The test helper make_run_result currently hardcodes
artifacts_path to Path(f"/tmp/{label}"), which is non-portable and triggers Ruff
S108; update make_run_result to accept an optional artifacts_path parameter (or
a tmp_path fixture) and use that when provided, otherwise derive a portable temp
directory via tempfile.gettempdir() (or Path(tmp_path) when caller supplies
tmp_path), then set RunResult(..., artifacts_path=that_path) so tests use a
portable temp directory instead of '/tmp'.

---

Nitpick comments:
In `@src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py`:
- Around line 51-56: There are two identical assignments to
per_combination_metrics in aggregate_sweep_csv_exporter.py; remove the duplicate
so only a single assignment remains (keep the line per_combination_metrics =
self._result.metrics and delete the repeated one) to avoid redundant code in the
method that builds the CSV export.

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc45565 and a527fd4.

📒 Files selected for processing (14)
  • docs/api/sweep-aggregates.md
  • docs/tutorials/parameter-sweeping.md
  • src/aiperf/cli_runner.py
  • src/aiperf/exporters/aggregate/aggregate_sweep_csv_exporter.py
  • src/aiperf/exporters/aggregate/aggregate_sweep_json_exporter.py
  • src/aiperf/orchestrator/aggregation/__init__.py
  • src/aiperf/orchestrator/aggregation/sweep.py
  • src/aiperf/orchestrator/orchestrator.py
  • tests/integration/test_parameter_sweep.py
  • tests/unit/exporters/aggregate/test_sweep_exporters.py
  • tests/unit/orchestrator/aggregation/test_sweep.py
  • tests/unit/orchestrator/test_parameter_sweep_properties.py
  • tests/unit/plot/test_mode_detector.py
  • tests/unit/test_cli_runner_aggregation.py
Comment thread docs/api/sweep-aggregates.md
Comment on lines +68 to +85
# Extract metadata (excluding the sweep-specific sections)
metadata = {}
for key, value in self._result.metadata.items():
if key not in ["best_configurations", "pareto_optimal", "trends"]:
metadata[key] = value

output["metadata"] = metadata

# Add per_combination_metrics (stored in result.metrics)
output["per_combination_metrics"] = self._result.metrics

# Add sweep-specific sections from metadata
output["best_configurations"] = self._result.metadata.get(
"best_configurations", {}
)
output["pareto_optimal"] = self._result.metadata.get("pareto_optimal", [])

# Serialize to JSON with indentation

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Preserve trends in JSON output instead of dropping it.

_generate_content() filters trends out of metadata but never writes it back to the payload, which silently discards trend analysis when present.

🔧 Suggested fix
         output["best_configurations"] = self._result.metadata.get(
             "best_configurations", {}
         )
         output["pareto_optimal"] = self._result.metadata.get("pareto_optimal", [])
+        output["trends"] = self._result.metadata.get("trends", {})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Extract metadata (excluding the sweep-specific sections)
metadata = {}
for key, value in self._result.metadata.items():
if key not in ["best_configurations", "pareto_optimal", "trends"]:
metadata[key] = value
output["metadata"] = metadata
# Add per_combination_metrics (stored in result.metrics)
output["per_combination_metrics"] = self._result.metrics
# Add sweep-specific sections from metadata
output["best_configurations"] = self._result.metadata.get(
"best_configurations", {}
)
output["pareto_optimal"] = self._result.metadata.get("pareto_optimal", [])
# Serialize to JSON with indentation
# Extract metadata (excluding the sweep-specific sections)
metadata = {}
for key, value in self._result.metadata.items():
if key not in ["best_configurations", "pareto_optimal", "trends"]:
metadata[key] = value
output["metadata"] = metadata
# Add per_combination_metrics (stored in result.metrics)
output["per_combination_metrics"] = self._result.metrics
# Add sweep-specific sections from metadata
output["best_configurations"] = self._result.metadata.get(
"best_configurations", {}
)
output["pareto_optimal"] = self._result.metadata.get("pareto_optimal", [])
output["trends"] = self._result.metadata.get("trends", {})
# Serialize to JSON with indentation
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/exporters/aggregate/aggregate_sweep_json_exporter.py` around lines
68 - 85, The _generate_content() currently filters out "trends" from
self._result.metadata and never writes it back, causing trend data to be
dropped; update the logic in aggregate_sweep_json_exporter (inside
_generate_content) to preserve trends by either not removing "trends" when
building metadata or by explicitly adding output["trends"] =
self._result.metadata.get("trends", {}) (reference symbols: _generate_content,
self._result.metadata, output, "trends") so the serialized JSON includes trend
analysis.
Comment thread tests/unit/exporters/aggregate/test_sweep_exporters.py
@Lokiiiiii Lokiiiiii force-pushed the param-sweep branch 2 times, most recently from 7b4fa22 to 518e4a8 Compare February 24, 2026 06:27
ajcasagrande and others added 20 commits February 24, 2026 13:40
Some inference servers (e.g. vLLM with ignore_eos=True) emit raw newline bytes inside JSON string values in SSE data fields, violating both the SSE and JSON specs. This splits a single data field into a truncated data packet and a garbage packet, causing orjson parse failures.

Detect incomplete JSON objects in SSE data fields (starts with '{' but doesn't end with '}') and treat the next non-data line as a continuation, escaping the embedded newline as \n to produce valid JSON.
…dynamo#704)

Signed-off-by: Nicolas 'Pixel' Noble <nicolas@nobis-crew.org>
…ynamo#706)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
… hack (ai-dynamo#709)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ynamo#698)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Harrison Saturley-Hall <hsaturleyhal@nvidia.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…marking (ai-dynamo#714)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Pavithra Vijayakrishnan <160681768+pvijayakrish@users.noreply.github.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…-dynamo#723)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: Lucas Vogel <lvogel@nvidia.com>
…namo#717)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…rt (ai-dynamo#718)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Signed-off-by: lkomali <lkomali@nvidia.com>
…dynamo#721)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…mo#727)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ector

Replaced hardcoded __pycache__ check with proper _is_run_directory()
validation in test_parent_and_child_paths_mixed test.

This makes the test more robust by using the same validation logic
as the production code, rather than relying on directory name patterns.

All 33 tests in test_mode_detector.py verified passing.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Added comprehensive unit tests for MultiRunOrchestrator aggregation
and export functionality:

- Test aggregation routing (_aggregate_and_export)
- Test confidence-only export (_export_confidence_aggregate)
- Test sweep export (_export_sweep_aggregates)
- Test aggregate_results for confidence and sweep modes
- Test _collect_failed_sweep_values with various scenarios

Coverage improvement:
- orchestrator.py: 49% → 64% (+15%)
- 14 new tests, all passing

Part of Phase 1 coverage improvement effort.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Renamed test functions to follow test_<function>_<scenario>_<expected> convention:
- test_aggregate_and_export_confidence_only_mode ��� test_aggregate_and_export_confidence_only_mode_calls_confidence_export
- test_aggregate_and_export_sweep_mode → test_aggregate_and_export_sweep_mode_calls_sweep_export
- test_aggregate_results_confidence_only_mode → test_aggregate_results_confidence_only_mode_returns_aggregate
- test_aggregate_results_sweep_mode → test_aggregate_results_sweep_mode_returns_per_value_and_sweep_aggregates

All tests passing: 6972 passed, 18 skipped

Signed-off-by: Loki Ravi <lokravi@amazon.com>
…truncation

Previously, passing [1.5, 2.5] would silently truncate to [1, 2] without
any error. This could lead to unexpected behavior where users think they're
testing concurrency 1.5 but actually testing concurrency 1.

Now explicitly checks if a float value is not an integer (using is_integer())
and raises a clear ValueError before attempting int() conversion.

Integer floats like 1.0, 2.0 still work correctly and are converted to ints.

Fixes code review finding: float truncation bug in loadgen_config.py line 59

Signed-off-by: Loki Ravi <lokravi@amazon.com>
…t behavior

The three tests in TestSweepParamsValidation were named '_raises_error' but
actually test that default values work correctly. Updated docstrings to
clarify that:

1. The validators DO work correctly (they check model_fields_set)
2. Testing explicit user setting is complex (requires simulating CLI behavior)
3. These tests verify default values don't cause issues

The validators in loadgen_config.py correctly check model_fields_set and
raise appropriate errors when users explicitly set sweep parameters with
single concurrency values. The test names remain unchanged to preserve
test history, but docstrings now accurately describe what's being tested.

Addresses code review finding about unreachable test code - the suggested
approach (model_construct + manual model_fields_set manipulation) doesn't
work well because other validators interfere during model_validate().

Signed-off-by: Loki Ravi <lokravi@amazon.com>
…g tutorial

Corrected the Pareto optimality analysis for concurrency values. Previously
stated that concurrency 30 dominates concurrency 20, which is mathematically
incorrect.

For Pareto dominance, a configuration must be better or equal on ALL
objectives. Concurrency 30 has:
- Better throughput: 245.3 vs 175.8 req/s ✓
- WORSE latency: 180.5ms vs 145.2ms ✗

Since latency is worse, concurrency 30 does NOT dominate concurrency 20.
Therefore, concurrency 20 IS Pareto optimal.

Updated:
- Explanation text to correctly state concurrency 20 is Pareto optimal
- JSON example to include concurrency 20 in pareto_optimal array
- Added description of why concurrency 20 is Pareto optimal

Fixes code review finding: incorrect Pareto dominance in parameter-sweeping.md

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Updated sweep-aggregates.md to document both directory structures:

1. Independent mode (sweep-only): sweep_aggregate/ at top level
2. Repeated mode (sweep + trials): aggregate/sweep_aggregate/

Previously only showed independent mode structure, which was confusing
for users running sweeps with --num-profile-runs > 1.

Now clearly shows:
- When each structure is used
- Example directory trees for both modes
- Location of per-value confidence aggregates in repeated mode

Fixes code review finding: directory structure clarity in sweep-aggregates.md

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Removed two inline comments that simply restated what the code does:
- '# Execute runs' before self.execute()
- '# Aggregate and export results' before self._aggregate_and_export()

These comments don't add value since the method names are self-explanatory.
The method docstring already documents the overall flow.

Addresses code review nitpick: redundant comments in orchestrator.py lines 79-83

Signed-off-by: Loki Ravi <lokravi@amazon.com>
- Add comprehensive test coverage for sweep aggregation computation
- Test AggregateResult creation and metadata handling
- Test parameter detection and metric conversion
- Test failed run counting and error handling
- Test best configurations and Pareto optimal storage
- Test trials per value calculation
- All 9 tests passing

Signed-off-by: Loki Ravi <lokravi@amazon.com>
- orchestrator.py coverage improved from 62% to 63% (+1%)
- Overall improvement from baseline: 49% → 63% (+14%)
- Phase 1 item 4 completed: test_orchestrator_sweep_aggregation.py
- All 9 sweep aggregation tests passing

Signed-off-by: Loki Ravi <lokravi@amazon.com>
- test_orchestrator_aggregation.py already has comprehensive tests
- All 14 tests passing for export methods
- Covers _aggregate_and_export, _export_confidence_aggregate, _export_sweep_aggregates
- Per-value aggregate export tested within sweep export tests

Signed-off-by: Loki Ravi <lokravi@amazon.com>
- Removed 50+ lines of commented-out test code
- Replaced with concise NOTE comment explaining removal
- analyze_trends() function was removed from design
- All 27 tests passing, 10 skipped

Signed-off-by: Loki Ravi <lokravi@amazon.com>
- Remove Property 8 (Repeated Mode Execution) test stubs
- Remove Property 9 (Independent Mode Execution) test stubs
- Remove Property 16 (Backward Compatibility) test stubs
- Remove Property 17 (UI Mode Validation) test stubs
- Add notes indicating tests moved to integration tests or deferred
- Keeps unit test file focused on property-based unit tests

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Update sweep aggregate metadata example to match actual implementation:
- Change 'num_values' to 'num_combinations'
- Change 'num_successful_values' to 'num_successful_runs'
- Change 'failed_values' to 'failed_runs'
- Update failed_runs structure to use 'run_index' instead of 'value'
- Remove 'timestamp' field (not in actual output)

This aligns documentation with the actual sweep aggregate JSON output.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Fix remaining inconsistencies in Pareto frontier documentation:
- Add concurrency 20 to pareto_optimal array in JSON example
- Update ASCII chart to show concurrency 20 as ● (optimal) not ○ (dominated)
- Update chart caption to reflect all points are Pareto optimal
- Add concurrency 20 to best practices list

This completes the fix started in commit 4b1459a, ensuring all
documentation consistently shows concurrency 20 as Pareto optimal.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Rename three tests to accurately reflect their behavior:
- test_parameter_sweep_mode_with_single_concurrency_raises_error
  → test_parameter_sweep_mode_with_single_concurrency_defaults_to_repeated
- test_parameter_sweep_cooldown_with_single_concurrency_raises_error
  → test_parameter_sweep_cooldown_with_single_concurrency_defaults_to_zero
- test_parameter_sweep_same_seed_with_single_concurrency_raises_error
  → test_parameter_sweep_same_seed_with_single_concurrency_defaults_to_false

These tests verify that default values work correctly with single
concurrency, not that errors are raised. The previous names were
misleading and introduced in commit 4ddfc6f.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Fix _create_confidence_strategy to respect user configuration instead
of hardcoding parameters:
- Use config.loadgen.set_consistent_seed instead of hardcoded True
- Use config.loadgen.profile_run_disable_warmup_after_first instead of hardcoded True

This ensures user settings are honored when creating confidence/fixed
trials strategies, allowing users to control seed consistency and
warmup behavior across multiple profile runs.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Add None check before including sweep_aggregate in results dict to
prevent AttributeError when sweep aggregation fails.

When _compute_sweep_aggregates returns None (e.g., when parameter name
cannot be determined from results), only include per_value_aggregates
in the returned dict. This prevents downstream code from attempting to
export or process a None sweep_aggregate, which would cause crashes.

The function already logs a warning when returning None, so users are
informed when sweep aggregation fails.

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Changed metadata updates in _compose_trials_then_sweep and
_compose_sweep_then_trials to use dynamic parameter name from
sweep_strategy.parameter_name instead of hardcoded 'concurrency'.

This fixes issue where request_rate and other sweep parameters
would fail or produce incorrect results due to hardcoded metadata keys.

Fixes: ai-dynamo#35
Signed-off-by: Loki Ravi <lokravi@amazon.com>
Renamed three test functions to follow test_<function>_<scenario>_<expected>
convention for better clarity:

- test_single_value_remains_integer ->
  test_loadgen_concurrency_single_int_assignment_preserves_int_type
- test_both_cooldowns_independent ->
  test_get_cooldown_seconds_independent_strategies_return_respective_values
- test_zero_cooldown_default ->
  test_get_cooldown_seconds_no_cooldown_specified_returns_zero

Addresses: ai-dynamo#49, ai-dynamo#50
Signed-off-by: Loki Ravi <lokravi@amazon.com>
Added 'text' and 'bash' language identifiers to fenced code blocks
that were missing them, improving markdown linting compliance:

- Directory structure blocks now use 'text'
- Execution pattern diagrams now use 'text'
- Pareto frontier ASCII chart now uses 'text'
- Command examples now use 'bash'

Addresses: ai-dynamo#38, ai-dynamo#41, ai-dynamo#51
Signed-off-by: Loki Ravi <lokravi@amazon.com>
Removed dead code - empty test class that had no test methods.
The class was a placeholder for trend analysis tests that were
removed when the analyze_trends() function was removed from design.

Addresses: ai-dynamo#48
Signed-off-by: Loki Ravi <lokravi@amazon.com>
Replaced random.randint with Hypothesis data.draw() to ensure
deterministic shrinking and reproducibility in property-based tests.

The test now uses st.data() strategy to draw the insertion position,
allowing Hypothesis to control all randomness and provide better
shrinking when failures occur.

Addresses: ai-dynamo#25, ai-dynamo#37, ai-dynamo#46
Signed-off-by: Loki Ravi <lokravi@amazon.com>
Removed outdated notes about needing CLI/cyclopts to populate
model_fields_set. Updated docstrings to correctly explain that
Pydantic populates model_fields_set automatically when passing
fields to the LoadGeneratorConfig constructor.

Addresses: ai-dynamo#42
Signed-off-by: Loki Ravi <lokravi@amazon.com>
Removed comments that simply restated what the code does
(e.g., '# Aggregate results', '# Export aggregates', '# No aggregation needed',
'# Confidence-only mode', '# Sweep + confidence mode').

The code structure and method names are self-documenting, making
these comments unnecessary. The docstring already explains the
method's purpose.

Addresses: ai-dynamo#44
Signed-off-by: Loki Ravi <lokravi@amazon.com>
- Test auto-detection of execution modes (single, sweep, confidence, combined)
- Test strategy composition in repeated and independent modes
- Test execution order verification for composed strategies
- Test cooldown application between runs
- Test failed sweep value logging
- Test strategy creation and validation
- 15 tests covering orchestrator execution flow

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Migrate aggregation, export, tagging, and failure collection from
orchestrator methods to strategy abstract methods. Each strategy
(FixedTrials, ParameterSweep, SweepConfidence) now owns its full
lifecycle. SweepConfidenceStrategy uses Option B (strategy owns
its execution loop) for nested sweep+confidence iteration.

Key changes:
- Extract export_confidence/export_sweep to export_helpers module
- Add abstract methods: aggregate, export_aggregates, tag_result,
  collect_failed_values, execute to ExecutionStrategy base class
- Add SweepConfidenceStrategy with nested loop execution (Option B)
- Add SweepMode enum for repeated/independent execution order
- Reduce orchestrator from 1296 to 453 lines (thin coordinator)
- Consolidate UI validation into single _validate_ui_compatibility
  that checks both is_multi_run and is_sweep
- Remove duplicate dashboard UI check from _run_multi_benchmark
- Add else guard in _run_multi_benchmark for impossible code path
- Delete test_orchestrator_sweep_aggregation.py (covered by
  test_cli_runner_aggregation.py)
- Remove 4 dead tests for deleted _collect_failed_sweep_values

Signed-off-by: Loki Ravi <lokravi@amazon.com>
1. Strip parameter_sweep_cooldown_seconds and parameter_sweep_same_seed
   from inner subprocess config (ship-blocker: these fields caused every
   sweep run to fail with ValidationError when either flag was set)

2. Reject duplicate values in --concurrency list to prevent silent
   artifact overwrites (e.g., --concurrency 2,2,2 would write all runs
   to the same concurrency_2/ directory). Error message guides users
   to --num-profile-runs for variance at a single concurrency.

3. Remove duplicated banner separator in _run_multi_benchmark (rebase
   artifact that doubled the = separator line)

4. Rename SweepAggregation to SweepAnalyzer to avoid implying it
   conforms to the AggregationStrategy ABC (it uses a different
   protocol: static compute() vs instance aggregate())

5. Fix artifact directory path containing Python list literal
   (concurrency[2, 4, 8]) by generating concurrency_sweep_2_4_8
   for list inputs, avoiding shell-hostile brackets and spaces

Signed-off-by: Loki Ravi <lokravi@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment