Skip to content

fix(config): auto-detect BurstGPT CSV in DatasetResolver and pin fixed_schedule regression#984

Merged
lkomali merged 4 commits into
mainfrom
lkomali/burst-gpt-fixed-schedule-resolver
May 22, 2026
Merged

fix(config): auto-detect BurstGPT CSV in DatasetResolver and pin fixed_schedule regression#984
lkomali merged 4 commits into
mainfrom
lkomali/burst-gpt-fixed-schedule-resolver

Conversation

@lkomali

@lkomali lkomali commented May 22, 2026

Copy link
Copy Markdown
Contributor

Summary

  • DatasetResolver._detect_type bailed out of structural detection when the first line of an input file wasn't valid JSON. BurstGPT is the only CSV-format loader in the tree, so dropping a bare BurstGPT CSV (no --custom-dataset-type) tripped the ValueError on the CSV header and returned (None, None) — even though BurstGPTTraceDatasetLoader.can_load recognizes the format from its header columns on its own. Extracted a small _read_first_jsonl_record helper that returns None (instead of bailing) on a non-JSON first line so loaders' filename-based can_load checks still get a chance. OSError still bails the whole detection — that case genuinely cannot proceed.
  • Added regression coverage that was missing for burst_gpt_trace + --fixed-schedule: one integration test with explicit --custom-dataset-type, one with auto-detect (the second pins the _detect_type fix specifically). Plus two unit tests in test_resolver_edge_cases.py covering the _detect_type contract and the full-resolver has_timing pass for a bare CSV.

Note: the related _check_timing_data fix for BurstGPT already shipped to main in 83bdcd92 ("fix the nightly..."). That fix is still missing on release/0.9.0, where the user-reported bug was reproduced .

Summary by CodeRabbit

  • Bug Fixes

    • Improved dataset auto-detection to avoid mis-parsing CSV headers as JSON, handle unreadable or non-JSON-first-line files gracefully, and restore fixed-schedule runs when dataset type isn’t provided.
  • Tests

    • Added integration and unit tests plus a test utility to validate BurstGPT-style CSV detection, timing behavior, and end-to-end fixed-schedule execution.

Review Change Stack

…d_schedule regression

Signed-off-by: lkomali <lkomali@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented May 22, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions

github-actions Bot commented May 22, 2026

Copy link
Copy Markdown

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@75b717a1425ecb7a0591b91c2daf850e6c56f559

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@75b717a1425ecb7a0591b91c2daf850e6c56f559

Last updated for commit: 75b717aBrowse code

@coderabbitai

coderabbitai Bot commented May 22, 2026

Copy link
Copy Markdown
Contributor

Walkthrough

Refactors DatasetResolver JSONL probing into _read_first_jsonl_record, updates _detect_type to use it (only catching OSError), and adds a CSV fixture helper plus unit and integration tests covering BurstGPT CSV detection and fixed-schedule execution.

Changes

Dataset Type Detection and CSV Handling

Layer / File(s) Summary
JSONL parsing and dataset type detection refactoring
src/aiperf/config/dataset/resolver.py
New _read_first_jsonl_record(file_path) returns the first parsed JSON dict or None for empty/non-JSON first lines and lets OSError propagate. _detect_type now uses this helper and only catches OSError for non-directory paths.
Test fixture utility for BurstGPT CSV
tests/integration/utils.py
create_burst_gpt_csv_file writes a BurstGPT-style CSV using the first row's keys as header, validates non-empty rows, and returns the created file path.
Integration and unit tests for resolver CSV detection
tests/integration/test_burst_gpt_trace.py, tests/unit/config/test_resolver_edge_cases.py
Integration tests exercise fixed-schedule with explicit and auto-detected burst_gpt_trace. Unit tests verify _detect_type classifies BurstGPT CSVs when the first line is a CSV header, that probing does not log JSON parse warnings, that _read_first_jsonl_record returns None for binary/non-object/empty files, and that end-to-end resolver marks dataset timing data.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Poem

🐰 I sniffed a header, plain and shy,
Not JSON—so I let it lie.
A helper reads the very first line,
Returns None when JSON won't align.
Now schedules hop along in time.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately describes the main changes: fixing auto-detection of BurstGPT CSV files in DatasetResolver and adding regression coverage for the fixed_schedule issue.
Docstring Coverage ✅ Passed Docstring coverage is 94.12% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/test_burst_gpt_trace.py (1)

110-112: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert outputs in the auto-detect path as well.

The explicit-type test checks has_all_outputs, but the auto-detect regression test currently does not. Adding the same assertion improves end-to-end parity.

💡 Proposed fix
         assert result.exit_code == 0
         assert result.request_count == len(rows)
+        assert result.has_all_outputs
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_burst_gpt_trace.py` around lines 110 - 112, The
auto-detect regression test is missing the parity check for outputs: after the
existing assertions (assert result.exit_code == 0 and assert
result.request_count == len(rows)) add the same end-to-end output assertion used
in the explicit-type test by asserting result.has_all_outputs to ensure all
expected outputs were produced; locate this in the auto-detect test in
tests/integration/test_burst_gpt_trace.py and insert the assert for
result.has_all_outputs immediately after the other assertions.
🧹 Nitpick comments (2)
tests/integration/test_burst_gpt_trace.py (1)

46-51: ⚡ Quick win

Add explicit return type hints to async test methods.

Both async test functions should declare -> None to match the repo-wide typing rule.

💡 Proposed fix
     async def test_fixed_schedule_with_explicit_dataset_type(
         self,
         cli: AIPerfCLI,
         aiperf_mock_server: AIPerfMockServer,
         tmp_path: Path,
-    ):
+    ) -> None:
@@
     async def test_fixed_schedule_auto_detected(
         self,
         cli: AIPerfCLI,
         aiperf_mock_server: AIPerfMockServer,
         tmp_path: Path,
-    ):
+    ) -> None:

As per coding guidelines: Type hints on ALL functions (parameters and return types).

Also applies to: 81-86

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_burst_gpt_trace.py` around lines 46 - 51, Add explicit
return type hints "-> None" to the async test function definitions: update the
signature of test_fixed_schedule_with_explicit_dataset_type and the other async
test function in this file to declare their return type as None (e.g., "async
def test_fixed_schedule_with_explicit_dataset_type(... ) -> None:") to satisfy
the repository typing rule for all functions.
tests/integration/utils.py (1)

136-145: ⚡ Quick win

Validate required BurstGPT columns in the fixture helper.

The helper docstring promises required columns, but that contract is not enforced. Add explicit validation so malformed fixtures fail early with clear errors.

💡 Proposed fix
     if not rows:
         raise ValueError("rows must not be empty")
 
-    fieldnames = list(rows[0].keys())
+    required_columns = {"Timestamp", "Request tokens", "Response tokens"}
+    first_keys = set(rows[0].keys())
+    missing = required_columns - first_keys
+    if missing:
+        raise ValueError(
+            f"rows[0] missing required BurstGPT columns: {sorted(missing)}"
+        )
+
+    fieldnames = list(rows[0].keys())
+    fieldnames_set = set(fieldnames)
     csv_path = tmp_path / filename
     with open(csv_path, "w", newline="", encoding="utf-8") as f:
         writer = csv.DictWriter(f, fieldnames=fieldnames)
         writer.writeheader()
         for row in rows:
+            extra = set(row.keys()) - fieldnames_set
+            if extra:
+                raise ValueError(
+                    f"row has unexpected columns not present in header: {sorted(extra)}"
+                )
             writer.writerow(row)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/utils.py` around lines 136 - 145, The CSV fixture helper
currently accepts rows without enforcing the required BurstGPT columns; add an
explicit validation after computing fieldnames to ensure required_columns
(define a list like required_columns = ["<COLUMN1>", "<COLUMN2>", ...] replaced
with the actual BurstGPT-required column names) are all present in fieldnames,
and if any are missing raise a ValueError that lists the missing columns; apply
this check in the same function around the rows/fieldnames variables (before
opening csv_path and before creating the DictWriter) so malformed fixtures fail
fast with a clear error message.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/config/dataset/resolver.py`:
- Around line 225-245: The helper _read_first_jsonl_record currently returns any
JSON value which can be a non-dict and later break callers that expect to call
.get(); update _read_first_jsonl_record to parse the first non-blank line with
load_json_str but only return the parsed value when it is an instance of dict,
otherwise return None; keep the existing behavior of propagating OSError and
still returning None for non-JSON first lines by catching ValueError as before.

---

Outside diff comments:
In `@tests/integration/test_burst_gpt_trace.py`:
- Around line 110-112: The auto-detect regression test is missing the parity
check for outputs: after the existing assertions (assert result.exit_code == 0
and assert result.request_count == len(rows)) add the same end-to-end output
assertion used in the explicit-type test by asserting result.has_all_outputs to
ensure all expected outputs were produced; locate this in the auto-detect test
in tests/integration/test_burst_gpt_trace.py and insert the assert for
result.has_all_outputs immediately after the other assertions.

---

Nitpick comments:
In `@tests/integration/test_burst_gpt_trace.py`:
- Around line 46-51: Add explicit return type hints "-> None" to the async test
function definitions: update the signature of
test_fixed_schedule_with_explicit_dataset_type and the other async test function
in this file to declare their return type as None (e.g., "async def
test_fixed_schedule_with_explicit_dataset_type(... ) -> None:") to satisfy the
repository typing rule for all functions.

In `@tests/integration/utils.py`:
- Around line 136-145: The CSV fixture helper currently accepts rows without
enforcing the required BurstGPT columns; add an explicit validation after
computing fieldnames to ensure required_columns (define a list like
required_columns = ["<COLUMN1>", "<COLUMN2>", ...] replaced with the actual
BurstGPT-required column names) are all present in fieldnames, and if any are
missing raise a ValueError that lists the missing columns; apply this check in
the same function around the rows/fieldnames variables (before opening csv_path
and before creating the DictWriter) so malformed fixtures fail fast with a clear
error message.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b868e94-47bf-4ecb-929f-458b702e3827

📥 Commits

Reviewing files that changed from the base of the PR and between bd5467f and 95e1abd.

📒 Files selected for processing (4)
  • src/aiperf/config/dataset/resolver.py
  • tests/integration/test_burst_gpt_trace.py
  • tests/integration/utils.py
  • tests/unit/config/test_resolver_edge_cases.py
Comment thread src/aiperf/config/dataset/resolver.py
@codecov

codecov Bot commented May 22, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aiperf/config/dataset/resolver.py 94.73% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

Comment thread src/aiperf/config/dataset/resolver.py Outdated

@ajcasagrande ajcasagrande left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix looks functionally sound: the new resolver path lets BurstGPT CSVs reach structural loader detection, and both targeted integration tests plus a manual aiperf profile run against the in-repo mock server pass. I found one low-severity cleanup item: the expected non-JSON CSV probe still goes through load_json_str, so successful auto-detection emits a scary JSON parse warning before continuing. I’d fix that first; no correctness blocker found.

Comment thread src/aiperf/config/dataset/resolver.py Outdated
…JSON-parse warning

Signed-off-by: lkomali <lkomali@nvidia.com>
@lkomali lkomali requested a review from ajcasagrande May 22, 2026 18:47
…SV probe

Signed-off-by: lkomali <lkomali@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/config/dataset/resolver.py`:
- Around line 246-248: The file-reading in
DatasetResolver._read_first_jsonl_record currently calls open(file_path) which
uses the system locale and can cause inconsistent UnicodeDecodeError; change the
open call to explicitly pass encoding="utf-8" (e.g., open(file_path,
encoding="utf-8")) so the probe decoding is pinned to UTF‑8 when reading the
first JSONL record and subsequent lines.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 65684e59-46fb-415b-b764-ca0c3090b0c0

📥 Commits

Reviewing files that changed from the base of the PR and between d7b6f43 and a27a7b9.

📒 Files selected for processing (3)
  • src/aiperf/config/dataset/resolver.py
  • tests/integration/test_burst_gpt_trace.py
  • tests/unit/config/test_resolver_edge_cases.py
Comment thread src/aiperf/config/dataset/resolver.py

@ajcasagrande ajcasagrande left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed for correctness against origin/main; no blocking findings. Focused resolver unit coverage passed, and the new BurstGPT integration tests passed when selected with the integration marker.

…minism

Signed-off-by: lkomali <lkomali@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
src/aiperf/config/dataset/resolver.py (1)

246-247: 🏗️ Heavy lift

Avoid extending the synchronous file-I/O path here.

open(...) keeps the JSONL probe on blocking I/O in the exact path this PR touches. Please route this read through an async helper instead of adding another synchronous file access point in DatasetResolver.

As per coding guidelines, **/*.py: Use async/await for ALL I/O operations - no time.sleep, no blocking calls.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/config/dataset/resolver.py` around lines 246 - 247, The
synchronous open(...) call in DatasetResolver (the method that probes/parses the
JSONL at file_path) must be replaced with an async file-read helper so we avoid
blocking I/O; modify the method to await an async reader (e.g., use an existing
project helper like async_read_text or use aiofiles.open within an async
context) to read the file contents instead of using open(...), keep the rest of
the parsing logic identical, and ensure the method signature remains async so
callers can await it.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/aiperf/config/dataset/resolver.py`:
- Around line 246-247: The synchronous open(...) call in DatasetResolver (the
method that probes/parses the JSONL at file_path) must be replaced with an async
file-read helper so we avoid blocking I/O; modify the method to await an async
reader (e.g., use an existing project helper like async_read_text or use
aiofiles.open within an async context) to read the file contents instead of
using open(...), keep the rest of the parsing logic identical, and ensure the
method signature remains async so callers can await it.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2af67c68-4ff7-4371-892e-334e999cb020

📥 Commits

Reviewing files that changed from the base of the PR and between a27a7b9 and 75b717a.

📒 Files selected for processing (1)
  • src/aiperf/config/dataset/resolver.py

@dynamo-ops dynamo-ops left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previous review comments have been addressed. Approving.

@lkomali lkomali merged commit 9fc0b76 into main May 22, 2026
20 checks passed
@lkomali lkomali deleted the lkomali/burst-gpt-fixed-schedule-resolver branch May 22, 2026 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 participants