fix(config): auto-detect BurstGPT CSV in DatasetResolver and pin fixed_schedule regression by lkomali · Pull Request #984 · ai-dynamo/aiperf

lkomali · 2026-05-22T18:22:21Z

Summary

DatasetResolver._detect_type bailed out of structural detection when the first line of an input file wasn't valid JSON. BurstGPT is the only CSV-format loader in the tree, so dropping a bare BurstGPT CSV (no --custom-dataset-type) tripped the ValueError on the CSV header and returned (None, None) — even though BurstGPTTraceDatasetLoader.can_load recognizes the format from its header columns on its own. Extracted a small _read_first_jsonl_record helper that returns None (instead of bailing) on a non-JSON first line so loaders' filename-based can_load checks still get a chance. OSError still bails the whole detection — that case genuinely cannot proceed.
Added regression coverage that was missing for burst_gpt_trace + --fixed-schedule: one integration test with explicit --custom-dataset-type, one with auto-detect (the second pins the _detect_type fix specifically). Plus two unit tests in test_resolver_edge_cases.py covering the _detect_type contract and the full-resolver has_timing pass for a bare CSV.

Note: the related _check_timing_data fix for BurstGPT already shipped to main in 83bdcd92 ("fix the nightly..."). That fix is still missing on release/0.9.0, where the user-reported bug was reproduced .

Summary by CodeRabbit

Bug Fixes
- Improved dataset auto-detection to avoid mis-parsing CSV headers as JSON, handle unreadable or non-JSON-first-line files gracefully, and restore fixed-schedule runs when dataset type isn’t provided.
Tests
- Added integration and unit tests plus a test utility to validate BurstGPT-style CSV detection, timing behavior, and end-to-end fixed-schedule execution.

…d_schedule regression Signed-off-by: lkomali <lkomali@nvidia.com>

copy-pr-bot · 2026-05-22T18:22:25Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions · 2026-05-22T18:22:30Z

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@75b717a1425ecb7a0591b91c2daf850e6c56f559

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@75b717a1425ecb7a0591b91c2daf850e6c56f559

Last updated for commit: 75b717a • Browse code

coderabbitai · 2026-05-22T18:27:46Z

Walkthrough

Refactors DatasetResolver JSONL probing into _read_first_jsonl_record, updates _detect_type to use it (only catching OSError), and adds a CSV fixture helper plus unit and integration tests covering BurstGPT CSV detection and fixed-schedule execution.

Changes

Dataset Type Detection and CSV Handling

Layer / File(s)	Summary
JSONL parsing and dataset type detection refactoring `src/aiperf/config/dataset/resolver.py`	New `_read_first_jsonl_record(file_path)` returns the first parsed JSON dict or `None` for empty/non-JSON first lines and lets `OSError` propagate. `_detect_type` now uses this helper and only catches `OSError` for non-directory paths.
Test fixture utility for BurstGPT CSV `tests/integration/utils.py`	`create_burst_gpt_csv_file` writes a BurstGPT-style CSV using the first row's keys as header, validates non-empty rows, and returns the created file path.
Integration and unit tests for resolver CSV detection `tests/integration/test_burst_gpt_trace.py`, `tests/unit/config/test_resolver_edge_cases.py`	Integration tests exercise fixed-schedule with explicit and auto-detected `burst_gpt_trace`. Unit tests verify `_detect_type` classifies BurstGPT CSVs when the first line is a CSV header, that probing does not log JSON parse warnings, that `_read_first_jsonl_record` returns `None` for binary/non-object/empty files, and that end-to-end resolver marks dataset timing data.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I sniffed a header, plain and shy,
Not JSON—so I let it lie.
A helper reads the very first line,
Returns None when JSON won't align.
Now schedules hop along in time.

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately describes the main changes: fixing auto-detection of BurstGPT CSV files in DatasetResolver and adding regression coverage for the fixed_schedule issue.
Docstring Coverage	✅ Passed	Docstring coverage is 94.12% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

tests/integration/test_burst_gpt_trace.py (1)

110-112: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Assert outputs in the auto-detect path as well.

The explicit-type test checks has_all_outputs, but the auto-detect regression test currently does not. Adding the same assertion improves end-to-end parity.

💡 Proposed fix

         assert result.exit_code == 0
         assert result.request_count == len(rows)
+        assert result.has_all_outputs

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_burst_gpt_trace.py` around lines 110 - 112, The
auto-detect regression test is missing the parity check for outputs: after the
existing assertions (assert result.exit_code == 0 and assert
result.request_count == len(rows)) add the same end-to-end output assertion used
in the explicit-type test by asserting result.has_all_outputs to ensure all
expected outputs were produced; locate this in the auto-detect test in
tests/integration/test_burst_gpt_trace.py and insert the assert for
result.has_all_outputs immediately after the other assertions.

🧹 Nitpick comments (2)

tests/integration/test_burst_gpt_trace.py (1)

46-51: ⚡ Quick win

Add explicit return type hints to async test methods.

Both async test functions should declare -> None to match the repo-wide typing rule.

💡 Proposed fix

     async def test_fixed_schedule_with_explicit_dataset_type(
         self,
         cli: AIPerfCLI,
         aiperf_mock_server: AIPerfMockServer,
         tmp_path: Path,
-    ):
+    ) -> None:
@@
     async def test_fixed_schedule_auto_detected(
         self,
         cli: AIPerfCLI,
         aiperf_mock_server: AIPerfMockServer,
         tmp_path: Path,
-    ):
+    ) -> None:

As per coding guidelines: Type hints on ALL functions (parameters and return types).

Also applies to: 81-86

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/test_burst_gpt_trace.py` around lines 46 - 51, Add explicit
return type hints "-> None" to the async test function definitions: update the
signature of test_fixed_schedule_with_explicit_dataset_type and the other async
test function in this file to declare their return type as None (e.g., "async
def test_fixed_schedule_with_explicit_dataset_type(... ) -> None:") to satisfy
the repository typing rule for all functions.

tests/integration/utils.py (1)

136-145: ⚡ Quick win

Validate required BurstGPT columns in the fixture helper.

The helper docstring promises required columns, but that contract is not enforced. Add explicit validation so malformed fixtures fail early with clear errors.

💡 Proposed fix

     if not rows:
         raise ValueError("rows must not be empty")
 
-    fieldnames = list(rows[0].keys())
+    required_columns = {"Timestamp", "Request tokens", "Response tokens"}
+    first_keys = set(rows[0].keys())
+    missing = required_columns - first_keys
+    if missing:
+        raise ValueError(
+            f"rows[0] missing required BurstGPT columns: {sorted(missing)}"
+        )
+
+    fieldnames = list(rows[0].keys())
+    fieldnames_set = set(fieldnames)
     csv_path = tmp_path / filename
     with open(csv_path, "w", newline="", encoding="utf-8") as f:
         writer = csv.DictWriter(f, fieldnames=fieldnames)
         writer.writeheader()
         for row in rows:
+            extra = set(row.keys()) - fieldnames_set
+            if extra:
+                raise ValueError(
+                    f"row has unexpected columns not present in header: {sorted(extra)}"
+                )
             writer.writerow(row)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/integration/utils.py` around lines 136 - 145, The CSV fixture helper
currently accepts rows without enforcing the required BurstGPT columns; add an
explicit validation after computing fieldnames to ensure required_columns
(define a list like required_columns = ["<COLUMN1>", "<COLUMN2>", ...] replaced
with the actual BurstGPT-required column names) are all present in fieldnames,
and if any are missing raise a ValueError that lists the missing columns; apply
this check in the same function around the rows/fieldnames variables (before
opening csv_path and before creating the DictWriter) so malformed fixtures fail
fast with a clear error message.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/config/dataset/resolver.py`:
- Around line 225-245: The helper _read_first_jsonl_record currently returns any
JSON value which can be a non-dict and later break callers that expect to call
.get(); update _read_first_jsonl_record to parse the first non-blank line with
load_json_str but only return the parsed value when it is an instance of dict,
otherwise return None; keep the existing behavior of propagating OSError and
still returning None for non-JSON first lines by catching ValueError as before.

---

Outside diff comments:
In `@tests/integration/test_burst_gpt_trace.py`:
- Around line 110-112: The auto-detect regression test is missing the parity
check for outputs: after the existing assertions (assert result.exit_code == 0
and assert result.request_count == len(rows)) add the same end-to-end output
assertion used in the explicit-type test by asserting result.has_all_outputs to
ensure all expected outputs were produced; locate this in the auto-detect test
in tests/integration/test_burst_gpt_trace.py and insert the assert for
result.has_all_outputs immediately after the other assertions.

---

Nitpick comments:
In `@tests/integration/test_burst_gpt_trace.py`:
- Around line 46-51: Add explicit return type hints "-> None" to the async test
function definitions: update the signature of
test_fixed_schedule_with_explicit_dataset_type and the other async test function
in this file to declare their return type as None (e.g., "async def
test_fixed_schedule_with_explicit_dataset_type(... ) -> None:") to satisfy the
repository typing rule for all functions.

In `@tests/integration/utils.py`:
- Around line 136-145: The CSV fixture helper currently accepts rows without
enforcing the required BurstGPT columns; add an explicit validation after
computing fieldnames to ensure required_columns (define a list like
required_columns = ["<COLUMN1>", "<COLUMN2>", ...] replaced with the actual
BurstGPT-required column names) are all present in fieldnames, and if any are
missing raise a ValueError that lists the missing columns; apply this check in
the same function around the rows/fieldnames variables (before opening csv_path
and before creating the DictWriter) so malformed fixtures fail fast with a clear
error message.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6b868e94-47bf-4ecb-929f-458b702e3827

📥 Commits

Reviewing files that changed from the base of the PR and between bd5467f and 95e1abd.

📒 Files selected for processing (4)

src/aiperf/config/dataset/resolver.py
tests/integration/test_burst_gpt_trace.py
tests/integration/utils.py
tests/unit/config/test_resolver_edge_cases.py

codecov · 2026-05-22T18:31:20Z

Codecov Report

❌ Patch coverage is 94.73684% with 1 line in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/aiperf/config/dataset/resolver.py	94.73%	0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

ajcasagrande

The fix looks functionally sound: the new resolver path lets BurstGPT CSVs reach structural loader detection, and both targeted integration tests plus a manual aiperf profile run against the in-repo mock server pass. I found one low-severity cleanup item: the expected non-JSON CSV probe still goes through load_json_str, so successful auto-detection emits a scary JSON parse warning before continuing. I’d fix that first; no correctness blocker found.

…JSON-parse warning Signed-off-by: lkomali <lkomali@nvidia.com>

…SV probe Signed-off-by: lkomali <lkomali@nvidia.com>

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/config/dataset/resolver.py`:
- Around line 246-248: The file-reading in
DatasetResolver._read_first_jsonl_record currently calls open(file_path) which
uses the system locale and can cause inconsistent UnicodeDecodeError; change the
open call to explicitly pass encoding="utf-8" (e.g., open(file_path,
encoding="utf-8")) so the probe decoding is pinned to UTF‑8 when reading the
first JSONL record and subsequent lines.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 65684e59-46fb-415b-b764-ca0c3090b0c0

📥 Commits

Reviewing files that changed from the base of the PR and between d7b6f43 and a27a7b9.

📒 Files selected for processing (3)

src/aiperf/config/dataset/resolver.py
tests/integration/test_burst_gpt_trace.py
tests/unit/config/test_resolver_edge_cases.py

ajcasagrande

Reviewed for correctness against origin/main; no blocking findings. Focused resolver unit coverage passed, and the new BurstGPT integration tests passed when selected with the integration marker.

…minism Signed-off-by: lkomali <lkomali@nvidia.com>

coderabbitai

🧹 Nitpick comments (1)

src/aiperf/config/dataset/resolver.py (1)
246-247: 🏗️ Heavy lift

Avoid extending the synchronous file-I/O path here.

open(...) keeps the JSONL probe on blocking I/O in the exact path this PR touches. Please route this read through an async helper instead of adding another synchronous file access point in DatasetResolver.

As per coding guidelines, **/*.py: Use async/await for ALL I/O operations - no time.sleep, no blocking calls.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/aiperf/config/dataset/resolver.py` around lines 246 - 247, The
synchronous open(...) call in DatasetResolver (the method that probes/parses the
JSONL at file_path) must be replaced with an async file-read helper so we avoid
blocking I/O; modify the method to await an async reader (e.g., use an existing
project helper like async_read_text or use aiofiles.open within an async
context) to read the file contents instead of using open(...), keep the rest of
the parsing logic identical, and ensure the method signature remains async so
callers can await it.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@src/aiperf/config/dataset/resolver.py`:
- Around line 246-247: The synchronous open(...) call in DatasetResolver (the
method that probes/parses the JSONL at file_path) must be replaced with an async
file-read helper so we avoid blocking I/O; modify the method to await an async
reader (e.g., use an existing project helper like async_read_text or use
aiofiles.open within an async context) to read the file contents instead of
using open(...), keep the rest of the parsing logic identical, and ensure the
method signature remains async so callers can await it.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2af67c68-4ff7-4371-892e-334e999cb020

📥 Commits

Reviewing files that changed from the base of the PR and between a27a7b9 and 75b717a.

📒 Files selected for processing (1)

src/aiperf/config/dataset/resolver.py

dynamo-ops

Previous review comments have been addressed. Approving.

fix(config): auto-detect BurstGPT CSV in DatasetResolver and pin fixe…

95e1abd

…d_schedule regression Signed-off-by: lkomali <lkomali@nvidia.com>

github-actions Bot added the fix label May 22, 2026

lkomali requested review from ajcasagrande, debermudez and ganeshku1 May 22, 2026 18:24

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread src/aiperf/config/dataset/resolver.py

dynamo-ops reviewed May 22, 2026

View reviewed changes

Comment thread src/aiperf/config/dataset/resolver.py Outdated

ajcasagrande reviewed May 22, 2026

View reviewed changes

Comment thread src/aiperf/config/dataset/resolver.py Outdated

fix(config): use orjson directly in CSV probe to suppress misleading …

d7b6f43

…JSON-parse warning Signed-off-by: lkomali <lkomali@nvidia.com>

lkomali requested a review from ajcasagrande May 22, 2026 18:47

fix(config): catch UnicodeDecodeError and guard dict-only return in C…

a27a7b9

…SV probe Signed-off-by: lkomali <lkomali@nvidia.com>

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

Comment thread src/aiperf/config/dataset/resolver.py

dynamo-ops approved these changes May 22, 2026

View reviewed changes

ajcasagrande approved these changes May 22, 2026

View reviewed changes

fix(config): pin CSV probe to utf-8 encoding for cross-platform deter…

75b717a

…minism Signed-off-by: lkomali <lkomali@nvidia.com>

coderabbitai Bot reviewed May 22, 2026

View reviewed changes

dynamo-ops approved these changes May 22, 2026

View reviewed changes

lkomali merged commit 9fc0b76 into main May 22, 2026
20 checks passed

lkomali deleted the lkomali/burst-gpt-fixed-schedule-resolver branch May 22, 2026 20:37

lkomali mentioned this pull request May 22, 2026

fix(config): backport BurstGPT CSV + fixed_schedule fixes to release/0.9.0 #988

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(config): auto-detect BurstGPT CSV in DatasetResolver and pin fixed_schedule regression#984

fix(config): auto-detect BurstGPT CSV in DatasetResolver and pin fixed_schedule regression#984
lkomali merged 4 commits into
mainfrom
lkomali/burst-gpt-fixed-schedule-resolver

lkomali commented May 22, 2026 •

edited by coderabbitai Bot

Loading

copy-pr-bot Bot commented May 22, 2026

github-actions Bot commented May 22, 2026 •

edited

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading

coderabbitai Bot left a comment

Uh oh!

codecov Bot commented May 22, 2026 •

edited

Loading

Uh oh!

ajcasagrande left a comment

Uh oh!

coderabbitai Bot left a comment

Uh oh!

ajcasagrande left a comment

coderabbitai Bot left a comment

dynamo-ops left a comment

Uh oh!

Labels

3 participants

Uh oh!

Conversation

lkomali commented May 22, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Summary by CodeRabbit

copy-pr-bot Bot commented May 22, 2026

github-actions Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Try out this PR

coderabbitai Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

codecov Bot commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

ajcasagrande left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

ajcasagrande left a comment

Choose a reason for hiding this comment

coderabbitai Bot left a comment

Choose a reason for hiding this comment

dynamo-ops left a comment

Choose a reason for hiding this comment

Uh oh!

Labels

3 participants

lkomali commented May 22, 2026 •

edited by coderabbitai Bot

Loading

github-actions Bot commented May 22, 2026 •

edited

Loading

coderabbitai Bot commented May 22, 2026 •

edited

Loading

codecov Bot commented May 22, 2026 •

edited

Loading