fix(dataset): ShareGPT multi-turn handling#828
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughLoader metadata gains a ChangesMulti-turn public dataset loaders
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@5d82459289f294237783d4cab129f0056aaa3277Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@5d82459289f294237783d4cab129f0056aaa3277Last updated for commit: |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (1)
src/aiperf/dataset/loader/hf_conversation.py (1)
214-235: Consider caching completion token lengths to avoid duplicate tokenization.When
tokenizeris provided, completions are tokenized twice: once in_pairs_pass_validation()(line 180-181) and again in the Turn construction (line 223). For large datasets, this doubles the tokenization overhead.♻️ Suggested optimization
Modify
_pairs_pass_validationto return token lengths alongside the validation result, or compute lengths once before validation:- def _pairs_pass_validation(self, pairs: list[tuple[str, str]]) -> bool: + def _pairs_pass_validation( + self, pairs: list[tuple[str, str]] + ) -> list[int] | None: + """Validate pairs and return completion token lengths, or None if invalid.""" if self.tokenizer is None: - return True + return [0] * len(pairs) # placeholder when no tokenizer + completion_lengths: list[int] = [] for prompt, completion in pairs: prompt_length = len(self.tokenizer.encode(prompt)) completion_length = len(self.tokenizer.encode(completion)) if not self.is_valid_sequence( prompt_len=prompt_length, output_len=completion_length, skip_min_output_len_check=self.output_tokens_mean is not None, ): - return False - return True + return None + completion_lengths.append(completion_length) + return completion_lengthsThen use the returned lengths directly in Turn construction.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/aiperf/dataset/loader/hf_conversation.py` around lines 214 - 235, The code currently tokenizes completions twice (once in _pairs_pass_validation and again when constructing Turn.max_tokens); update the flow to compute token lengths only once by either changing _pairs_pass_validation to return token lengths alongside its validation result or by computing lengths immediately after _prompt_completion_pairs and passing them into validation; then use those cached lengths when building each Turn (in the Conversation creation block that uses Turn, tokenizer, and session_id_generator) instead of calling tokenizer.encode a second time. Ensure the returned/produced structure maps each (prompt, completion) pair to its token length so Turn.max_tokens can consume the cached value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/aiperf/dataset/loader/sharegpt.py`:
- Around line 152-153: Run the project formatter ( Ruff/ruff-format via
pre-commit ) on src/aiperf/dataset/loader/sharegpt.py and re-commit the changes
so the declaration region (variables like pairs and i) matches the repository
formatting rules; you can run the configured pre-commit hooks (or ruff --fix /
ruff format) locally and then re-push the updated file.
In `@src/aiperf/dataset/loader/spec_bench.py`:
- Around line 51-53: The loader currently skips multi-turn rows when
turns_raw[0] is empty, causing valid later turns to be dropped; change the
precondition to check whether any turn in turns_raw is non-empty (e.g.,
any(str(t).strip() for t in turns_raw)) before incrementing skipped, and when
building turns for multi-turn mode use all non-empty turns rather than relying
on turns_raw[0]; update the same logic used later in the block (the code around
turns_raw and skipped) to apply this “any non-empty turn” rule consistently.
In `@src/aiperf/dataset/loader/speed_bench.py`:
- Around line 82-84: The current validation rejects rows by checking
turns_raw[0], which applies single-turn validation to multi-turn mode; change
the initial check on turns_raw to only ensure it's a non-empty list (e.g.,
turns_raw is truthy and isinstance(list) and len(turns_raw) > 0) and remove the
turns_raw[0] emptiness test there; move per-turn emptiness checks into the
single-turn branch (when you expect exactly one turn) and into the multi-turn
processing where you filter/skip empty turns, and apply the same fix to the
similar checks in the block handling turns (the code handling turns_raw/skipped
in the 87-112 region) so rows aren't discarded just because the first element is
empty.
In `@tests/unit/dataset/loader/test_sharegpt.py`:
- Around line 109-112: The test hunk formatting was changed by ruff-format in
CI; re-run your formatter (pre-commit or ruff-format) locally and apply its
changes to this test block so the assertions match the formatted style (look for
the assertions using conv.turns[0].texts, conv.turns[0].max_tokens,
conv.turns[1].texts, and conv.turns[1].max_tokens) and then commit the resulting
diff so CI no longer fails the formatting check.
---
Nitpick comments:
In `@src/aiperf/dataset/loader/hf_conversation.py`:
- Around line 214-235: The code currently tokenizes completions twice (once in
_pairs_pass_validation and again when constructing Turn.max_tokens); update the
flow to compute token lengths only once by either changing
_pairs_pass_validation to return token lengths alongside its validation result
or by computing lengths immediately after _prompt_completion_pairs and passing
them into validation; then use those cached lengths when building each Turn (in
the Conversation creation block that uses Turn, tokenizer, and
session_id_generator) instead of calling tokenizer.encode a second time. Ensure
the returned/produced structure maps each (prompt, completion) pair to its token
length so Turn.max_tokens can consume the cached value.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ec8d6d27-3799-4b58-89a7-cfb4fb5fd393
📒 Files selected for processing (10)
src/aiperf/dataset/composer/public.pysrc/aiperf/dataset/loader/hf_conversation.pysrc/aiperf/dataset/loader/sharegpt.pysrc/aiperf/dataset/loader/spec_bench.pysrc/aiperf/dataset/loader/speed_bench.pysrc/aiperf/plugin/schema/schemas.pytests/unit/dataset/loader/test_hf_conversation_loader.pytests/unit/dataset/loader/test_sharegpt.pytests/unit/dataset/loader/test_spec_bench_loader.pytests/unit/dataset/loader/test_speed_bench.py
|
A few design questions on my mind.
|
|
Can you run |
- Use .get("value") instead of ["value"] in ShareGPT pairing (KeyError safety)
- Move turns_raw[0] check into single-turn branch only (multi-turn was
incorrectly dropping rows with empty first turn but valid later turns)
- Update SpeedBench docstring to mention multi-turn support
- Cache tokenization in HF conversation loader (avoid double encode)
- Fix ruff formatting on sharegpt.py and test_sharegpt.py
- Regenerate plugin schema artifacts
Signed-off-by: Ankit Anand <ankanand@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/aiperf/dataset/loader/spec_bench.py`:
- Around line 51-73: The loader currently accepts any type for turns_raw and
stringifies None into "None" and can split strings into char turns; fix by first
validating turns_raw is a list (if not, increment skipped and continue), then
when building conv_turns iterate only over elements where t is not None and not
an empty string, converting to text with text = str(t).strip() only for non-None
values; similarly for the single-turn branch ensure turns_raw is a non-empty
list and the first element is not None before doing prompt =
str(turns_raw[0]).strip(), otherwise increment skipped and continue; update code
around the turns_raw variable, the multi_turn branch that constructs Turn/Text
and the single-turn prompt logic, using session_id_generator.next() and
Conversation/Turn/Text as before.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 655b7472-bd2a-4dd6-84af-444f214406b1
📒 Files selected for processing (6)
src/aiperf/dataset/loader/hf_conversation.pysrc/aiperf/dataset/loader/sharegpt.pysrc/aiperf/dataset/loader/spec_bench.pysrc/aiperf/dataset/loader/speed_bench.pysrc/aiperf/plugin/schema/plugins.schema.jsontests/unit/dataset/loader/test_sharegpt.py
🚧 Files skipped from review as they are similar to previous changes (4)
- tests/unit/dataset/loader/test_sharegpt.py
- src/aiperf/dataset/loader/sharegpt.py
- src/aiperf/dataset/loader/speed_bench.py
- src/aiperf/dataset/loader/hf_conversation.py
|
1. Should users be able to toggle ShareGPT between single-turn and multi-turn? I don't think so — looking at the code, ShareGPT is inherently multi-turn. The loader already extracts all human→gpt pairs from every conversation via Worth noting — the 2. If yes, should we leverage a Since I think #1 is a no, this probably isn't needed for ShareGPT right now. For the other loaders, 3. Are dataset assistant/gpt responses discarded? Is that correct? Yes — and I believe that's the right call. The dataset completion text is only used to set So for a 3-turn conversation the requests end up looking like:
I think this makes sense for benchmarking real multi-turn serving — we want to test the model handling growing context from its own responses, not canned dataset answers. The framework does already support |
lkomali
left a comment
There was a problem hiding this comment.
The PR looks great! Thanks for working on this.
A couple of comments and a general nit
A few places uses non-descriptive variables like i, j, m etc. More descriptive names would improve readability.
lkomali
left a comment
There was a problem hiding this comment.
Looks good! Just a couple of nits.
lkomali
left a comment
There was a problem hiding this comment.
Thanks for the quick fixes! Approving.
ShareGPTLoader only used the first two conversation messages. When entries use from=human/gpt roles, collect every adjacent human→gpt pair as a Turn. Keep the previous first-two-values behavior when role fields are absent. Fixes ai-dynamo#798 Signed-off-by: Ankit Anand <ankanand@nvidia.com> Made-with: Cursor
- Add opt-in multi_turn to SpeedBenchLoader and SpecBenchLoader - Convert _sharegpt_style_pairs to instance method (remove untyped text_fn) - Move multi_turn composer wiring to apply to all loaders - Add missing tests: system msgs, empty assistant, image+multi-turn, validation rejection, SpeedBench/SpecBench multi-turn Signed-off-by: Ankit Anand <ankanand@nvidia.com>
- Use .get("value") instead of ["value"] in ShareGPT pairing (KeyError safety)
- Move turns_raw[0] check into single-turn branch only (multi-turn was
incorrectly dropping rows with empty first turn but valid later turns)
- Update SpeedBench docstring to mention multi-turn support
- Cache tokenization in HF conversation loader (avoid double encode)
- Fix ruff formatting on sharegpt.py and test_sharegpt.py
- Regenerate plugin schema artifacts
Signed-off-by: Ankit Anand <ankanand@nvidia.com>
- Tighten _normalize_messages and _openai_style_pairs in HFConversationDatasetLoader - Defer _select_model_name() in ShareGPTLoader until row is fully validated - Add isinstance(list) guard in SpecBenchLoader for malformed turns - Condense null-safe single-turn extraction in SpeedBench/SpecBench - Add tests for missing value key, non-list turns, and null turn values Signed-off-by: Ankit Anand <ankanand@nvidia.com>
- N1: fall through to OpenAI pairing when ShareGPT returns empty pairs - N2: remove dead [0]*len(pairs) path; return [] when no tokenizer - N3: prefer first user/human message in single-turn extraction Signed-off-by: Ankit Anand <ankanand@nvidia.com>
Cover _extract_first_message preferring user/human role over system or assistant preambles, and the fallback to the first message when no role fields are present. Signed-off-by: Ankit Anand <ankanand@nvidia.com>
Replace the obsolete UserConfig parameter (deleted from main during the rebase window) with the BenchmarkRun-based __init__ pattern used by ShareGPTLoader and HFConversationDatasetLoader. Update the matching unit tests (spec_bench, speed_bench, hf_conversation) to use the cli_config fixture + make_run_from_cli helper instead of a removed user_config fixture. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: lkomali <lkomali@nvidia.com>
…HF multi-turn Signed-off-by: lkomali <lkomali@nvidia.com>
175eeb9 to
6e84fe0
Compare
…signature change Signed-off-by: lkomali <lkomali@nvidia.com>
…g to skip malformed rows Signed-off-by: lkomali <lkomali@nvidia.com>
Summary
Fixes #798
Summary by CodeRabbit
New Features
multi_turnconfiguration option to enable multi-turn behavior.Tests