Skip to content

fix(dataset): ShareGPT multi-turn handling#828

Merged
lkomali merged 11 commits into
ai-dynamo:mainfrom
ankanand-nv:fix/sharegpt-multi-turn-798
May 22, 2026
Merged

fix(dataset): ShareGPT multi-turn handling#828
lkomali merged 11 commits into
ai-dynamo:mainfrom
ankanand-nv:fix/sharegpt-multi-turn-798

Conversation

@ankanand-nv

@ankanand-nv ankanand-nv commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Improves ShareGPT multi-turn dataset behavior.

Fixes #798

Summary by CodeRabbit

  • New Features

    • Multi-turn conversation support across dataset loaders with single-turn as default.
    • Public multi_turn configuration option to enable multi-turn behavior.
    • Role-aware pairing (human/user → assistant); images attached only to the first turn.
    • Optional per-turn token-length validation when a tokenizer is provided; invalid or incomplete pairs are skipped.
  • Tests

    • Added unit tests covering multi-turn conversion, image handling, skipping invalid/empty pairs, and token-length validation.

Review Change Stack

@copy-pr-bot

copy-pr-bot Bot commented Apr 15, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Loader metadata gains a multi_turn flag; the composer forwards it into loader kwargs. HFConversation, ShareGPT, SpecBench, and SpeedBench loaders now support multi-turn conversion (HF adds optional tokenizer-based per-pair validation). Tests and JSON schema updated to cover multi-turn behavior.

Changes

Multi-turn public dataset loaders

Layer / File(s) Summary
Schema / Config
src/aiperf/plugin/schema/schemas.py, src/aiperf/plugin/schema/plugins.schema.json
Adds multi_turn: bool = False to PublicDatasetLoaderMetadata and JSON schema to expose multi-turn configuration.
Loader wiring
src/aiperf/dataset/composer/public.py
_build_loader_kwargs() now conditionally includes multi_turn=True when loader metadata sets it.
ShareGPT pairing & conversion
src/aiperf/dataset/loader/sharegpt.py, tests/unit/dataset/loader/test_sharegpt.py
Adds _sharegpt_prompt_completion_pairs() to derive ordered human→gpt pairs, validates token lengths per pair, accumulates Turns across pairs, and skips entries with no valid pairs or invalid token checks.
SpecBench multi-turn conversion
src/aiperf/dataset/loader/spec_bench.py, tests/unit/dataset/loader/test_spec_bench_loader.py
Adds multi_turn init flag and conversion that uses all non-empty turns entries when enabled; preserves single-turn default and adds tests for list/type/empty/None cases.
SpeedBench multi-turn conversion
src/aiperf/dataset/loader/speed_bench.py, tests/unit/dataset/loader/test_speed_bench.py
Adds multi_turn init flag, documents behavior, and converts turns into multiple Turns when enabled; retains single-turn default and associated tests.
HF loader — tokenizer & multi-turn
src/aiperf/dataset/loader/hf_conversation.py, tests/unit/dataset/loader/test_hf_conversation_loader.py
Adds multi_turn and optional tokenizer to __init__, message normalization helpers, per-pair tokenizer validation, and multi-turn Conversation construction (images attached only to first turn); tests for multi- and single-turn scenarios.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

🐰 I hopped through lines, found echoes of chat,
Paired voices in order, where many were at.
First turn keeps the picture, the rest take their cue,
Tokens counted politely — more turns to review! 🥕

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 41.30% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'fix(dataset): ShareGPT multi-turn handling' is specific and accurately describes the main change. While the PR includes broader multi-turn support across multiple loaders, the title captures the primary fix (ShareGPT) which is the focus of the referenced issue #798.
Linked Issues check ✅ Passed The PR implements all objectives from issue #798: role-based filtering replaces hardcoded extraction [sharegpt.py], rows with <2 turns are skipped [sharegpt.py], multiple human→gpt pairs generate multiple Turn objects [sharegpt.py], per-turn max_tokens computed from completion length [sharegpt.py], malformed entries skipped [sharegpt.py], and Conversation objects with session IDs and assembled turns created [sharegpt.py]. Additionally, multi-turn support is extended to HFConversationDatasetLoader, SpecBenchLoader, and SpeedBenchLoader [hf_conversation.py, spec_bench.py, speed_bench.py], with corresponding schema updates [schemas.py, plugins.schema.json] and comprehensive test coverage [test files].
Out of Scope Changes check ✅ Passed Changes extend beyond issue #798's ShareGPT scope to implement multi-turn support across four loaders (HFConversation, ShareGPT, SpecBench, SpeedBench) and add schema fields. However, PR objectives explicitly define these as intentional expansions: multi_turn flag added to metadata schema and multiple loaders support it. The PR discussion confirms this design decision is by intent, not accidental scope creep.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions

github-actions Bot commented Apr 15, 2026

Copy link
Copy Markdown

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@5d82459289f294237783d4cab129f0056aaa3277

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@5d82459289f294237783d4cab129f0056aaa3277

Last updated for commit: 5d82459Browse code

@github-actions github-actions Bot added the fix label Apr 15, 2026
@codecov

codecov Bot commented Apr 15, 2026

Copy link
Copy Markdown
@ganeshku1 ganeshku1 requested a review from lkomali April 15, 2026 00:45

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (1)
src/aiperf/dataset/loader/hf_conversation.py (1)

214-235: Consider caching completion token lengths to avoid duplicate tokenization.

When tokenizer is provided, completions are tokenized twice: once in _pairs_pass_validation() (line 180-181) and again in the Turn construction (line 223). For large datasets, this doubles the tokenization overhead.

♻️ Suggested optimization

Modify _pairs_pass_validation to return token lengths alongside the validation result, or compute lengths once before validation:

-    def _pairs_pass_validation(self, pairs: list[tuple[str, str]]) -> bool:
+    def _pairs_pass_validation(
+        self, pairs: list[tuple[str, str]]
+    ) -> list[int] | None:
+        """Validate pairs and return completion token lengths, or None if invalid."""
         if self.tokenizer is None:
-            return True
+            return [0] * len(pairs)  # placeholder when no tokenizer
+        completion_lengths: list[int] = []
         for prompt, completion in pairs:
             prompt_length = len(self.tokenizer.encode(prompt))
             completion_length = len(self.tokenizer.encode(completion))
             if not self.is_valid_sequence(
                 prompt_len=prompt_length,
                 output_len=completion_length,
                 skip_min_output_len_check=self.output_tokens_mean is not None,
             ):
-                return False
-        return True
+                return None
+            completion_lengths.append(completion_length)
+        return completion_lengths

Then use the returned lengths directly in Turn construction.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/aiperf/dataset/loader/hf_conversation.py` around lines 214 - 235, The
code currently tokenizes completions twice (once in _pairs_pass_validation and
again when constructing Turn.max_tokens); update the flow to compute token
lengths only once by either changing _pairs_pass_validation to return token
lengths alongside its validation result or by computing lengths immediately
after _prompt_completion_pairs and passing them into validation; then use those
cached lengths when building each Turn (in the Conversation creation block that
uses Turn, tokenizer, and session_id_generator) instead of calling
tokenizer.encode a second time. Ensure the returned/produced structure maps each
(prompt, completion) pair to its token length so Turn.max_tokens can consume the
cached value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiperf/dataset/loader/sharegpt.py`:
- Around line 152-153: Run the project formatter ( Ruff/ruff-format via
pre-commit ) on src/aiperf/dataset/loader/sharegpt.py and re-commit the changes
so the declaration region (variables like pairs and i) matches the repository
formatting rules; you can run the configured pre-commit hooks (or ruff --fix /
ruff format) locally and then re-push the updated file.

In `@src/aiperf/dataset/loader/spec_bench.py`:
- Around line 51-53: The loader currently skips multi-turn rows when
turns_raw[0] is empty, causing valid later turns to be dropped; change the
precondition to check whether any turn in turns_raw is non-empty (e.g.,
any(str(t).strip() for t in turns_raw)) before incrementing skipped, and when
building turns for multi-turn mode use all non-empty turns rather than relying
on turns_raw[0]; update the same logic used later in the block (the code around
turns_raw and skipped) to apply this “any non-empty turn” rule consistently.

In `@src/aiperf/dataset/loader/speed_bench.py`:
- Around line 82-84: The current validation rejects rows by checking
turns_raw[0], which applies single-turn validation to multi-turn mode; change
the initial check on turns_raw to only ensure it's a non-empty list (e.g.,
turns_raw is truthy and isinstance(list) and len(turns_raw) > 0) and remove the
turns_raw[0] emptiness test there; move per-turn emptiness checks into the
single-turn branch (when you expect exactly one turn) and into the multi-turn
processing where you filter/skip empty turns, and apply the same fix to the
similar checks in the block handling turns (the code handling turns_raw/skipped
in the 87-112 region) so rows aren't discarded just because the first element is
empty.

In `@tests/unit/dataset/loader/test_sharegpt.py`:
- Around line 109-112: The test hunk formatting was changed by ruff-format in
CI; re-run your formatter (pre-commit or ruff-format) locally and apply its
changes to this test block so the assertions match the formatted style (look for
the assertions using conv.turns[0].texts, conv.turns[0].max_tokens,
conv.turns[1].texts, and conv.turns[1].max_tokens) and then commit the resulting
diff so CI no longer fails the formatting check.

---

Nitpick comments:
In `@src/aiperf/dataset/loader/hf_conversation.py`:
- Around line 214-235: The code currently tokenizes completions twice (once in
_pairs_pass_validation and again when constructing Turn.max_tokens); update the
flow to compute token lengths only once by either changing
_pairs_pass_validation to return token lengths alongside its validation result
or by computing lengths immediately after _prompt_completion_pairs and passing
them into validation; then use those cached lengths when building each Turn (in
the Conversation creation block that uses Turn, tokenizer, and
session_id_generator) instead of calling tokenizer.encode a second time. Ensure
the returned/produced structure maps each (prompt, completion) pair to its token
length so Turn.max_tokens can consume the cached value.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ec8d6d27-3799-4b58-89a7-cfb4fb5fd393

📥 Commits

Reviewing files that changed from the base of the PR and between f4e97f9 and 6553318.

📒 Files selected for processing (10)
  • src/aiperf/dataset/composer/public.py
  • src/aiperf/dataset/loader/hf_conversation.py
  • src/aiperf/dataset/loader/sharegpt.py
  • src/aiperf/dataset/loader/spec_bench.py
  • src/aiperf/dataset/loader/speed_bench.py
  • src/aiperf/plugin/schema/schemas.py
  • tests/unit/dataset/loader/test_hf_conversation_loader.py
  • tests/unit/dataset/loader/test_sharegpt.py
  • tests/unit/dataset/loader/test_spec_bench_loader.py
  • tests/unit/dataset/loader/test_speed_bench.py
Comment thread src/aiperf/dataset/loader/sharegpt.py
Comment thread src/aiperf/dataset/loader/spec_bench.py Outdated
Comment thread src/aiperf/dataset/loader/speed_bench.py
Comment thread tests/unit/dataset/loader/test_sharegpt.py Outdated
Comment thread src/aiperf/dataset/loader/sharegpt.py Outdated
Comment thread src/aiperf/dataset/loader/sharegpt.py Outdated
Comment thread src/aiperf/dataset/loader/speed_bench.py Outdated
@lkomali

lkomali commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

A few design questions on my mind.

  1. Should users be able to toggle ShareGPT between single-turn and multi-turn?
  2. If yes, should we leverage a --multi-turn CLI flag?
  3. From what I understand, the assistant/gpt responses from the dataset are being discarded and we use the live model response as context for the next turn. Is that the right behavior? Or should we be using the dataset responses?

cc: @ajcasagrande @debermudez

@lkomali

lkomali commented Apr 15, 2026

Copy link
Copy Markdown
Contributor

Can you run pre-commit run --all-files?

ankanand-nv added a commit to ankanand-nv/aiperf that referenced this pull request Apr 15, 2026
- Use .get("value") instead of ["value"] in ShareGPT pairing (KeyError safety)
- Move turns_raw[0] check into single-turn branch only (multi-turn was
  incorrectly dropping rows with empty first turn but valid later turns)
- Update SpeedBench docstring to mention multi-turn support
- Cache tokenization in HF conversation loader (avoid double encode)
- Fix ruff formatting on sharegpt.py and test_sharegpt.py
- Regenerate plugin schema artifacts

Signed-off-by: Ankit Anand <ankanand@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/aiperf/dataset/loader/spec_bench.py`:
- Around line 51-73: The loader currently accepts any type for turns_raw and
stringifies None into "None" and can split strings into char turns; fix by first
validating turns_raw is a list (if not, increment skipped and continue), then
when building conv_turns iterate only over elements where t is not None and not
an empty string, converting to text with text = str(t).strip() only for non-None
values; similarly for the single-turn branch ensure turns_raw is a non-empty
list and the first element is not None before doing prompt =
str(turns_raw[0]).strip(), otherwise increment skipped and continue; update code
around the turns_raw variable, the multi_turn branch that constructs Turn/Text
and the single-turn prompt logic, using session_id_generator.next() and
Conversation/Turn/Text as before.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 655b7472-bd2a-4dd6-84af-444f214406b1

📥 Commits

Reviewing files that changed from the base of the PR and between 6553318 and 5497cd8.

📒 Files selected for processing (6)
  • src/aiperf/dataset/loader/hf_conversation.py
  • src/aiperf/dataset/loader/sharegpt.py
  • src/aiperf/dataset/loader/spec_bench.py
  • src/aiperf/dataset/loader/speed_bench.py
  • src/aiperf/plugin/schema/plugins.schema.json
  • tests/unit/dataset/loader/test_sharegpt.py
🚧 Files skipped from review as they are similar to previous changes (4)
  • tests/unit/dataset/loader/test_sharegpt.py
  • src/aiperf/dataset/loader/sharegpt.py
  • src/aiperf/dataset/loader/speed_bench.py
  • src/aiperf/dataset/loader/hf_conversation.py
Comment thread src/aiperf/dataset/loader/spec_bench.py Outdated
@ankanand-nv

Copy link
Copy Markdown
Contributor Author

1. Should users be able to toggle ShareGPT between single-turn and multi-turn?

I don't think so — looking at the code, ShareGPT is inherently multi-turn. The loader already extracts all human→gpt pairs from every conversation via _sharegpt_prompt_completion_pairs and creates a Turn for each. So there's really no single-turn mode to toggle to since the dataset format itself is conversational. The multi_turn flag in this PR is meant for the other loaders (HFConversationDatasetLoader, SpeedBenchLoader, SpecBenchLoader) which default to single-turn and need an explicit opt-in.

Worth noting — the plugins.yaml description for ShareGPT says "processes the first prompt/completion pair" which seems like a pre-existing doc inaccuracy. The code actually processes all pairs. We could fix that separately.

2. If yes, should we leverage a --multi-turn CLI flag?

Since I think #1 is a no, this probably isn't needed for ShareGPT right now. For the other loaders, multi_turn is configured per-plugin in plugins.yaml metadata (PublicDatasetLoaderMetadata), which follows the existing pattern. A CLI override flag could make sense as a follow-up for convenience, but I'd keep it out of this PR to keep scope tight.

3. Are dataset assistant/gpt responses discarded? Is that correct?

Yes — and I believe that's the right call. The dataset completion text is only used to set max_tokens (how many tokens the model should generate). The actual assistant context that builds up across turns comes from the model's live inference responses, which is the default DELTAS_WITHOUT_RESPONSES context mode in session_manager.py.

So for a 3-turn conversation the requests end up looking like:

  • Turn 1: [user: prompt_1] → model responds
  • Turn 2: [user: prompt_1, assistant: <live_response_1>, user: prompt_2] → model responds
  • Turn 3: [..., assistant: <live_response_2>, user: prompt_3]

I think this makes sense for benchmarking real multi-turn serving — we want to test the model handling growing context from its own responses, not canned dataset answers. The framework does already support DELTAS_WITH_RESPONSES and MESSAGE_ARRAY_WITH_RESPONSES for datasets that include pre-canned assistant turns, but ShareGPT doesn't need those.

@lkomali

@lkomali lkomali left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR looks great! Thanks for working on this.
A couple of comments and a general nit
A few places uses non-descriptive variables like i, j, m etc. More descriptive names would improve readability.

Comment thread src/aiperf/dataset/loader/hf_conversation.py
Comment thread src/aiperf/dataset/loader/hf_conversation.py
Comment thread src/aiperf/dataset/loader/speed_bench.py Outdated
Comment thread src/aiperf/dataset/loader/sharegpt.py Outdated

@lkomali lkomali left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Just a couple of nits.

Comment thread src/aiperf/dataset/loader/hf_conversation.py Outdated
Comment thread src/aiperf/dataset/loader/hf_conversation.py Outdated
Comment thread src/aiperf/dataset/loader/hf_conversation.py

@lkomali lkomali left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fixes! Approving.

Comment thread src/aiperf/dataset/loader/hf_conversation.py
ankanand-nv and others added 8 commits May 22, 2026 12:36
ShareGPTLoader only used the first two conversation messages. When entries
use from=human/gpt roles, collect every adjacent human→gpt pair as a Turn.
Keep the previous first-two-values behavior when role fields are absent.

Fixes ai-dynamo#798

Signed-off-by: Ankit Anand <ankanand@nvidia.com>
Made-with: Cursor
- Add opt-in multi_turn to SpeedBenchLoader and SpecBenchLoader
- Convert _sharegpt_style_pairs to instance method (remove untyped text_fn)
- Move multi_turn composer wiring to apply to all loaders
- Add missing tests: system msgs, empty assistant, image+multi-turn,
  validation rejection, SpeedBench/SpecBench multi-turn

Signed-off-by: Ankit Anand <ankanand@nvidia.com>
- Use .get("value") instead of ["value"] in ShareGPT pairing (KeyError safety)
- Move turns_raw[0] check into single-turn branch only (multi-turn was
  incorrectly dropping rows with empty first turn but valid later turns)
- Update SpeedBench docstring to mention multi-turn support
- Cache tokenization in HF conversation loader (avoid double encode)
- Fix ruff formatting on sharegpt.py and test_sharegpt.py
- Regenerate plugin schema artifacts

Signed-off-by: Ankit Anand <ankanand@nvidia.com>
- Tighten _normalize_messages and _openai_style_pairs in HFConversationDatasetLoader
- Defer _select_model_name() in ShareGPTLoader until row is fully validated
- Add isinstance(list) guard in SpecBenchLoader for malformed turns
- Condense null-safe single-turn extraction in SpeedBench/SpecBench
- Add tests for missing value key, non-list turns, and null turn values

Signed-off-by: Ankit Anand <ankanand@nvidia.com>
- N1: fall through to OpenAI pairing when ShareGPT returns empty pairs
- N2: remove dead [0]*len(pairs) path; return [] when no tokenizer
- N3: prefer first user/human message in single-turn extraction

Signed-off-by: Ankit Anand <ankanand@nvidia.com>
Cover _extract_first_message preferring user/human role over system
or assistant preambles, and the fallback to the first message when
no role fields are present.

Signed-off-by: Ankit Anand <ankanand@nvidia.com>
Replace the obsolete UserConfig parameter (deleted from main during the
rebase window) with the BenchmarkRun-based __init__ pattern used by
ShareGPTLoader and HFConversationDatasetLoader. Update the matching
unit tests (spec_bench, speed_bench, hf_conversation) to use the
cli_config fixture + make_run_from_cli helper instead of a removed
user_config fixture.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: lkomali <lkomali@nvidia.com>
…HF multi-turn

Signed-off-by: lkomali <lkomali@nvidia.com>
@lkomali lkomali force-pushed the fix/sharegpt-multi-turn-798 branch from 175eeb9 to 6e84fe0 Compare May 22, 2026 20:24
…signature change

Signed-off-by: lkomali <lkomali@nvidia.com>
Comment thread src/aiperf/dataset/loader/sharegpt.py Outdated
@lkomali lkomali merged commit cc26cca into ai-dynamo:main May 22, 2026
22 of 23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 participants