feat(mock-server): add --record-requests for per-request ISL/OSL capture#962
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0712fbd6f10c87c1574b28c26cdd6498c8a7a7edRecommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@0712fbd6f10c87c1574b28c26cdd6498c8a7a7edLast updated for commit: |
ab21812 to
b539267
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis pull request adds per-request recording to the mock server: it captures ISL and requested-OSL per request, supports token-id prompts, accumulates per-endpoint samples and token-frequency counts, emits JSONL per request, and writes/prints per-endpoint quantiles, equal-width histograms, and optional vocab-distribution summaries on shutdown. ChangesRequest Recording Stats & Vocabulary Distribution
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
🧹 Nitpick comments (1)
tests/unit/aiperf_mock_server/test_request_recorder.py (1)
17-31: ⚡ Quick winAlign test names and signatures with repo test conventions.
Several tests don’t follow the required
test_<function>_<scenario>_<expected>pattern, and some test parameters are untyped (e.g.,values,expected,capsys). Please normalize names/signatures to the enforced style for consistency.As per coding guidelines: "
tests/**/*.py: Test naming convention:test_<function>_<scenario>_<expected>" and "**/*.py: Include type hints on ALL functions (params and return)".Also applies to: 85-149, 151-260
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/aiperf_mock_server/test_request_recorder.py` around lines 17 - 31, Test function and parameters in TestHistogram violate naming/type conventions: rename test_degenerate_inputs to follow test__histogram_degenerate_inputs_returns_expected and add full type hints to the signature (e.g., values: Sequence[float] and expected: Optional[Mapping[str, Any]]) and the return type -> None; update the pytest.mark.parametrize parameter names if necessary to match the typed signature and keep the same param cases (single_value, all_equal, empty_returns_none); locate the _histogram usage to ensure the test name references that function and apply the same renaming+type-hint pattern to the other affected tests referenced (lines 85-149,151-260) so all tests follow test_<function>_<scenario>_<expected> and include parameter and return type annotations.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/superpowers/plans/2026-05-18-recorder-stats-extension.md`:
- Around line 726-741: The fenced stdout sample in the markdown lacks a language
specifier; update the opening fence from ``` to ```text in the block containing
the "Request distribution (100 requests)" sample so the code fence is explicitly
labeled as text (this satisfies markdownlint MD040).
In `@docs/superpowers/specs/2026-05-18-recorder-stats-extension-design.md`:
- Around line 28-31: Update the Markdown fenced code blocks that are missing a
language tag by adding the `text` fence language; specifically, locate the block
containing the formulas starting with "num_bins = max(10, ceil((max - min) /
100))" and the output/example block starting with "/v1/completions n=100" (and
the other similar blocks around lines 71-83) and change their opening fences
from ``` to ```text so markdownlint MD040 is satisfied.
- Around line 1-6: Add the new docs file to the site index: open docs/index.yml
and insert an entry for
"superpowers/specs/2026-05-18-recorder-stats-extension-design.md" (the exact
filename from the diff) under the appropriate "superpowers" or "specs" section
following the existing entry format/indentation used in the file so the Fern
site will include the new spec; ensure the YAML key order and indentation match
other entries and commit the updated docs/index.yml.
In `@tests/aiperf_mock_server/README.md`:
- Around line 483-504: The fenced code block that begins with "Request
distribution (5000 requests)" is missing a language specifier; update that
triple-backtick fence to include "text" (i.e., change ``` to ```text) so the
example output block is explicitly marked as plain text for proper Markdown
rendering and tooling.
- Around line 548-560: Update the fenced code block in README.md that displays
the tests/aiperf_mock_server/ tree to include a language specifier by changing
the opening fence from ``` to ```text so the block becomes ```text followed by
the directory listing (keep the closing ``` unchanged); locate the code block in
the README where the project structure is shown and modify that fence, e.g., the
block that begins with the line showing "tests/aiperf_mock_server/" inside the
README.md.
In `@tests/aiperf_mock_server/request_recorder.py`:
- Around line 69-77: The record() and open()/close() methods perform blocking
tokenizer and file I/O (Tokenizer.from_pretrained, tokenizer.encode,
self._file.write, and Path.write_bytes) on the FastAPI event loop; move these
blocking calls off the loop by wrapping them in asyncio.to_thread or scheduling
them on a background thread/worker. Specifically: in open() call
Tokenizer.from_pretrained and open the file via asyncio.to_thread (or create the
tokenizer/file in a startup thread) instead of directly; in record() run
tokenizer.encode and all file.write calls inside asyncio.to_thread or a
threadpool task and await the result; in close() perform the final
Path.write_bytes or file close via asyncio.to_thread. Ensure self._tokenizer and
self._file are set/used only after the threaded operations complete to avoid
race conditions.
In `@tests/aiperf_mock_server/utils.py`:
- Around line 213-234: _maybe_record_request currently calls recorder.record
synchronously on the request path, causing blocking I/O; change it to offload
recording to an async-safe worker (e.g., schedule a background coroutine or call
asyncio.to_thread) instead of invoking recorder.record inline. Specifically,
keep using get_global_recorder(), _extract_request_content(), and
_extract_osl_fingerprint() to build the record payload, but replace the direct
recorder.record(...) call with an async-safe enqueue/schedule: either create a
background task that calls an async wrapper (e.g., _async_record(payload)) or
call asyncio.to_thread(recorder.record, ...) so disk writes happen off the event
loop; ensure any exceptions from the background call are caught/logged and do
not propagate to the request handler.
---
Nitpick comments:
In `@tests/unit/aiperf_mock_server/test_request_recorder.py`:
- Around line 17-31: Test function and parameters in TestHistogram violate
naming/type conventions: rename test_degenerate_inputs to follow
test__histogram_degenerate_inputs_returns_expected and add full type hints to
the signature (e.g., values: Sequence[float] and expected: Optional[Mapping[str,
Any]]) and the return type -> None; update the pytest.mark.parametrize parameter
names if necessary to match the typed signature and keep the same param cases
(single_value, all_equal, empty_returns_none); locate the _histogram usage to
ensure the test name references that function and apply the same
renaming+type-hint pattern to the other affected tests referenced (lines
85-149,151-260) so all tests follow test_<function>_<scenario>_<expected> and
include parameter and return type annotations.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 18423a08-df97-4299-b54c-763f6d0e803b
📒 Files selected for processing (12)
docs/superpowers/plans/2026-05-18-recorder-stats-extension.mddocs/superpowers/specs/2026-05-18-recorder-stats-extension-design.mdtests/aiperf_mock_server/README.mdtests/aiperf_mock_server/app.pytests/aiperf_mock_server/config.pytests/aiperf_mock_server/request_recorder.pytests/aiperf_mock_server/tokens.pytests/aiperf_mock_server/utils.pytests/integration/conftest.pytests/integration/test_mock_server_record_requests.pytests/unit/aiperf_mock_server/__init__.pytests/unit/aiperf_mock_server/test_request_recorder.py
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
There was a problem hiding this comment.
Actionable comments posted: 6
🧹 Nitpick comments (1)
docs/superpowers/plans/2026-05-18-recorder-vocab-distribution.md (1)
129-131: ⚡ Quick winAvoid machine-specific absolute paths in runnable steps.
cd /Users/...makes the plan non-portable for other contributors/CI docs runs. Prefer repo-root-relative commands (or omitcdwhen already at repo root).Also applies to: 1141-1143, 1164-1166, 1250-1258
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-05-18-recorder-vocab-distribution.md` around lines 129 - 131, The runnable steps use a machine-specific absolute path (e.g., "cd /Users/fdinatale/Code/aiperf/.worktrees/mock-server-request-recorder") which makes the plan non-portable; update those commands in the document (including the other occurrences noted) to use repo-root-relative paths or omit the cd entirely and run the pytest command from the repo root (replace the absolute-worktree cd with a relative path or nothing and keep the pytest invocation as-is) so contributors and CI can execute the steps without machine-specific paths.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/superpowers/plans/2026-05-18-recorder-stats-extension.md`:
- Around line 1-13: Add the new "Recorder Stats Extension Implementation Plan"
document to the docs site index by adding an entry for that filename under the
superpowers -> plans section of index.yml with matching indentation and
ordering; ensure the YAML entry references the document filename exactly, uses
the same display title, and follows existing docs/**/*.md indexing style so the
page is discoverable in the Fern site.
In `@docs/superpowers/plans/2026-05-18-recorder-vocab-distribution.md`:
- Around line 1201-1227: The fenced stdout sample block that begins with
"Request distribution (100 requests)" is missing a language tag; update its
opening fence from ``` to ```text so the block is marked as plain text (this
will satisfy markdownlint MD040). Locate the triple-backtick block containing
the histogram/vocab output and replace the fence accordingly (reference the
block starting "Request distribution (100 requests)" and the "Vocab used
5234/151936" lines to find it).
In `@docs/superpowers/specs/2026-05-18-recorder-stats-extension-design.md`:
- Around line 41-42: The spec currently conflicts about empty requested OSL:
choose one contract and make the doc and tests consistent—either (A) keep the
"empty OSL mirrors requested_osl: null" contract and update the tests to assert
emb_stats does not contain a requested_osl object (i.e.,
emb_stats["requested_osl"] is null/absent), or (B) adopt the non-null block
contract and update the examples/tests to require emb_stats["requested_osl"] to
be an object with histogram: null and unique_values: 0; update every reference
to requested_osl, emb_stats["requested_osl"]["histogram"], and
emb_stats["requested_osl"]["unique_values"] to match the chosen contract so the
spec and test examples are aligned.
In `@docs/superpowers/specs/2026-05-18-recorder-vocab-distribution-design.md`:
- Around line 98-122: The fenced stdout block starting with the triple-backtick
fence in the recorder-vocab-distribution example is missing a language tag;
update that opening fence from ``` to ```text so the block is marked as plain
text (this will satisfy MD040). Locate the fenced block shown in the diff (the
``` immediately before the "/v1/chat/completions n=100" sample) and change it
to ```text, leaving the content and closing fence unchanged.
- Around line 154-155: The `_render_vocab_lines` docstring incorrectly states a
“four-line stdout block” while enumerating more lines; update the description
for `_render_vocab_lines(vd: dict) -> list[str]` to state the correct line count
and list the exact layout components in order (e.g., headline, top, blank,
shape, ticks/sparkline—adjust count accordingly), and ensure the related note
for `_build_summary` that mentions `vocab_distribution =
_vocab_distribution(...)` remains consistent with the corrected line-count and
layout wording.
- Line 127: On the "top:" description line (the example that shows `"<text>"
<count>` joined with `, `) remove the padded space inside the inline code span
for the delimiter example so MD038 is not triggered; edit the snippet that
currently uses backticks around `, ` to instead use a non-padded inline code
span like `,` or rephrase to a normal inline example without the
trailing/leading space in the code span, keeping the rest of the `"<text>"
<count>` formatting intact.
---
Nitpick comments:
In `@docs/superpowers/plans/2026-05-18-recorder-vocab-distribution.md`:
- Around line 129-131: The runnable steps use a machine-specific absolute path
(e.g., "cd
/Users/fdinatale/Code/aiperf/.worktrees/mock-server-request-recorder") which
makes the plan non-portable; update those commands in the document (including
the other occurrences noted) to use repo-root-relative paths or omit the cd
entirely and run the pytest command from the repo root (replace the
absolute-worktree cd with a relative path or nothing and keep the pytest
invocation as-is) so contributors and CI can execute the steps without
machine-specific paths.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 2268156f-02af-4b6b-94c9-37efe2c98622
📒 Files selected for processing (4)
docs/superpowers/plans/2026-05-18-recorder-stats-extension.mddocs/superpowers/plans/2026-05-18-recorder-vocab-distribution.mddocs/superpowers/specs/2026-05-18-recorder-stats-extension-design.mddocs/superpowers/specs/2026-05-18-recorder-vocab-distribution-design.md
|
I was working on trying to verify #803 due to power mismatches and wanted to try and verify the dataset. Here's what the final output looks like from the mock server. We can see how the ISL/OSL are distributed and the vocab distribution. We can extend this in the future if we need more detailed stats, but this was a start to verify/compare benchmark tools. |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/unit/aiperf_mock_server/test_request_recorder.py (1)
494-501: ⚡ Quick winAdd
# fmt: skipon the new@pytest.mark.parametrizeblock.Please keep this decorator aligned with the repo’s test parametrize formatting rule.
Suggested update
`@pytest.mark.parametrize`( "prompt,expected_ids", [ ([11, 22, 33], [11, 22, 33]), ([[11, 22], [33, 44]], [11, 22, 33, 44]), ], - ) + ) # fmt: skipAs per coding guidelines
tests/**/*.py: “Usefrom pytest import paramand put# fmt: skipon the)line for parametrized tests”.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/aiperf_mock_server/test_request_recorder.py` around lines 494 - 501, The new parametrized test decorator `@pytest.mark.parametrize` on test_record_request_completion_prompt_token_ids_skip_tokenizer must follow test formatting rules: import param from pytest and place a "# fmt: skip" comment on the closing ")" line of the `@pytest.mark.parametrize` block so the formatter leaves the param list as authored; update the decorator usage to use pytest.param entries if needed and add the "# fmt: skip" directly after the closing parenthesis of the decorator.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/aiperf_mock_server/request_recorder.py`:
- Around line 833-845: Update the stale docstring header that currently reads
"Return the 7-line stdout block for one endpoint's vocab_distribution." to
reflect the current output length (change "7-line" to "9-line"), and adjust any
surrounding wording if needed so the docstring accurately describes the
nine-line layout (leave the rest of the layout lines as-is); locate the
docstring by searching for the exact string "Return the 7-line stdout block for
one endpoint's vocab_distribution." in request_recorder.py.
---
Nitpick comments:
In `@tests/unit/aiperf_mock_server/test_request_recorder.py`:
- Around line 494-501: The new parametrized test decorator
`@pytest.mark.parametrize` on
test_record_request_completion_prompt_token_ids_skip_tokenizer must follow test
formatting rules: import param from pytest and place a "# fmt: skip" comment on
the closing ")" line of the `@pytest.mark.parametrize` block so the formatter
leaves the param list as authored; update the decorator usage to use
pytest.param entries if needed and add the "# fmt: skip" directly after the
closing parenthesis of the decorator.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ae3933c7-1a96-4958-9096-14fa5587940a
📒 Files selected for processing (9)
tests/aiperf_mock_server/README.mdtests/aiperf_mock_server/models.pytests/aiperf_mock_server/request_recorder.pytests/aiperf_mock_server/tokens.pytests/integration/test_mock_server_record_requests.pytests/unit/aiperf_mock_server/test_request_recorder.pytests/unit/server/test_api_compliance.pytests/unit/server/test_models.pytests/unit/server/test_tokens.py
✅ Files skipped from review due to trivial changes (1)
- tests/aiperf_mock_server/README.md
15bb558 to
d0b8df0
Compare
FrankD412
left a comment
There was a problem hiding this comment.
Summary
Nice feature — the recorder is well-structured, the test coverage is generous, and the JSONL-plus-summary contract is a sensible escape hatch for validating ISL/OSL behavior at the wire. Three things to look at before merge, in order:
open(self.path, "wb")truncates the file on each restart. A user running two consecutive benchmarks against the same--record-requests PATHsilently loses the first run. Repro: 20 records → 0 records on restart with no warning. Switch to"ab"(append) or refuse to overwrite without an opt-in flag.- The recorder uses default I/O buffering, so SIGKILL / OOM loses the buffered tail and the entire
.summary.json. Repro:kill -9after 50 records → 48 on disk, no summary at all. Callself._file.flush()after each write. _encode_chat_prompt_idshas atry/except/elsebug: the TypeError-retry's tokens are silently discarded. Falls through to the ChatML fallback. Untested path; harmless today because modern HF doesn't take the TypeError branch, but the retry is effectively dead code.
Also a one-character README fix (Qwen/Qwen3-0.6B → builtin on the --tokenizer default row, which this PR touches anyway).
Working well: per-endpoint stats blocks, the histogram + vocab-distribution rendering (the sparkline is a delight), the tokenization_mode tag on every JSONL row, the validator that rejects --no-tokenizer + --record-requests at parse time, and the integration test that locks down the JSONL schema. Tests pass and the design is right.
… capture Adds an opt-in recording mode to the mock server so we can verify what ISL / requested-OSL distribution actually hits the wire during an aiperf run. When --record-requests PATH is set, the server tokenizes each incoming request inline with the configured tokenizer, appends one JSONL record per request, and writes a per-endpoint distribution summary to <PATH>.summary.json (and stdout) on shutdown. Each record captures the full OSL fingerprint the client sent — raw max_tokens, max_completion_tokens, min_tokens, ignore_eos, reasoning_effort — alongside the resolved requested_osl and the real tokenized isl. The summary aggregates isl / requested_osl / min_tokens distributions and tallies ignore_eos + reasoning_effort categorically. Design choices: - In-process and synchronous in make_ctx. tokenizer.encode() is fast enough that a worker process would be overkill. - Reuses the existing --tokenizer field rather than introducing a separate one, so ISL counts match the vocab the corpus uses. - Config validator forces --workers=1 and rejects --no-tokenizer (incoherent with recording). - conftest.create_server now lets tests opt out of the hard-coded no_tokenizer=True default. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Adds the design for extending the per-endpoint summary with a histogram and unique_values count on the isl and requested_osl stats blocks. Distribution shape rather than just percentiles — makes it possible to tell a uniform sample from a bimodal one without staring at p10/p50/p90 numbers. Bucketing rule: num_bins = max(10, ceil((max-min)/100)). 10-bin floor keeps narrow OSL ranges informative; 100-token max width caps the width on wide ISL ranges. Equal-width bins, observed [min, max]. Output in both <path>.summary.json (parallel bin_edges/counts arrays, pandas-friendly) and stdout (20-char horizontal bar chart). Tests extend the existing recorder integration test. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
… stats Seven TDD-style tasks to extend the per-endpoint summary block with a histogram and unique_values count on the isl and requested_osl fields. Each task has explicit code, exact pytest invocations, expected output, and a commit step. Tasks: 1. _histogram helper + unit tests 2. wire histogram + unique_values into _build_summary 3. _render_histogram helper 4. wire renderer into _print_summary 5. extend integration test 6. refresh README 7. pre-merge lint/tests/pre-commit Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
ISL and OSL stats now print on their own indented rows under the endpoint header, with `:7.1f` / `:5.0f` widths and a Unicode rule. The single-line `ISL mean=... OSL mean=...` cram-on-one-line layout drifted out of alignment when token counts crossed 5 digits. Foundation for the histogram block coming next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Three small follow-ups from the final code review of the histogram + unique_values feature: - _render_histogram: tighten `hist` annotation to `dict[str, list[float] | list[int]]` so it matches what `_histogram` actually returns. - _print_summary: pass `sum(histogram["counts"])` as the histogram's `n=`, not `stats["count"]`. Matters when OSL observations are fewer than ISL (e.g. requests that omit max_tokens). - Test the non-round-span path of `_histogram` (range 0..1001, 11 bins) so the last-edge pinning is exercised when the divisor isn't clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Four cosmetic divergences in the Summary stdout sample picked up by the final review: - Separator: 86 → 46 chars to match `_print_summary`'s rule width. - Endpoint header: drop `n= 4000` padding; code emits `n=4000`. - min_tokens row: use `mean ` / `p50 ` (no `=`) per current format. - min_tokens / ignore_eos / reasoning_effort: 6-space → 4-space indent. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Adds per-endpoint `vocab_distribution` block to the recorder summary that surfaces how the dataset samples across the tokenizer's vocab: coverage (unique IDs / vocab size), top-N concentration, Shannon entropy with the uniform-sampling ceiling, an 80-bucket sparkline over full token-id space (log-y), and a full token_id → count frequency table in JSON. Stacked stdout layout with blank lines between blocks; spacing refresh applies to the existing ISL/OSL histogram rendering as well. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
…stats Nine TDD-style tasks. Each task has explicit code, exact pytest invocations, expected output, and a commit step. 1. Capture per-endpoint token-id Counter in RequestRecorder. 2. _compute_shape_80 helper. 3. _vocab_distribution helper (JSON block builder). 4. _render_vocab_lines helper (stdout block). 5. Wire vocab_distribution through _build_summary. 6. Wire renderer + blank-line spacing into _print_summary. 7. Extend integration test. 8. Refresh README. 9. Pre-merge verification. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Important fixes: - _build_summary now reaches `_resolve_vocab_size` when vocab_size_source is "observed" — previously the guard required vocab_size to be non-None, which left the observed-fallback path dead and silently emitted vocab_distribution=null for tokenizers without a vocab-size attribute. - _vocab_distribution now replaces non-printable decoded text with the `<id=N>` marker — previously only decode-raises hit the fallback, so control chars / null bytes leaked into top_tokens and stdout. - Added end-to-end test exercising the observed path through open()/record()/close() for tokenizers with no vocab-size signal. Minor cleanups also from review: - round max_entropy_bits to 4 decimals for symmetry with entropy_bits. - replace the decode_fn lambda with a named inner function (no noqa). - tighten test_blank_lines_between_blocks to verify blank-line placement rather than just "at least one blank somewhere". - fix incorrect comment in test_log_y_makes_small_bars_visible. - tighten integration test's emb_vd assertion (vocab block is always non-null for endpoints with recorded ISL). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Open the recorder file in append mode so a second run against the same --record-requests path accumulates instead of silently truncating prior data on lifespan startup, and flush after every record write so the on-disk file matches in-process state — abnormal exit (SIGKILL, OOM) now loses only the in-flight record instead of the buffered tail and the .summary.json that close() would have written. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
`_encode_chat_prompt_ids` wrapped both apply_chat_template calls in one try/except/else, but Python's `else` only runs when the try body itself completed without exception — so when the positional call raised TypeError and the `conversation=` kwarg retry succeeded, the retry's tokens were bound to `result` and then immediately discarded, and the function fell through to the ChatML fallback. Extract the call into a small helper that returns the result outside any try/except, so the retry's tokens actually reach the caller. Add a unit-test stub whose apply_chat_template is keyword-only to lock down the retry branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
The Configuration table claimed Qwen/Qwen3-0.6B; the actual default in config.py is "builtin" (bundled tiktoken o200k_base, zero network access). Also spell out what "builtin" means inline so readers don't have to chase the field description. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
When the configured tokenizer has no chat template (e.g. the builtin tiktoken o200k_base) the recorder falls back to literal ChatML rendering, and each role marker is tokenized as ~5 raw tokens instead of the single special-token id a real chat template would produce. For builtin + a single user message this is +21 tokens over the bare user content; for a 4-message conversation it is roughly +45. Without this note, anyone validating --isl-mean N against recorder output sees a systematic positive offset and may chase it as a bug in the client. The tokenization_mode field already flags which path was used; the note tells readers what to do about it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
The diagram block under "Auto-Scaling GPU Metrics" was an untagged fence, which made some Markdown renderers (incl. the Fern site linter) emit a "language specifier missing" warning. Tagging it `text` keeps the rendered output identical and silences the linter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Same fix as the Token Flow block — the directory-tree fence was untagged, tripping the docs linter. Pure rendering parity. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Append mode left a reused --record-requests path with old + new records on disk, but the per-process stats accumulators start empty each run, so <path>.summary.json described only the latest run. Two consumers of the same artifact pair would see different totals. Truncate per run so the JSONL and the summary are always consistent (reviewer dynamo-ops). Per-record flush is unchanged, so SIGKILL / OOM still only loses the in-flight record. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
The lifespan teardown ran `await shutdown_scheduler()` and `recorder.close()` inside the same `finally:` block, so any exception from the scheduler shutdown step (timeout, ZMQ socket error, etc.) bypassed the recorder cleanup. That meant `<path>.summary.json` was silently never written — the artifact the user enabled `--record-requests` for in the first place. Nest the recorder cleanup in its own inner `finally:` so it runs whether or not scheduler shutdown succeeds. Add a unit test that patches `shutdown_scheduler` to raise and asserts the summary file still ends up on disk; this previously failed by leaving the summary missing. Reported by reviewer dynamo-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
8315e6d to
746f6d7
Compare
debermudez
left a comment
There was a problem hiding this comment.
Tackle the 2 comments from dynamo ops and this is going in. Nice work!
…kend `_tokenizer_call_ids()` unwrapped the AIPerf `Tokenizer` and invoked the HF backend directly, which defaults to `add_special_tokens=True`. The wrapper's own `__call__` configures `add_special_tokens=False`, so recorded ISL counts for HF completion / TGI / embedding requests were inflated by the backend's BOS/EOS sentinels. Try the wrapper's `__call__` first and only fall back to unwrapping for objects that don't expose it. Add a regression test that distinguishes wrapper vs backend invocations and asserts the wrapper's 3-id result is recorded (rather than the backend's 5-id, special-token-padded result). Reported by reviewer dynamo-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
`TGIGenerateRequest` has no `stream` field, so `getattr(req, "stream", None)` returned `None` for the streaming TGI endpoint. The recorder treated that as falsy and never incremented `streamed_count`, leaving the per-endpoint summary blind to TGI streaming traffic. Derive `stream` from the endpoint when it's `/generate_stream`, fall back to the request attribute everywhere else. Add two unit tests: one that asserts `/generate_stream` records `stream=True`, one that pins `/generate` to the existing falsy behavior. Reported by reviewer dynamo-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
`init_scheduler(server_config)` runs after `recorder.open()` and `set_global_recorder(recorder)`, but lives outside the `try: yield ... finally:` cleanup block. So if `init_scheduler` raises during startup, the exception propagates out of `lifespan` before the cleanup `finally` is ever entered — the global recorder stays installed, the JSONL file handle leaks, and `<path>.summary.json` is never written. Wrap `init_scheduler` in its own `try/except`. On any startup failure, unregister the global recorder and call `close()` (which writes the summary) before re-raising. The existing yield/shutdown cleanup is untouched and owns the successful-startup path. Add a unit test symmetric to the existing shutdown-failure one: patch `init_scheduler` to raise, drive the lifespan, then assert that the summary file landed on disk and that `get_global_recorder()` is None. Reported by reviewer dynamo-ops. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Frank Di Natale <3429989+FrankD412@users.noreply.github.com>
Summary
--record-requests PATHmode to the mock server that tokenizes each incoming request inline (using the configured--tokenizer) and writes one JSONL record per request, plus a per-endpoint distribution summary on shutdown (<PATH>.summary.jsonand stdout).max_tokens,max_completion_tokens,min_tokens,ignore_eos,reasoning_effort— not just the resolvedrequested_osl.min/max/mean/stdev/p50/p90/p95/p99forisl,requested_osl, andmin_tokens, plusunique_valuesand an equal-width histogram (num_bins = max(10, ceil((max-min)/100))) onislandrequested_osl. Stdout rendering uses a 20-char horizontal bar chart.--record-requestsforces--workers=1(in-process state needs a single producer) and rejects--no-tokenizervia the pydantic validator (recording with no tokenizer is incoherent).Purpose: lets you verify what ISL / requested-OSL distribution actually hits the wire during an aiperf run — useful for validating
--random-range-ratio,--strict-isl, and other client-side ISL/OSL synthesis features against ground truth at the server.Design notes (spec, plan):
make_ctx; reuses the corpus tokenizer rather than introducing a separate one._extract_request_contenthelper for text + resolved OSL; adds_extract_osl_fingerprintfor the raw fields.Test plan
uv run pytest tests/unit/aiperf_mock_server -v(19 tests)uv run pytest tests/integration/test_mock_server_record_requests.py -m integration -v(3 tests)MockServerConfig(record_requests="/tmp/x", no_tokenizer=True)raisesValidationErrorwith message containing "requires a tokenizer"--no-tokenizer+--record-requestserrors at config parse time, not inside a request handler.🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Documentation
Tests