fix: the nightly by fixing test docs end to end test suite and staging locations#965
Conversation
Test-docs end-to-end:
- src/aiperf/config/dataset/resolver.py — _check_timing_data now reports
True for BURST_GPT_TRACE (matches the existing SAGEMAKER_DATA_CAPTURE
branch); the loader's _REQUIRED_COLUMNS guarantees a Timestamp column,
but the JSONL-only first-record probe always failed on CSV and blocked
--fixed-schedule. New regression test in tests/unit/config/test_resolvers.py.
- docs/tutorials/embeddings.md — gets its own vllm-embeddings-openai
server (setup + health-check + retagged run markers). The BGE
commands previously routed to the Qwen3-0.6B chat server and 404'd
on /v1/embeddings.
- tests/ci/test_docs_end_to_end/constants.py — AIPERF_COMMAND_TIMEOUT
300 → 600s; the long-conversations / goodput / aimo tutorials run
long on the AMD GPU runner and were SIGKILLed mid-flight.
AMD64 wheel-test:
- pyproject.toml — extract a [test] extra from [dev]; [dev] keeps
tooling-only deps and recursively consumes [test]. Zero impact on
regular pip install aiperf users (extras are opt-in).
- .github/workflows/nightly.yml — wheel-test step now installs
"${NIGHTLY}[test]" instead of a hand-typed list; eliminates the drift
that left jsonschema unlisted and broke collection of
tests/unit/config/test_config_schema_generator_integration.py.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Temporarily redirect the workflow_dispatch / checkout / SHA-verification gates from main to harrison/fix-nightly so the nightly-fix commit can be exercised end-to-end before merge. Each modified block is tagged `TEMPORARY` with a revert-before-merge note, and the dispatch-input description echoes the same warning. Note: GitHub reads the workflow file for `schedule:` triggers from the default branch only, so the cron will not fire on this branch — runs must be triggered via `gh workflow run nightly.yml --ref harrison/fix-nightly`. Revert this commit before merging the PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
- stage-nightly-wheel uploads to nightly/${TIMESTAMP}/${FN} so wheels are
grouped by date; TIMESTAMP and the wheel filename's embedded date come
from the same prepare-nightly output, so they cannot drift.
- POST /api/pypi/<repo>/reindex after the deploy so Artifactory's PyPI
simple index picks up the new wheel without manual recalc (generic
PUT deploys bypass the metadata extractor).
- Pass ARTIFACTORY_WHEEL_PATH, ARTIFACTORY_NIGHTLY_WHEEL_PATH, TIMESTAMP
to the GitLab security-scan trigger so downstream consumers get the
exact location instead of having to know the layout convention.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
vllm-default-openai owns 39 of the ~53 aiperf commands extracted from docs/tutorials/, and ran them sequentially in a single GPU job — ~57 min of the previous 73-min wall-clock. Split it into 4 contiguous shards plus one shard per modality-specific server (audio, vision, video, embeddings), 8 in total. Each shard runs on its own runner from the prod-aiperf-tester-amd-gpu-v1 pool (maxRunners=16, podAntiAffinity by host so each gets its own GPU and isolated localhost:8000), brings up its own vLLM, executes its slice, and tears down. Lives entirely in the called workflow (test-docs-end-to-end.yml), so PR / workflow_dispatch / nightly's workflow_call all get the matrix. - tests/ci/test_docs_end_to_end/main.py: new --server / --shard-index / --shard-total flags. Argparse rejects invalid combos. Contiguous-chunk slicing keeps at most one slow tutorial per shard (assuming the slow ones aren't clustered in the docs). - tests/ci/test_docs_end_to_end/setup_test.sh: forwards SERVER_NAME / SHARD_INDEX / SHARD_TOTAL env vars from the matrix. Unset env vars preserve the existing --all-servers behavior for local invocation. - .github/workflows/test-docs-end-to-end.yml: matrix of 8 shards, fail-fast: false. The aggregated job result that nightly.yml consumes is unchanged in shape (success iff all shards pass). Expected wall-clock: ~22-25 min total, down from 73 min. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
The previous nightly's vllm-default-openai-1of4 shard hit the 600s
timeout on test 5 (aimo public-dataset) — the same test ran in 225s
in the prior serial run. Under the 8-way parallel matrix, shared
infra (HuggingFace API, ECR pulls, cold-start network contention)
slowed the HF-dataset-heavy tests by ~2.7x. Two complementary fixes:
1. Bin-pack commands by author-annotated weight instead of by docs
order. Tag syntax now accepts an optional ``weight=<seconds>`` hint:
`<!-- aiperf-run-...-endpoint-server weight=500 -->`. Parser extracts
it (default 60s); the sharder runs LPT greedy bin-packing, sorting
commands heaviest-first and assigning each to the currently-lightest
shard. The three known-slow tests are tagged:
- docs/tutorials/aimo.md: weight=500
- docs/tutorials/goodput.md: weight=600
- docs/tutorials/multi-turn.md (long-conversations): weight=1100
At 4 shards on vllm-default-openai, this produces shard weights
[1100, 1080, 1100, 1080] — long-conv gets its own shard; aimo and
goodput each ride along with ~10 quick tests.
2. Raise AIPERF_COMMAND_TIMEOUT 600s -> 1200s as a safety net for
parallel-load variance. The bin-packer should prevent any one shard
from being over-weighted, but per-command runtime under contention
is bursty (HF rate-limits, ECR throttling) and we'd rather wait than
spuriously fail.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
GitHub is forcing Actions runners to Node 24 starting June 2nd, 2026 and removing Node 20 on September 16th, 2026. Bump every action that GitHub flagged in the deprecation annotations to the version that already targets Node 24. The same versions are already in use in run-unit-tests.yml, so this is just bringing the rest of the workflows into the same baseline. - actions/checkout @v4 -> @v6 - actions/setup-python @v4/@v5 -> @v6 - actions/upload-artifact @v4 -> @v5 - actions/download-artifact @v4 -> @v5 - actions/setup-node @v4 -> @v5 (fern-docs / run-fern-checks) - astral-sh/setup-uv @v5 -> @v7 (nightly only) Other actions (github-script, codecov, pre-commit, docker/setup-buildx, ytanikin/pr-conventional-commits) weren't on the deprecation list and are left as-is. Also: gate prepare-nightly's "Restore version artifact from prior attempt" step on github.run_attempt > 1. The step is intentionally written with continue-on-error: true so a missing artifact is fine on fresh runs, but the action still emits a "Artifact not found" warning annotation every time. Skipping on first attempts removes the noise; continue-on-error stays as belt-and-suspenders for the case where a prior attempt's artifact expired before re-run. Files touched: .github/workflows/{nightly,pre-commit,test-docs-end-to-end, fern-docs,run-fern-checks,run-integration-tests}.yml Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
…ft report The first weighted-sharding nightly (run 26095921607) finished with shard wall-clocks ranging from 12m to 40m+. Two distinct problems: 1. Declared weights drifted from reality. long-conv was tagged 1100s but ran 343s; goodput was tagged 600s but ran 243s; meanwhile blazedit_10k ran 700s with no weight at all, and burst-gpt-trace ran 544s with no weight. Shard 4 picked up the two heaviest unweighted tests and blew past the 10-min per-shard estimate by ~30 min. 2. fixed-schedule.md:122 failed in 1s when it landed in a different shard than fixed-schedule.md:58 — the second block uses precise_schedule.jsonl which the first block creates via cat-heredoc. Sequential test runs accidentally work because the same container working dir persists; sharding broke that assumption. Changes: - Rebalance weights from observed runtimes (observed × 1.3, rounded to 50): - blazedit.md:107 (blazedit_10k): default -> weight=900 - burst-gpt-trace.md:63: default -> weight=700 - multi-turn.md:288 (long-conv): weight=1100 -> weight=500 - multi-turn.md:257 (conv 50): default -> weight=350 - aimo.md:53: weight=500 -> weight=350 - goodput.md:43: weight=600 -> weight=350 - custom-dataset.md:337 (random_pool): default -> weight=250 - multi-turn.md:199 (conv 100): default -> weight=250 - sharegpt.md:38: default -> weight=200 - synthetic-dataset.md:65: default -> weight=200 - blazedit.md:54 (blazedit_5k): default -> weight=150 - default weight bump: 60s -> 80s (mean of observed unweighted is ~95s) - Make fixed-schedule.md:122 self-contained by inlining the precise_schedule.jsonl cat-heredoc, so the time-window example runs in any shard regardless of whether the basic example also landed there. - Fix LPT non-determinism in tests/ci/test_docs_end_to_end/main.py: Path.rglob returns filesystem-dependent order (macOS vs Linux differ), so the same code assigned commands to different shards locally vs in CI. Add (file_path, start_line) as a stable secondary sort key after the weight-descending primary key. - Add tools/suggest_test_docs_weights.py and a non-blocking suggest-shard-weights nightly job. The tool parses test-docs matrix logs, matches each test back to its markdown bash block, and writes a markdown table of suggested weight= updates to the GitHub Actions step summary. Purely informational — flagged by ratio>1.5 with abs delta >60s, by default-weighted tests observed >120s, and by tests over-weighted by 2x or more. Continues on error so it never blocks the pipeline. Predicted 4-shard balance after rebalance: loads [1660, 1580, 1610, 1590], spread 80s (down from ~1600s spread in run 26095921607). Expected wall-clock per shard: ~34 min (boot ~7 min + work ~27 min), floor pinned by the 900s blazedit_10k test. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
The file has a `#!/usr/bin/env python3` shebang but git didn't track the +x bit when the file was first added in dfc9082 (it was created via the editor tool, which doesn't carry exec mode). pre-commit's check-shebang-scripts-are-executable hook flagged this on the 2026-05-19 nightly run. Fix: `git update-index --chmod=+x` to flip the bit in the git index. Working-tree mode was already 0755 locally; this propagates that to git so the hook is satisfied across fresh clones too. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
The suggest-shard-weights job already writes its markdown to $GITHUB_STEP_SUMMARY, but step summaries only render on the run's top-level page, not inside the per-job log view. Add two more output paths so the report is discoverable no matter where someone clicks: 1. Echo the report inside a collapsible ::group:: in the GHA log, so it's right there on the job-detail page. 2. Upload it as a `suggest-shard-weights` artifact with 30-day retention, so it can be downloaded via `gh run download` long after the run scrolls off the recent-runs list. Same step summary content as before; the file is written to suggest-weights.md in the workspace and then routed three ways. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Today every nightly run produces the same WHEEL_VERSION,
CONTAINER_TAG, and staging subpath regardless of trigger.
Same-day same-commit re-runs (manual or otherwise) clobber each
other in S3, Artifactory, ECR, and NGC — already documented as
"last-writer-wins" in the existing comments.
Disambiguate workflow_dispatch runs end-to-end:
1. PEP 440 local-version suffix on WHEEL_VERSION:
0.8.0.dev<YYYYMMDD>+manual.<run_id>
- Wheels are uniquely addressable per manual dispatch.
- `pip install aiperf-nightly` keeps resolving the canonical
scheduled wheel by default (non-local versions sort above
identical base+local in PEP 440).
- `pip install aiperf==<base>.dev<date>+manual.<run_id>`
pulls a specific manual build.
- .github/actions/update-pyproject-version already accepts
local-version segments, no infra change needed there.
2. New STAGE_SUBPATH output from prepare-nightly:
nightly/<YYYYMMDD> for scheduled runs
nightly-manual/<YYYYMMDD>/<run_id> for manual dispatches
Consumed by stage-nightly-wheel's Artifactory TARGET_URL and
by trigger-gitlab-security-scan's ARTIFACTORY_WHEEL_PATH /
ARTIFACTORY_NIGHTLY_WHEEL_PATH forwarded variables.
S3 upload stays keyed on WHEEL_VERSION so the upload/download
symmetry holds for free — the local-version suffix already
distinguishes the paths.
3. CONTAINER_TAG branch in prepare-nightly:
nightly-<YYYYMMDD>-<sha7> for scheduled
nightly-manual-<YYYYMMDD>-<sha7>-<run_id> for manual
No downstream edits needed; build-container, the multi-arch
manifest job, stage-nightly-container, and the GitLab trigger
all consume CONTAINER_TAG as-is.
4. NGC staging destination repo renames aiperf -> aiperf-nightly
in stage-nightly-container (5 occurrences in the crane copy +
index-append + delete lines). Mirrors the PyPI naming scheme
(`aiperf` canonical dist / `aiperf-nightly` nightly-flavored
dist). ECR source side stays `aiperf` — purely an NGC-side
rename. Operational pre-req: the `aiperf-nightly` NGC repo
must exist with publish-service-account write access before
the next staged dispatch.
Also: remove the explicit Artifactory PyPI /reindex POST that was
returning 403 with the current ARTIFACTORY_USER. The user has
Deploy permission but not Manage Repository; widening that grant
is tracked separately. Until then, simple-index visibility for
freshly-uploaded wheels relies on the repo's auto-reindex
mechanism or an out-of-band UI recalc.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Default for manual workflow_dispatch runs is unchanged: artifacts get the +manual.<run_id> wheel suffix, nightly-manual/<date>/<run_id>/ staging path, and nightly-manual-<date>-<sha>-<run_id> container tag (everything we just added in b7fabee). Adds an opt-out: setting `manual_artifacts=false` on a manual dispatch falls back to the bare scheduled-nightly naming (no local-version suffix, plain `nightly/<date>/` path, plain `nightly-<date>-<sha>` container tag). Useful for emergency re-publishes where a manual run is *meant* to overwrite the scheduled artifact locations — e.g. re-running today's nightly because the cron-triggered build failed. Schedule events ignore the input entirely; the GITHUB_EVENT_NAME check short-circuits before MANUAL_ARTIFACTS is consulted. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Use Artifactory's matrix-parameter syntax on the PUT URL to attach two properties to every wheel staged by stage-nightly-wheel: github.run.url = https://github.com/<org>/<repo>/actions/runs/<run_id> git.commit.sha = <full commit SHA the wheel was built from> Both properties are visible in the Artifactory UI under each wheel's Properties tab, included in Xray scan metadata, and queryable via the AQL API ("show me every wheel built from SHA X" or "show me every wheel from yesterday's nightly run"). URL value is URL-encoded with python3 (always available on the builder runner) so the ':' and '/' characters in the GitHub Actions URL don't confuse Artifactory's matrix-parameter parser. Path URL (without properties) is still what gets logged in the 'Wheel staged to Artifactory:' line for readability — the properties appear as a separate log line below. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
yarl is transitive via aiohttp, but the Dockerfile's env-builder runs a cold `uv sync` (no --frozen, no uv.lock COPY) so resolution is fresh against PyPI on every build. That makes us vulnerable to the few-hours window after upstream pushes a new release with incomplete wheel coverage. Specifically today: yarl 1.24.0 uploaded at 17:23 UTC with only cp310 wheels — no cp311/12/13, no sdist. Our env-builder runs on python:3.13-slim-bookworm, so uv picked 1.24.0 (the new "latest" per PyPI metadata) and bailed with "doesn't have a source distribution or wheel for the current platform", taking out every test-docs-end-to-end shard for the run. Direct constraint on a transitive dep is unusual but legit — the resolver honors it during `uv sync` regardless of who pulls yarl in. `<2.0` keeps us on the current major; the comment in pyproject.toml flags it as removable once upstream's release pipeline becomes reliable about uploading all platform wheels atomically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
The prepare-nightly job's version-info file changes extension from .env to .txt to match what it actually is — a human-readable summary of the resolved version strings for the run. The format stays as shell-source-able KEY='value' lines so the same file restores correctly on re-attempts. Two visibility wins: 1. The contents are echoed into a collapsible "Resolved version info" group in the workflow log right after they're computed. No need to download the artifact (or click into a sub-job) to see what WHEEL_VERSION / STAGE_SUBPATH / CONTAINER_TAG resolved to. 2. The archived artifact (`nightly-version`) now contains version-info.txt — same file, more honest extension. Downstream tools that grep / `cat` the artifact don't care about the extension; tools that `source` it still work because the format didn't change. Migration safety: the restore step probes for version-info.txt first, then falls back to version-info.env. Re-attempts of runs started before this change still find their snapshot. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@15e3fb41cb1d0226eb5f355e5e6da439afd7f7edRecommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@15e3fb41cb1d0226eb5f355e5e6da439afd7f7edLast updated for commit: |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThis PR adds weight-based end-to-end test sharding and a weight-suggestion tool, annotates tutorials with weight markers, updates GitHub Action versions across workflows, and enhances the nightly workflow with manual-artifact naming, stage_subpath outputs, provenance metadata, and staged wheel/container changes. ChangesEnd-to-End Test Sharding and Weight System
Weight Suggestion Analysis Tool
Tutorial Documentation Weight Annotations
CI Action Updates and Nightly Workflow Enhancements
Configuration and Test Infrastructure
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
.github/workflows/nightly.yml (1)
1042-1084: 💤 Low valueConsider adding
persist-credentials: falsefor defense in depth.The job design is solid: non-blocking with
continue-on-error: true, good visibility via log group + step summary + artifact, and correctifcondition to skip when the upstream job is skipped.Static analysis flags the checkout step (Line 1056) for missing
persist-credentials: false. While this job only reads data and poses low risk, adding the flag is a low-effort hardening:🛡️ Suggested hardening
- uses: actions/checkout@v6 with: ref: ${{ needs.prepare-nightly.outputs.resolved_sha }} + persist-credentials: false🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/nightly.yml around lines 1042 - 1084, The actions/checkout step in the suggest-shard-weights job should include persist-credentials: false to avoid leaving repository credentials in the runner workspace; update the checkout step that uses actions/checkout@v6 (inside the suggest-shard-weights job) to add with: persist-credentials: false while keeping the existing ref: ${{ needs.prepare-nightly.outputs.resolved_sha }} entry so the job still checks out the correct commit.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/nightly.yml:
- Around line 93-122: Revert the three TEMPORARY branch-gate changes so the
workflow targets main: change the "Require dispatch from harrison/fix-nightly"
step's conditional/message to require/main reference to main, update the
"Checkout" step so its ref uses inputs.sha || 'main' instead of
'harrison/fix-nightly', and update the "Verify SHA is on harrison/fix-nightly"
step to verify against origin/main (and adjust its step name/message
accordingly); ensure the semantics (rejecting dispatches not on the target
branch and verifying merge-base) remain the same after switching to main.
In @.github/workflows/test-docs-end-to-end.yml:
- Line 39: Replace the mutable tag usages of the checkout action with pinned
immutable commit SHAs: locate every occurrence of "uses: actions/checkout@v6"
(the three instances referenced in the diff) and replace each with the
corresponding full commit SHA form "uses: actions/checkout@<full-commit-sha>";
ensure you use the exact commit SHA for the version you intend to pin and update
all three occurrences consistently so the workflow satisfies the immutable-SHA
policy.
- Around line 39-42: Update both uses of the actions/checkout@v6 steps (the
checkout that performs read-only git operations and the subsequent checkout
before tests) to explicitly set persist-credentials: false; locate the two
occurrences of "uses: actions/checkout@v6" in the workflow and add the key
"persist-credentials: false" under their "with:" blocks so credentials are not
persisted to the workspace.
In `@tests/unit/config/test_resolvers.py`:
- Line 499: Rename the test function test_burst_gpt_csv_reports_timing_data to
follow the repo convention test_<function>_<scenario>_<expected> and add an
explicit return type annotation -> None; for example rename
test_burst_gpt_csv_reports_timing_data to
test_burst_gpt_csv_reports_timing_data_records_timing and change its signature
to include -> None so the function definition
(test_burst_gpt_csv_reports_timing_data / new name) matches the test naming and
type-hinting guidelines.
In `@tools/suggest_test_docs_weights.py`:
- Around line 239-247: The current mapping by_bash built with
{_canonicalize_bash(c.command): c for c in commands} loses duplicates because
later commands overwrite earlier ones, causing ambiguous bash-to-command
matches; update the logic that builds by_bash to collect all Command entries per
canonicalized bash (e.g., map string -> list[Command] or detect duplicates) and
then update the matching loop over runs to handle multiple candidates: when
by_bash returns multiple Commands for r.bash_content, disambiguate using
additional metadata (file, line, tutorial id) or skip/record as ambiguous
instead of silently overwriting, ensuring Suggestion creation uses the correct
Command instance(s) and increments matched/unmatched appropriately; refer to
_canonicalize_bash, by_bash, runs, commands, and Suggestion to locate where to
change the collection and matching logic.
---
Nitpick comments:
In @.github/workflows/nightly.yml:
- Around line 1042-1084: The actions/checkout step in the suggest-shard-weights
job should include persist-credentials: false to avoid leaving repository
credentials in the runner workspace; update the checkout step that uses
actions/checkout@v6 (inside the suggest-shard-weights job) to add with:
persist-credentials: false while keeping the existing ref: ${{
needs.prepare-nightly.outputs.resolved_sha }} entry so the job still checks out
the correct commit.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: f981486f-3d02-4e30-a4f8-3dd03eaa4f89
📒 Files selected for processing (25)
.github/workflows/fern-docs.yml.github/workflows/nightly.yml.github/workflows/pre-commit.yml.github/workflows/run-fern-checks.yml.github/workflows/run-integration-tests.yml.github/workflows/test-docs-end-to-end.ymldocs/tutorials/aimo.mddocs/tutorials/blazedit.mddocs/tutorials/burst-gpt-trace.mddocs/tutorials/custom-dataset.mddocs/tutorials/embeddings.mddocs/tutorials/fixed-schedule.mddocs/tutorials/goodput.mddocs/tutorials/multi-turn.mddocs/tutorials/sharegpt.mddocs/tutorials/synthetic-dataset.mdpyproject.tomlsrc/aiperf/config/dataset/resolver.pytests/ci/test_docs_end_to_end/constants.pytests/ci/test_docs_end_to_end/data_types.pytests/ci/test_docs_end_to_end/main.pytests/ci/test_docs_end_to_end/parser.pytests/ci/test_docs_end_to_end/setup_test.shtests/unit/config/test_resolvers.pytools/suggest_test_docs_weights.py
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
The 5600262 commit on this branch pointed nightly.yml's dispatch gates at harrison/fix-nightly so we could exercise the full pipeline end-to-end against the in-flight branch. Now that the PR is passing CI and getting merged, restore the gates to main: - workflow_dispatch input description for `sha` says "Must be on main" again. - "Require dispatch from main" check rejects dispatches from any other ref. - Checkout default ref goes back to 'main'. - "Verify SHA is on main" runs merge-base against origin/main. Net effect after merge: scheduled cron continues to build tip of main; manual dispatches must come from main and pass an ancestor check against origin/main. Behavior is identical to pre-56002626 baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
1. tools/suggest_test_docs_weights.py — by_bash dedup
The previous {_canonicalize_bash(c.command): c for c in commands}
silently dropped all-but-one of any markdown commands sharing
canonicalized bash. No tutorials collide today but a future
duplicate would be invisible. Switch by_bash to dict[str,
list[Command]] (defaultdict + append), log an INFO line naming
every colliding markdown location at collection time, and emit
one Suggestion per matched Command in the loop. Matched/unmatched
counts still increment per-run (one TestRun = one or zero
matches), but the suggestion list can carry > one row per run
when bash collides — appropriate because all matched markdown
locations need the same weight update.
2. tests/ci/test_docs_end_to_end/dump_config_hash.py — include weight
The detect-changes hash skipped the GPU e2e workflow whenever a
PR touched only weight= annotations, even though shard
assignment depends on weight. Hash each aiperf_command as
[c.command, c.weight] so weight-only edits invalidate the hash
and trigger the e2e job. Verified locally: flipping
weight=350 -> weight=351 on one tutorial changes the hash.
3. docs/tutorials/embeddings.md — combine cat + aiperf into one block
The previous structure put the aiperf-run tag on a `cat > inputs.jsonl`
block followed by a separate ungtagged bash block with `aiperf
profile`. The parser only grabs the FIRST bash block after a tag,
so CI was running the cat (a no-op) and reporting success without
ever exercising BGE. Merge into one tagged block so the runner
actually invokes aiperf profile against the BGE embedding
endpoint. Default weight (80s) is sufficient: the companion
synthetic-BGE test in the same file is also default-weighted and
the whole vllm-embeddings-openai shard finishes in 7m02s
end-to-end; this custom-input variant uses --request-count 5 so
it's strictly smaller than its sibling.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tools/suggest_test_docs_weights.py (1)
259-305:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAggregate duplicate-bash runs before emitting suggestions.
When the same canonicalized bash appears in multiple markdown blocks and both blocks run, Lines 259-305 emit suggestions for every
(run, command)pair. That turns two real executions into four rows and can give the same file conflicting suggested weights if the runtimes differ. Collapse runs to one observed value perbash_contentbefore iteratingcmds(for example, keep the max observed runtime), then derivematchedfrom that deduped set so the summary stays consistent.Proposed fix
- matched = 0 unmatched = 0 - suggestions: list[Suggestion] = [] + observed_by_bash: dict[str, TestRun] = {} for r in runs: cmds = by_bash.get(r.bash_content, []) if not cmds: unmatched += 1 log.debug("No markdown match for test %d in %s", r.test_index, r.job_name) continue - matched += 1 if r.status == "failed": # A 1-second-fail test gives us no useful timing; skip. continue + prev = observed_by_bash.get(r.bash_content) + if prev is None or r.actual_seconds > prev.actual_seconds: + observed_by_bash[r.bash_content] = r - # When several markdown blocks share canonicalized bash, the run's - # runtime applies to all of them — they need the same weight, so - # emit one Suggestion per Command pointing at each markdown location. - for cmd in cmds: + matched = sum(len(by_bash[bash]) for bash in observed_by_bash) + suggestions: list[Suggestion] = [] + for bash_content, r in observed_by_bash.items(): + # Emit one Suggestion per markdown location using the selected runtime + # for this canonicalized bash block. + for cmd in by_bash[bash_content]: current = cmd.weight observed = r.actual_seconds suggested = _round_to(int(observed * SAFETY_MARGIN))🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tools/suggest_test_docs_weights.py` around lines 259 - 305, The current loop emits a Suggestion for every (run r, Command cmd) pair causing duplicate suggestions when the same r.bash_content appears in multiple runs; instead, aggregate runs by bash_content (the by_bash key) before emitting suggestions: build a map from bash_content to a single observed_seconds (e.g., the max of r.actual_seconds across runs with that bash_content) and a representative run metadata (job_id, job_name, passed_line_index) and compute matched/unmatched from that deduped set, then iterate cmds = by_bash[bash_content] and use the aggregated observed_seconds (and the representative run for anchor) when computing observed, suggested, ratio, and appending Suggestion objects (keeping references to DEFAULT_WEIGHT, SAFETY_MARGIN, default_threshold, threshold, suggestions, Suggestion, cmd.start_line, cmd.file_path as in the diff).
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Outside diff comments:
In `@tools/suggest_test_docs_weights.py`:
- Around line 259-305: The current loop emits a Suggestion for every (run r,
Command cmd) pair causing duplicate suggestions when the same r.bash_content
appears in multiple runs; instead, aggregate runs by bash_content (the by_bash
key) before emitting suggestions: build a map from bash_content to a single
observed_seconds (e.g., the max of r.actual_seconds across runs with that
bash_content) and a representative run metadata (job_id, job_name,
passed_line_index) and compute matched/unmatched from that deduped set, then
iterate cmds = by_bash[bash_content] and use the aggregated observed_seconds
(and the representative run for anchor) when computing observed, suggested,
ratio, and appending Suggestion objects (keeping references to DEFAULT_WEIGHT,
SAFETY_MARGIN, default_threshold, threshold, suggestions, Suggestion,
cmd.start_line, cmd.file_path as in the diff).
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0cddbf57-57ea-4226-b823-865a8feaca06
📒 Files selected for processing (3)
docs/tutorials/embeddings.mdtests/ci/test_docs_end_to_end/dump_config_hash.pytools/suggest_test_docs_weights.py
…hots
The compute step writes STAGE_SUBPATH on $GITHUB_OUTPUT (line 220),
but legacy version-info.env files — from before the
scheduled/manual split — don't define it. Under `set -u` (the step
uses `set -euo pipefail`) the unbound-variable expansion aborts the
job before any other output is emitted, breaking re-attempts of
nightlies started before the rename.
Fix: after sourcing a snapshot, check whether STAGE_SUBPATH is set
and default it to `nightly/${TIMESTAMP}` if not. All legacy
artifacts physically landed under that bare path (the split didn't
exist yet), so the backfill matches what's actually in S3 /
Artifactory / NGC.
Verified locally by sourcing a synthesized legacy env file under
`set -euo pipefail` — the new conditional populates STAGE_SUBPATH
and the step proceeds normally.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
The .env -> .txt rename added an elif fallback and a STAGE_SUBPATH backfill so a "legacy" snapshot wouldn't crash the restore step. That code is unreachable in practice: GitHub Actions reruns always execute the same workflow file the original run was triggered with, so an artifact written by this version of the code is always read back by this same version. No mixed-extension or missing-STAGE_SUBPATH scenario can actually occur in CI. Strip both pieces of defense before they get copy-pasted into something else. Restore is back to a single `if -f .txt` branch with no fallbacks; the file format is still shell-`source`-able so sourcing works identically across the extension change. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Two failed nightlies in 24 hours from upstream partial releases:
yarl 1.24.0 (only cp310 wheels, no sdist) on 2026-05-19, then
greenlet 3.5.1 (no linux/aarch64 wheel, no sdist) on 2026-05-20.
Both broke env-builder's cold `uv sync` because the resolver picked
the new "latest" version, found no compatible distribution for our
build platform, and bailed. Range pinning each offender as it
appears is whack-a-mole; the next binary-extension transitive that
ships a partial release does it again.
Systemic fix is what uv itself suggested in the greenlet error
message:
hint: consider adding "sys_platform == 'linux' ..." to
tool.uv.required-environments
`required-environments` makes the resolver refuse any version that
lacks a usable distribution (wheel OR sdist) on every listed
platform, and silently backtracks until it finds one that does.
Forward-compatible: as upstream completes their uploads, the
resolver picks the newer version up automatically.
Changes:
- pyproject.toml: add [tool.uv] required-environments covering
Linux x86_64 / aarch64 (the build-container matrix) plus
macOS arm64 (developer laptops). crick remains sdist-only on
Linux aarch64 and passes the constraint via its sdist (env-builder
compiles it from source today, unchanged behavior).
- pyproject.toml: drop the now-redundant `yarl>=1.23,<2.0` direct
pin + its multi-line comment. required-environments handles it.
- .github/workflows/nightly.yml: new `resolve-check` job on
prod-aiperf-default-v1 after prepare-nightly that runs the
resolver only (no install) and gates build-container on success.
Fail-to-feedback drops from ~10 min (env-builder docker stage)
to ~1 min.
- .github/workflows/resolve-check.yml: new dedicated workflow that
runs the same check on pushes to main / pull-request/* / release/*
branches so PR authors see partial-release failures fast and
explicitly instead of buried in `make first-time-setup`.
- README.md: surface the pre-existing crick / aarch64 install
prereq (system C compiler needed) before users discover it via
cryptic setuptools errors.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
.github/workflows/nightly.yml (1)
1134-1134:⚠️ Potential issue | 🟡 Minor | ⚡ Quick winUse resolved SHA in summary instead of
github.sha.For manual dispatch builds,
${{ github.sha }}can differ from the built commit (inputs.sha). Use${{ needs.prepare-nightly.outputs.resolved_sha }}to keep provenance accurate.Suggested patch
- echo "| Commit | \`${{ github.sha }}\` |" + echo "| Commit | \`${{ needs.prepare-nightly.outputs.resolved_sha }}\` |"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/nightly.yml at line 1134, The summary currently prints the built commit using `${{ github.sha }}` which can be incorrect for manual dispatch; update the echo that writes the Commit line so it uses the resolved SHA output from the prepare job i.e. replace `${{ github.sha }}` with `${{ needs.prepare-nightly.outputs.resolved_sha }}` (refer to the echo statement that prints "| Commit | \`${{ github.sha }}\` |" in the nightly.yml workflow) to ensure the summary records the actual built commit provenance.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/nightly.yml:
- Around line 368-373: Update the resolve-check step to harden external actions:
replace floating tags with pinned commit SHAs for actions/checkout,
astral-sh/setup-uv, and actions/setup-python (e.g., use actions/checkout@<sha>,
astral-sh/setup-uv@<sha>, actions/setup-python@<sha>) and add
persist-credentials: false to the actions/checkout step so the checkout does not
leave a token in the runner; ensure you update the existing lines referencing
actions/checkout@v6, astral-sh/setup-uv@v7, and actions/setup-python@v6
accordingly.
In @.github/workflows/resolve-check.yml:
- Around line 36-38: Update the actions/checkout step that uses "uses:
actions/checkout@v6" to explicitly disable credential persistence by adding
"persist-credentials: false" to that checkout step (unless a later workflow step
requires authenticated git operations); this limits exposure of the GITHUB_TOKEN
while leaving the other pinned actions (astral-sh/setup-uv@v7 and
actions/setup-python@v6) unchanged.
---
Outside diff comments:
In @.github/workflows/nightly.yml:
- Line 1134: The summary currently prints the built commit using `${{ github.sha
}}` which can be incorrect for manual dispatch; update the echo that writes the
Commit line so it uses the resolved SHA output from the prepare job i.e. replace
`${{ github.sha }}` with `${{ needs.prepare-nightly.outputs.resolved_sha }}`
(refer to the echo statement that prints "| Commit | \`${{ github.sha }}\` |" in
the nightly.yml workflow) to ensure the summary records the actual built commit
provenance.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: a6957663-0797-454f-b541-f6315c871c9a
📒 Files selected for processing (4)
.github/workflows/nightly.yml.github/workflows/resolve-check.ymlREADME.mdpyproject.toml
Add persist-credentials: false to both new resolve-check checkout steps (the dedicated PR-trigger workflow and the nightly job). Neither step does any authenticated git operations after the checkout, so the default credential persistence just leaves the GITHUB_TOKEN in the runner's .git/config for the rest of the job unnecessarily. CodeRabbit's other suggestion — pin the three actions (actions/checkout, astral-sh/setup-uv, actions/setup-python) to commit SHAs instead of major-version tags — is intentionally not applied here. Every other workflow in .github/workflows/ uses major-version tags, and CodeRabbit's first comment even noted "The version-pinned actions follow the repository's established pattern across all workflows." Pinning just these three lines to SHAs would be inconsistent and create a one-off maintenance surface; SHA-pinning policy, if adopted, should be a repo-wide change in a separate PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
…dant nightly- tag prefix
- prepare-nightly: strip nightly-/nightly-manual- from CONTAINER_TAG (NGC
repo aiperf-nightly already conveys this; RELEASE_TYPE carries the
cron-vs-dispatch signal).
- stage-nightly-container: expose the staged multi-arch manifest reference
(${NGC_STAGING_IMAGE_BASE}/aiperf-nightly:${CONTAINER_TAG}) as a job output.
- trigger-gitlab-security-scan: forward it as variables[CONTAINER_IMAGE]
alongside the existing CONTAINER_TAG so GitLab can pull without having to
know the registry/repo layout.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
debermudez
left a comment
There was a problem hiding this comment.
overall looks solid. thanks for the work!
…+ aarch64 install note (#991) Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Example execution
This dramatically lowers runtime by parallelizing the test suite across multiple executors in one job and dynamically balancing expected run time.
Differentiates the location of staging the container and wheels when manually executed (with an override to make it look like a nightly in extraordinary situations).
Closes OPS-6560
Summary by CodeRabbit
Infrastructure
New Features
Refactor
Tests
Documentation