fix(ci): export MALLOC_ARENA_MAX=2 before pytest for component_integration#950
Conversation
…integration glibc reads MALLOC_ARENA_MAX at process startup, so setting it from inside tests/component_integration/conftest.py via os.environ.setdefault was a no-op for the running pytest worker — by then glibc had already initialized its arenas. Component_integration runs aiperf in-process (no subprocesses to inherit the var), so the only effective place to set it is the shell env before pytest starts. Prepend MALLOC_ARENA_MAX=2 to the four pytest invocations in the Makefile that target tests/component_integration/ (test-ci, test-component-integration, test-component-integration-ci, test-component-integration-verbose), and rewrite the conftest comment to reflect reality. This unbroke Ubuntu CI (3.11/3.12/3.13). macOS was unaffected because it uses a different allocator. The latent issue surfaced when #912 pulled in BoTorch/Optuna/scipy/torch and pushed the working set past what glibc's default 8×NCPU arenas could fit on 2-CPU runners, crashing xdist workers (visible as 66/83 systematic FAILED + final xdist INTERNALERROR "list.remove(x): x not in list"). Note: tests/integration/conftest.py keeps its setdefault — that suite spawns aiperf as subprocesses, so the var correctly propagates to children. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@88f0bac66c01f3e02734f9c054a64f9715b223c5Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@88f0bac66c01f3e02734f9c054a64f9715b223c5Last updated for commit: |
|
Warning Rate limit exceeded
You’ve run out of usage credits. Purchase more in the billing tab. ⌛ How to resolve this issue?After the wait time has elapsed, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout. Please see our FAQ for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (19)
WalkthroughThis PR sets MALLOC_ARENA_MAX=2 for component-integration pytest runs, removes a ChangesMemory Arena Configuration for Component Integration Tests
Dev dependency extras
Subprocess process-group helper consolidation
Pytest marker recategorization
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/component_integration/conftest.py`:
- Line 31: Replace the ambiguous multiplication character in the comment that
reads "8×NCPU" with a plain ASCII 'x' so it reads "8xNCPU" to avoid Ruff RUF003;
update the comment text in the same location (the comment containing "default
8×NCPU arenas blows out RAM in 2-CPU CI runners. glibc reads this") accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e1ca39c3-7961-4879-9f90-32b364cabb94
📒 Files selected for processing (2)
Makefiletests/component_integration/conftest.py
Duplicate approval (review lock race condition — now fixed)
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…tall
GitHub-hosted ubuntu-latest is the lowest-spec runner in our pool
(4 vCPU / 16 GB) and was rate-limiting unit-test wall time. Move the
Linux row of run-unit-tests onto the self-hosted prod-aiperf-builder
pool, add an installed-deps cache, and trim the install step to only
what CI actually needs.
- run-unit-tests.yml:
- Linux row -> prod-aiperf-builder-amd-v1 (macOS stays on
macos-latest -- no macOS in the self-hosted pool).
- Add astral-sh/setup-uv@v5 with enable-cache + uv.lock as the
cache key. Cold pod install drops from ~8 min to ~2-3 min on
warm-cache runs.
- Switch install step from make first-time-setup to make ci-install
(skips pre-commit hooks and redundant install-mock-server).
- Set PYTEST_XDIST_AUTO_NUM_WORKERS=4 on the test step. The builder
pod exposes 48 vCPUs but is RAM-capped at 6 GiB; -n auto without
the cap would OOM-kill the runner agent.
- timeout-minutes: 30 at job level (defensive cap).
- Update the Codecov-upload condition to match the new matrix key.
- Makefile: add a ci-install target -- venv + project install + plugin
enum/overload generation. Skips pre-commit install --install-hooks
and the duplicate install-mock-server that first-time-setup carries.
Component-integration test failures already on main since 94a9102 are
a separate fix.
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
PR #912 (commit 94a9102) rewrote tokenizer_validator.py and introduced a ProcessPoolExecutor-based HF cache prefetch in validate_tokenizer_early. The skip-prefetch gate ANDed HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE, but the component_integration conftest only set HF_HUB_OFFLINE, so the gate never engaged and the prefetch subprocesses ran. Those subprocesses bypass the in-process Tokenizer.from_pretrained patch (subprocess re-imports HF cleanly) and try to write to the real HF cache. In restricted environments (Linux CI containers, sandboxes) or under concurrent test execution that races on the cache directory, the write fails with EPERM and aiperf aborts with "Configuration resolution failed: [Errno 1] Operation not permitted" -- which surfaced as `request_count == 0` in every component_integration test that runs `aiperf profile ...` on Linux CI (ubuntu-latest and the new builder pool). This has been the cause of run-unit-tests failures on main since 2026-05-15. - src/aiperf/common/tokenizer_validator.py: change the skip-prefetch gate from AND to OR. Either env var being set is enough -- both mean "I have a warm cache, do not touch the network/disk." Requiring both was overly conservative and is what masked the test-harness bug. - tests/component_integration/conftest.py: hf_offline_mode fixture now sets both HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE so the prefetch skip-gate engages even on older aiperf builds where the gate is still ANDed. Restoring prior values for both on teardown. Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
cleanup-leak tests in tests/unit/timing/test_branch_orchestrator{,_adversarial,_adversarial_full}.py
called caplog.at_level(WARNING, logger="aiperf.timing.branch_orchestrator"),
but the warnings they're asserting on are emitted from the sibling module
aiperf.timing._branch_orchestrator_logging (BranchOrchestratorLoggingMixin
was split into its own file with logger=logging.getLogger(__name__)).
On most setups this still happened to work because pytest's caplog catches
at root and propagation from the sibling logger covered for the wrong
target -- but it's fragile: any environment or earlier test that bumps
the sibling logger's effective level above WARNING silently drops the
record before caplog sees it. That's what was hitting Linux/Python 3.13
on the builder pool runs of PR #955 (build (linux, 3.13)):
FAILED test_cleanup_emits_leak_warning_when_state_nonempty
- assert 0 == 1
FAILED test_cleanup_with_pending_joins_logs_leak_warning
- assert 0 == 1
FAILED test_cleanup_clears_pre_dispatched_and_logs_leak
- assert 0 == 1
Point caplog at the actual emitting logger
(aiperf.timing._branch_orchestrator_logging) so the level gate engages
where it matters. Fixes all 5 occurrences across the 3 files.
Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
- actions/checkout v4 -> v6 - actions/setup-python v5 -> v6 - astral-sh/setup-uv v5 -> v8 Latest stable major versions as of 2026-05-19. Other workflows (nightly.yml, fern-docs.yml, etc.) still pin older majors; updating those is out of scope for this PR. Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Previous bump to @v8 broke the job: astral-sh/setup-uv publishes v8.0.0 and v8.1.0 as exact tags but has NOT yet published a floating v8 major tag (v1..v7 exist; v8 does not). @v8 resolves to nothing. Pin to v7 to match the major-only style of the other actions in this workflow. Switch to v8 once astral-sh publishes the floating v8 tag. Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Move shared subprocess process-group setup into the test harness and mark expensive tests so CI can run the intended suite split reliably. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Skip tokenizer preloading in component integration tests and keep forked process-pool children from inheriting the pytest os._exit shim, while preserving CI xdist coverage and Sobol warning assertions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Component integration tests now skip tokenizer preload, so the process-pool reproduction is no longer needed; keep the os._exit harness isolation itself. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
…ner' into ajc/fix-tests
Move measured expensive unit tests behind slow/stress markers and register the existing network marker so default unit runs stay focused. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
…ation (#950) Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com> Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
app(...)in-process, ending in an xdistINTERNALERROR: list.remove(x): x not in list(the signature of a worker dying mid-run).os.environ.setdefault("MALLOC_ARENA_MAX", "2")attests/component_integration/conftest.py:33is a no-op for the running pytest worker — glibc readsMALLOC_ARENA_MAXonce at process startup, before Python imports run. Component_integration runs aiperf in-process (no subprocesses to inherit the var), so setting it from inside the conftest never actually capped arenas.Fix
Prepend
MALLOC_ARENA_MAX=2to the four pytest invocations inMakefilethat targettests/component_integration/(test-ci,test-component-integration,test-component-integration-ci,test-component-integration-verbose), so the var is in the shell env before pytest forks workers. Rewrite the misleading conftest comment to point at the Makefile.tests/integration/conftest.pyis left alone — that suite spawns aiperf as a subprocess, and the conftestsetdefaultcorrectly propagates the var to children.Test plan
MALLOC_ARENA_MAX=2 uv run pytest tests/component_integration/cli/ -m component_integration -n auto→ 15/15 passing locallymake -n test-ciandmake -n test-component-integration-cishowMALLOC_ARENA_MAX=2in the expanded pytest commandSummary by CodeRabbit
Chores
Tests