Skip to content

fix(ci): export MALLOC_ARENA_MAX=2 before pytest for component_integration#950

Merged
ajcasagrande merged 14 commits into
mainfrom
ajc/fix-tests
May 19, 2026
Merged

fix(ci): export MALLOC_ARENA_MAX=2 before pytest for component_integration#950
ajcasagrande merged 14 commits into
mainfrom
ajc/fix-tests

Conversation

@ajcasagrande

@ajcasagrande ajcasagrande commented May 16, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Ubuntu CI (3.11/3.12/3.13) for unit + component_integration tests has been failing since feat: YAML-native v2 config + adaptive sweep orchestrator with BO & search recipes #912; macOS is fine. Symptom: ~66/83 systematic FAILEDs the moment any test calls app(...) in-process, ending in an xdist INTERNALERROR: list.remove(x): x not in list (the signature of a worker dying mid-run).
  • Root cause: os.environ.setdefault("MALLOC_ARENA_MAX", "2") at tests/component_integration/conftest.py:33 is a no-op for the running pytest worker — glibc reads MALLOC_ARENA_MAX once at process startup, before Python imports run. Component_integration runs aiperf in-process (no subprocesses to inherit the var), so setting it from inside the conftest never actually capped arenas.
  • feat: YAML-native v2 config + adaptive sweep orchestrator with BO & search recipes #912 pulled in BoTorch / Optuna / scipy / torch, which pushed the working set past what glibc's default 8×NCPU arenas can fit on the 2-CPU GitHub runners → xdist workers crashed.

Fix

Prepend MALLOC_ARENA_MAX=2 to the four pytest invocations in Makefile that target tests/component_integration/ (test-ci, test-component-integration, test-component-integration-ci, test-component-integration-verbose), so the var is in the shell env before pytest forks workers. Rewrite the misleading conftest comment to point at the Makefile.

tests/integration/conftest.py is left alone — that suite spawns aiperf as a subprocess, and the conftest setdefault correctly propagates the var to children.

Test plan

  • MALLOC_ARENA_MAX=2 uv run pytest tests/component_integration/cli/ -m component_integration -n auto → 15/15 passing locally
  • make -n test-ci and make -n test-component-integration-ci show MALLOC_ARENA_MAX=2 in the expanded pytest command
  • CI on this PR shows Ubuntu jobs green

Summary by CodeRabbit

  • Chores

    • Adjusted test-run environment to set a memory-arena limit and removed automatic parallel test-worker flag for CI component tests.
    • Reduced dev extras by removing one optional sub-dependency.
  • Tests

    • Reclassified multiple component/integration tests with slow/stress markers to refine test selection and prioritization.
    • Consolidated subprocess-handling into a shared test harness helper for consistent test process behavior.

Review Change Stack

…integration

glibc reads MALLOC_ARENA_MAX at process startup, so setting it from
inside tests/component_integration/conftest.py via os.environ.setdefault
was a no-op for the running pytest worker — by then glibc had already
initialized its arenas.

Component_integration runs aiperf in-process (no subprocesses to
inherit the var), so the only effective place to set it is the shell
env before pytest starts. Prepend MALLOC_ARENA_MAX=2 to the four
pytest invocations in the Makefile that target tests/component_integration/
(test-ci, test-component-integration, test-component-integration-ci,
test-component-integration-verbose), and rewrite the conftest comment
to reflect reality.

This unbroke Ubuntu CI (3.11/3.12/3.13). macOS was unaffected because
it uses a different allocator. The latent issue surfaced when #912
pulled in BoTorch/Optuna/scipy/torch and pushed the working set past
what glibc's default 8×NCPU arenas could fit on 2-CPU runners,
crashing xdist workers (visible as 66/83 systematic FAILED + final
xdist INTERNALERROR "list.remove(x): x not in list").

Note: tests/integration/conftest.py keeps its setdefault — that suite
spawns aiperf as subprocesses, so the var correctly propagates to
children.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@github-actions

github-actions Bot commented May 16, 2026

Copy link
Copy Markdown

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@88f0bac66c01f3e02734f9c054a64f9715b223c5

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@88f0bac66c01f3e02734f9c054a64f9715b223c5

Last updated for commit: 88f0bacBrowse code

@github-actions github-actions Bot added the fix label May 16, 2026
@ajcasagrande ajcasagrande enabled auto-merge (squash) May 16, 2026 00:39
@coderabbitai

coderabbitai Bot commented May 16, 2026

Copy link
Copy Markdown
Contributor

Warning

Rate limit exceeded

@ajcasagrande has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 19 minutes and 52 seconds before requesting another review.

You’ve run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c71bf665-052a-451d-8f29-a73fb061f5d2

📥 Commits

Reviewing files that changed from the base of the PR and between ae0358e and 88f0bac.

📒 Files selected for processing (19)
  • .github/workflows/run-unit-tests.yml
  • Makefile
  • docs/environment-variables.md
  • pyproject.toml
  • src/aiperf/common/environment.py
  • src/aiperf/common/tokenizer_validator.py
  • src/aiperf/config/sweep/expand_qmc.py
  • tests/component_integration/conftest.py
  • tests/unit/common/test_tokenizer_validator.py
  • tests/unit/config/test_sweep_qmc.py
  • tests/unit/config/test_sweep_qmc_properties.py
  • tests/unit/config/test_sweep_round3_fixes.py
  • tests/unit/dataset/agentic_code_gen/test_roundtrip.py
  • tests/unit/orchestrator/search_planner/test_optuna_dsp_kernel.py
  • tests/unit/orchestrator/test_execute_adaptive_search.py
  • tests/unit/search_recipes/test_recipes_round3.py
  • tests/unit/timing/test_branch_orchestrator.py
  • tests/unit/timing/test_branch_orchestrator_adversarial.py
  • tests/unit/timing/test_branch_orchestrator_adversarial_full.py

Walkthrough

This PR sets MALLOC_ARENA_MAX=2 for component-integration pytest runs, removes a -n auto xdist flag from test-ci, clarifies the conftest comment about Makefile precedence, moves a subprocess process-group helper into tests.harness.subprocess (and imports it), updates one dev extra in pyproject, and reclassifies several tests with slow/stress/integration pytest markers.

Changes

Memory Arena Configuration for Component Integration Tests

Layer / File(s) Summary
Makefile pytest environment variable
Makefile
test-ci, test-component-integration, test-component-integration-ci, and test-component-integration-verbose now prefix component-integration pytest invocations with MALLOC_ARENA_MAX=2 (and test-ci drops -n auto where previously present).
Configuration documentation
tests/component_integration/conftest.py
Clarified comment: glibc reads MALLOC_ARENA_MAX at process startup and the Makefile export is authoritative; the in-file os.environ.setdefault is a fallback.

Dev dependency extras

Layer / File(s) Summary
pyproject dev extras update
pyproject.toml
project.optional-dependencies.dev now references aiperf[mlflow,otel] (removed botorch extra).

Subprocess process-group helper consolidation

Layer / File(s) Summary
Add harness helper
tests/harness/subprocess.py
Adds _new_process_group_kwargs() to detect subprocess.Popen process_group support and return either {"process_group": 0} or {"start_new_session": True}.
Integration conftest import
tests/integration/conftest.py
Stops defining _new_process_group_kwargs locally, imports it from tests.harness.subprocess, and removes the local inspect dependency.
Consumer test import updates
tests/component_integration/test_process_title.py
Updated import to obtain _new_process_group_kwargs from tests.harness.subprocess instead of tests.integration.conftest.

Pytest marker recategorization

Layer / File(s) Summary
Add slow/stress markers and reclassify tests
tests/component_integration/dataset/test_dataset_sampling.py, tests/component_integration/test_multi_objective_e2e.py, tests/component_integration/test_search_e2e.py, tests/component_integration/timing/test_advanced_scenarios.py, tests/component_integration/timing/test_dag_timing_pathology.py, tests/integration/plot/test_plot_mlflow_upload.py, tests/integration/post_processors/test_mlflow_live_correctness.py, tests/integration/post_processors/test_otel_live_export.py, tests/integration/post_processors/test_otel_payload_content.py
Added @pytest.mark.slow or @pytest.mark.stress decorators and replaced several @pytest.mark.component_integration markers with @pytest.mark.integration plus @pytest.mark.slow where shown; no test logic or assertions changed.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nudge the Makefile, set arenas to two,
A harness helper hops into view,
Tests get re-marked, slow and steady pace,
Dev extras pruned with quiet grace,
A rabbit's cheer for a tidier test-roo!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 68.75% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title clearly and concisely summarizes the main fix: exporting MALLOC_ARENA_MAX=2 before pytest for component_integration tests. It directly addresses the primary change across the Makefile modifications.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/component_integration/conftest.py`:
- Line 31: Replace the ambiguous multiplication character in the comment that
reads "8×NCPU" with a plain ASCII 'x' so it reads "8xNCPU" to avoid Ruff RUF003;
update the comment text in the same location (the comment containing "default
8×NCPU arenas blows out RAM in 2-CPU CI runners. glibc reads this") accordingly.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e1ca39c3-7961-4879-9f90-32b364cabb94

📥 Commits

Reviewing files that changed from the base of the PR and between bac953d and 2cf5b4c.

📒 Files selected for processing (2)
  • Makefile
  • tests/component_integration/conftest.py
Comment thread tests/component_integration/conftest.py
dynamo-ops
dynamo-ops previously approved these changes May 16, 2026
dynamo-ops
dynamo-ops previously approved these changes May 16, 2026
@dynamo-ops dynamo-ops dismissed stale reviews from themself May 16, 2026 01:05

Duplicate approval (review lock race condition — now fixed)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
saturley-hall and others added 11 commits May 18, 2026 21:33
…tall

GitHub-hosted ubuntu-latest is the lowest-spec runner in our pool
(4 vCPU / 16 GB) and was rate-limiting unit-test wall time. Move the
Linux row of run-unit-tests onto the self-hosted prod-aiperf-builder
pool, add an installed-deps cache, and trim the install step to only
what CI actually needs.

- run-unit-tests.yml:
  - Linux row -> prod-aiperf-builder-amd-v1 (macOS stays on
    macos-latest -- no macOS in the self-hosted pool).
  - Add astral-sh/setup-uv@v5 with enable-cache + uv.lock as the
    cache key. Cold pod install drops from ~8 min to ~2-3 min on
    warm-cache runs.
  - Switch install step from make first-time-setup to make ci-install
    (skips pre-commit hooks and redundant install-mock-server).
  - Set PYTEST_XDIST_AUTO_NUM_WORKERS=4 on the test step. The builder
    pod exposes 48 vCPUs but is RAM-capped at 6 GiB; -n auto without
    the cap would OOM-kill the runner agent.
  - timeout-minutes: 30 at job level (defensive cap).
  - Update the Codecov-upload condition to match the new matrix key.

- Makefile: add a ci-install target -- venv + project install + plugin
  enum/overload generation. Skips pre-commit install --install-hooks
  and the duplicate install-mock-server that first-time-setup carries.

Component-integration test failures already on main since 94a9102 are
a separate fix.

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
PR #912 (commit 94a9102) rewrote tokenizer_validator.py and introduced
a ProcessPoolExecutor-based HF cache prefetch in validate_tokenizer_early.
The skip-prefetch gate ANDed HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE, but
the component_integration conftest only set HF_HUB_OFFLINE, so the gate
never engaged and the prefetch subprocesses ran. Those subprocesses
bypass the in-process Tokenizer.from_pretrained patch (subprocess
re-imports HF cleanly) and try to write to the real HF cache. In
restricted environments (Linux CI containers, sandboxes) or under
concurrent test execution that races on the cache directory, the write
fails with EPERM and aiperf aborts with "Configuration resolution failed:
[Errno 1] Operation not permitted" -- which surfaced as `request_count
== 0` in every component_integration test that runs `aiperf profile ...`
on Linux CI (ubuntu-latest and the new builder pool).

This has been the cause of run-unit-tests failures on main since
2026-05-15.

- src/aiperf/common/tokenizer_validator.py: change the skip-prefetch
  gate from AND to OR. Either env var being set is enough -- both mean
  "I have a warm cache, do not touch the network/disk." Requiring both
  was overly conservative and is what masked the test-harness bug.

- tests/component_integration/conftest.py: hf_offline_mode fixture now
  sets both HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE so the prefetch
  skip-gate engages even on older aiperf builds where the gate is still
  ANDed. Restoring prior values for both on teardown.

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
cleanup-leak tests in tests/unit/timing/test_branch_orchestrator{,_adversarial,_adversarial_full}.py
called caplog.at_level(WARNING, logger="aiperf.timing.branch_orchestrator"),
but the warnings they're asserting on are emitted from the sibling module
aiperf.timing._branch_orchestrator_logging (BranchOrchestratorLoggingMixin
was split into its own file with logger=logging.getLogger(__name__)).

On most setups this still happened to work because pytest's caplog catches
at root and propagation from the sibling logger covered for the wrong
target -- but it's fragile: any environment or earlier test that bumps
the sibling logger's effective level above WARNING silently drops the
record before caplog sees it. That's what was hitting Linux/Python 3.13
on the builder pool runs of PR #955 (build (linux, 3.13)):

    FAILED test_cleanup_emits_leak_warning_when_state_nonempty
        - assert 0 == 1
    FAILED test_cleanup_with_pending_joins_logs_leak_warning
        - assert 0 == 1
    FAILED test_cleanup_clears_pre_dispatched_and_logs_leak
        - assert 0 == 1

Point caplog at the actual emitting logger
(aiperf.timing._branch_orchestrator_logging) so the level gate engages
where it matters. Fixes all 5 occurrences across the 3 files.

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
- actions/checkout v4 -> v6
- actions/setup-python v5 -> v6
- astral-sh/setup-uv v5 -> v8

Latest stable major versions as of 2026-05-19. Other workflows
(nightly.yml, fern-docs.yml, etc.) still pin older majors; updating
those is out of scope for this PR.

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Previous bump to @v8 broke the job: astral-sh/setup-uv publishes
v8.0.0 and v8.1.0 as exact tags but has NOT yet published a floating
v8 major tag (v1..v7 exist; v8 does not). @v8 resolves to nothing.

Pin to v7 to match the major-only style of the other actions in this
workflow. Switch to v8 once astral-sh publishes the floating v8 tag.

Signed-off-by: Harrison King Saturley-Hall <hsaturleyhal@nvidia.com>
Move shared subprocess process-group setup into the test harness and mark expensive tests so CI can run the intended suite split reliably.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Skip tokenizer preloading in component integration tests and keep forked process-pool children from inheriting the pytest os._exit shim, while preserving CI xdist coverage and Sobol warning assertions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Component integration tests now skip tokenizer preload, so the process-pool reproduction is no longer needed; keep the os._exit harness isolation itself.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Move measured expensive unit tests behind slow/stress markers and register the existing network marker so default unit runs stay focused.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
@ajcasagrande ajcasagrande merged commit 93c7f1e into main May 19, 2026
29 checks passed
@ajcasagrande ajcasagrande deleted the ajc/fix-tests branch May 19, 2026 04:33
@codecov

codecov Bot commented May 19, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 91.66667% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/aiperf/common/tokenizer_validator.py 80.00% 0 Missing and 1 partial ⚠️

📢 Thoughts on this report? Let us know!

ajcasagrande added a commit that referenced this pull request May 19, 2026
…ation (#950)

Signed-off-by: Anthony Casagrande <acasagrande@nvidia.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

4 participants