Skip to content

chore: raise PROFILE_CONFIGURE_TIMEOUT default to 600s#936

Merged
matthewkotila merged 2 commits into
mainfrom
mkotila/heartbeat-progress-deadlines
May 18, 2026
Merged

chore: raise PROFILE_CONFIGURE_TIMEOUT default to 600s#936
matthewkotila merged 2 commits into
mainfrom
mkotila/heartbeat-progress-deadlines

Conversation

@matthewkotila

@matthewkotila matthewkotila commented May 14, 2026

Copy link
Copy Markdown
Contributor

Summary

Raises the AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUT default from 300s to 600s. The 300s wall-clock was firing during legitimate progress on cold HF --public-dataset (e.g. AIMO) materialization in nightlies.

This is a stopgap. The structural fix — moving DatasetManager resource acquisition out of the PROFILE_CONFIGURE command handler and into service startup, eliminating the cross-service _dataset_configured_event wait — is filed as AIP-900. Once that lands, this default can come back down to a tight value.

Test plan

  • docs/environment-variables.md regenerated via make generate-env-vars-docs
  • Pre-commit hooks pass
  • CI: nightly test_docs_end_to_end with --public-dataset completes within the new 600s budget

Summary by CodeRabbit

  • Bug Fixes
    • Increased the default profile configure timeout from 300 to 600 seconds. This change raises the built-in timeout used when no environment override is provided, allowing longer-running profile configuration operations to complete before timing out while retaining existing validation constraints.

Review Change Stack

@github-actions github-actions Bot added the feat label May 14, 2026
@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown

Try out this PR

Quick install:

pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7474b132f7f65d372b034cbae06cd5a7bcb271c5

Recommended with virtual environment (using uv):

uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7474b132f7f65d372b034cbae06cd5a7bcb271c5

Last updated for commit: 7474b13Browse code

@github-actions

github-actions Bot commented May 14, 2026

Copy link
Copy Markdown
@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 2da5d408-0585-4ed8-a07a-fc93fff486f1

📥 Commits

Reviewing files that changed from the base of the PR and between bd0ea0b and 7474b13.

📒 Files selected for processing (2)
  • docs/environment-variables.md
  • src/aiperf/common/environment.py
✅ Files skipped from review due to trivial changes (1)
  • docs/environment-variables.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/aiperf/common/environment.py

Walkthrough

The profile configuration timeout default is increased from 300.0 to 600.0 seconds. Both the environment settings class and documentation are updated to reflect this change consistently.

Changes

Profile Configure Timeout Configuration

Layer / File(s) Summary
Profile Configure Timeout Default Update
src/aiperf/common/environment.py, docs/environment-variables.md
_ServiceSettings.PROFILE_CONFIGURE_TIMEOUT default and corresponding documentation are both updated from 300.0 to 600.0 seconds.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

🐰 A timeout grows from five to ten,
Profile configure now has more when,
Config and docs in sync once more,
Six hundred seconds at the core! ⏰

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately and concisely describes the main change: increasing the PROFILE_CONFIGURE_TIMEOUT default from 300s to 600s, which is the primary modification across all changed files.
Docstring Coverage ✅ Passed Docstring coverage is 86.67% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/aiperf/common/base_component_service.py`:
- Around line 62-72: The docstring for activity guidance (referenced symbols:
mark_activity, activity_pulse, last_activity_ns) incorrectly states that
activity updates extend/reset controller deadlines; update the wording to
reflect current controller behavior: state that mark_activity and activity_pulse
only update/persist last_activity_ns for observability/audit and do not change
the wall-clock command timeout enforced by the controller, and mention that
controllers still use fixed wall-clock timeouts so callers must not assume
activity will extend deadlines. Keep references to using activity_pulse for
long-running unbounded work but clarify its purpose as periodic heartbeat
persistence rather than deadline extension.

In `@src/aiperf/dataset/dataset_manager.py`:
- Around line 126-132: Update the docstring that mentions activity_pulse in
dataset_manager.py to remove the implication that it "resets the controller's
idle deadline" or prevents "kill-for-idle"; instead state that wrapping the
handler in activity_pulse ensures activity timestamps are updated/propagated
(for liveness visibility) while slow work runs in run_in_executor (e.g.,
public-dataset cold-cache materialization), and note that controller timeouts
remain governed by the fixed profile-configure timeout. Mention the
activity_pulse helper by name and the slow-path run_in_executor behavior so
readers can locate the affected code.

In `@tests/unit/dataset/test_dataset_manager.py`:
- Around line 155-161: The test currently compares global asyncio.all_tasks()
counts which can flake; instead capture the manager's last_activity_ns before
entering the context (or immediately after entering) and after exiting ensure it
no longer updates: call initialized_dataset_manager.last_activity_ns before the
async with, enter and exit async with
initialized_dataset_manager.activity_pulse(...), await a short asyncio.sleep(0)
(or a couple ticks) to let cancellation propagate, then read last_activity_ns
again and assert the value is equal to the prior value (or that it hasn't
changed across a small interval) to deterministically verify the pulse stopped;
reference initialized_dataset_manager.activity_pulse and
initialized_dataset_manager.last_activity_ns in your change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 44ce0b0e-f122-4073-ae4b-80e9d30908c1

📥 Commits

Reviewing files that changed from the base of the PR and between e7602bc and b259919.

📒 Files selected for processing (9)
  • docs/environment-variables.md
  • src/aiperf/common/base_component_service.py
  • src/aiperf/common/environment.py
  • src/aiperf/common/messages/service_messages.py
  • src/aiperf/common/models/service_models.py
  • src/aiperf/controller/system_controller.py
  • src/aiperf/dataset/dataset_manager.py
  • tests/unit/common/messages/test_messages.py
  • tests/unit/dataset/test_dataset_manager.py
Comment thread src/aiperf/common/base_component_service.py Outdated
Comment thread src/aiperf/dataset/dataset_manager.py Outdated
Comment thread tests/unit/dataset/test_dataset_manager.py Outdated
@codecov

codecov Bot commented May 14, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@ajcasagrande ajcasagrande left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LLM-ergonomics review: I found a few agent-facing maintainability issues around the new heartbeat activity API. Mechanical checks were clean (make check-ergonomics, make check-ruff-baselined).

Comment thread src/aiperf/common/base_component_service.py Outdated
Comment thread src/aiperf/common/base_component_service.py Outdated
Comment thread src/aiperf/common/base_component_service.py Outdated
@matthewkotila matthewkotila force-pushed the mkotila/heartbeat-progress-deadlines branch from afd0001 to bd0ea0b Compare May 14, 2026 17:32
@matthewkotila matthewkotila changed the title feat: heartbeat last_activity_ns + bump profile-configure timeout to 600s May 14, 2026
@github-actions github-actions Bot added chore and removed feat labels May 14, 2026
@matthewkotila matthewkotila merged commit 318824a into main May 18, 2026
24 of 30 checks passed
@matthewkotila matthewkotila deleted the mkotila/heartbeat-progress-deadlines branch May 18, 2026 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 participants