chore: raise PROFILE_CONFIGURE_TIMEOUT default to 600s#936
Conversation
Try out this PRQuick install: pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7474b132f7f65d372b034cbae06cd5a7bcb271c5Recommended with virtual environment (using uv): uv venv --python 3.12 && source .venv/bin/activate
uv pip install --upgrade --force-reinstall git+https://github.com/ai-dynamo/aiperf.git@7474b132f7f65d372b034cbae06cd5a7bcb271c5Last updated for commit: |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (2)
✅ Files skipped from review due to trivial changes (1)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughThe profile configuration timeout default is increased from 300.0 to 600.0 seconds. Both the environment settings class and documentation are updated to reflect this change consistently. ChangesProfile Configure Timeout Configuration
Estimated code review effort🎯 1 (Trivial) | ⏱️ ~3 minutes Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/aiperf/common/base_component_service.py`:
- Around line 62-72: The docstring for activity guidance (referenced symbols:
mark_activity, activity_pulse, last_activity_ns) incorrectly states that
activity updates extend/reset controller deadlines; update the wording to
reflect current controller behavior: state that mark_activity and activity_pulse
only update/persist last_activity_ns for observability/audit and do not change
the wall-clock command timeout enforced by the controller, and mention that
controllers still use fixed wall-clock timeouts so callers must not assume
activity will extend deadlines. Keep references to using activity_pulse for
long-running unbounded work but clarify its purpose as periodic heartbeat
persistence rather than deadline extension.
In `@src/aiperf/dataset/dataset_manager.py`:
- Around line 126-132: Update the docstring that mentions activity_pulse in
dataset_manager.py to remove the implication that it "resets the controller's
idle deadline" or prevents "kill-for-idle"; instead state that wrapping the
handler in activity_pulse ensures activity timestamps are updated/propagated
(for liveness visibility) while slow work runs in run_in_executor (e.g.,
public-dataset cold-cache materialization), and note that controller timeouts
remain governed by the fixed profile-configure timeout. Mention the
activity_pulse helper by name and the slow-path run_in_executor behavior so
readers can locate the affected code.
In `@tests/unit/dataset/test_dataset_manager.py`:
- Around line 155-161: The test currently compares global asyncio.all_tasks()
counts which can flake; instead capture the manager's last_activity_ns before
entering the context (or immediately after entering) and after exiting ensure it
no longer updates: call initialized_dataset_manager.last_activity_ns before the
async with, enter and exit async with
initialized_dataset_manager.activity_pulse(...), await a short asyncio.sleep(0)
(or a couple ticks) to let cancellation propagate, then read last_activity_ns
again and assert the value is equal to the prior value (or that it hasn't
changed across a small interval) to deterministically verify the pulse stopped;
reference initialized_dataset_manager.activity_pulse and
initialized_dataset_manager.last_activity_ns in your change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 44ce0b0e-f122-4073-ae4b-80e9d30908c1
📒 Files selected for processing (9)
docs/environment-variables.mdsrc/aiperf/common/base_component_service.pysrc/aiperf/common/environment.pysrc/aiperf/common/messages/service_messages.pysrc/aiperf/common/models/service_models.pysrc/aiperf/controller/system_controller.pysrc/aiperf/dataset/dataset_manager.pytests/unit/common/messages/test_messages.pytests/unit/dataset/test_dataset_manager.py
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
ajcasagrande
left a comment
There was a problem hiding this comment.
LLM-ergonomics review: I found a few agent-facing maintainability issues around the new heartbeat activity API. Mechanical checks were clean (make check-ergonomics, make check-ruff-baselined).
afd0001 to
bd0ea0b
Compare
Summary
Raises the
AIPERF_SERVICE_PROFILE_CONFIGURE_TIMEOUTdefault from 300s to 600s. The 300s wall-clock was firing during legitimate progress on cold HF--public-dataset(e.g. AIMO) materialization in nightlies.This is a stopgap. The structural fix — moving DatasetManager resource acquisition out of the
PROFILE_CONFIGUREcommand handler and into service startup, eliminating the cross-service_dataset_configured_eventwait — is filed as AIP-900. Once that lands, this default can come back down to a tight value.Test plan
docs/environment-variables.mdregenerated viamake generate-env-vars-docstest_docs_end_to_endwith--public-datasetcompletes within the new 600s budgetSummary by CodeRabbit