Commit 8d6ea21
[CI][Python Client] Retry operator build and extend readiness wait for transient flake (#4873)
* [test][autoscaler] add flaky repro instrumentation and partial mitigation
Capture actor lifecycle evidence for autoscaler v2 idle-timeout flakiness and reintroduce actor restart resiliency to mitigate detached actor disappearance. Also widen one short scale-down assertion window to reduce false timeout failures while investigation continues.
Co-authored-by: Cursor <cursoragent@cursor.com>
* [test][autoscaler] mitigate idle-timeout flake via CI capacity updates
Increase Buildkite kind cluster capacity and raise head pod memory limits to reduce head instability under resource pressure. Also widen the final idle-timeout assertion window in autoscaler part2 to reduce timing-sensitive false failures while preserving test intent.
* [CI][Python Client] Retry operator build and extend readiness wait for transient flake
Target: failing job "Test Python Client" in ray-ecosystem-ci-kuberay-ci #14521
(jid=019d9394-0f77-4c3a-9a72-71f433ea76cb)
Observed failure chain in that run:
- Docker pull of golang:1.25-bookworm hit 504 Gateway Time-out from registry-1.docker.io
- make docker-image failed; operator image never loaded into kind
- kubectl wait deployment/kuberay-operator timed out after 90s
- Step exited with status 1
Surgical mitigation (only .buildkite/test-python-client.yml):
- Retry build-start-operator.sh once on failure with a short backoff
- Increase operator readiness wait from 90s to 180s
- On first readiness failure, dump diagnostics (pods, describe, logs)
and retry the readiness wait once before failing the step
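
The mitigation steps above can be sketched in shell roughly as follows. This is a minimal illustration, not the actual contents of .buildkite/test-python-client.yml: the function names, the `RETRY_BACKOFF_SECONDS` and `READY_TIMEOUT` variables, the `Available` condition, the log tail length, and the use of the default namespace are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating "retry once" and "dump diagnostics, then
# retry the readiness wait once". Names and defaults are assumptions.

# Run a command; if it fails, sleep for a short backoff and run it once more.
# The exit status is that of the last attempt.
retry_once() {
  if "$@"; then
    return 0
  fi
  echo "First attempt failed: $*; retrying in ${RETRY_BACKOFF_SECONDS:-10}s" >&2
  sleep "${RETRY_BACKOFF_SECONDS:-10}"
  "$@"
}

# Wait for the operator deployment to become ready (180s instead of 90s).
# The Available condition is an assumption; the commit does not state it.
wait_for_operator() {
  kubectl wait --for=condition=Available deployment/kuberay-operator \
    --timeout="${READY_TIMEOUT:-180s}"
}

# Print pods, deployment description, and recent logs. Each command is
# allowed to fail so that diagnostics never mask the original error.
dump_operator_diagnostics() {
  kubectl get pods -A || true
  kubectl describe deployment kuberay-operator || true
  kubectl logs deployment/kuberay-operator --tail=200 || true
}

# On the first readiness failure, dump diagnostics and wait once more.
wait_for_operator_with_retry() {
  if wait_for_operator; then
    return 0
  fi
  echo "Operator not ready after first wait; dumping diagnostics" >&2
  dump_operator_diagnostics
  wait_for_operator
}
```

A pipeline step could then call `retry_once` on build-start-operator.sh (at whatever path the pipeline already uses) followed by `wait_for_operator_with_retry`. Retrying exactly once keeps the worst-case step time bounded while still absorbing a single transient registry error such as the 504 described above.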
Also revert the unrelated autoscaler changes that were on this branch so
the PR only contains the fix for the targeted flaky job.
---------
Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent d835f47 commit 8d6ea21
1 file changed
Lines changed: 21 additions & 2 deletions
File: .buildkite/test-python-client.yml (diff body not captured; original lines 10-11 were replaced by new lines 10-30)