Commit 8d6ea21

EagleLo and cursoragent authored
[CI][Python Client] Retry operator build and extend readiness wait for transient flake (#4873)
* [test][autoscaler] add flaky repro instrumentation and partial mitigation

  Capture actor lifecycle evidence for autoscaler v2 idle-timeout flakiness and reintroduce actor restart resiliency to mitigate detached actor disappearance. Also widen one short scale-down assertion window to reduce false timeout failures while investigation continues.

  Co-authored-by: Cursor <cursoragent@cursor.com>

* [test][autoscaler] mitigate idle-timeout flake via CI capacity updates

  Increase Buildkite kind cluster capacity and raise head pod memory limits to reduce head instability under resource pressure. Also widen the final idle-timeout assertion window in autoscaler part2 to reduce timing-sensitive false failures while preserving test intent.

* [CI][Python Client] Retry operator build and extend readiness wait for transient flake

  Target: failing job "Test Python Client" in ray-ecosystem-ci-kuberay-ci #14521 (jid=019d9394-0f77-4c3a-9a72-71f433ea76cb)

  Observed failure chain in that run:
  - Docker pull of golang:1.25-bookworm hit 504 Gateway Time-out from registry-1.docker.io
  - make docker-image failed; operator image never loaded into kind
  - kubectl wait deployment/kuberay-operator timed out after 90s
  - Step exited with status 1

  Surgical mitigation (only .buildkite/test-python-client.yml):
  - Retry build-start-operator.sh once on failure with a short backoff
  - Increase operator readiness wait from 90s to 180s
  - On first readiness failure, dump diagnostics (pods, describe, logs) and retry the readiness wait once before failing the step

  Also revert the unrelated autoscaler changes that were on this branch so the PR only contains the fix for the targeted flaky job.

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent d835f47 commit 8d6ea21

1 file changed

Lines changed: 21 additions & 2 deletions

.buildkite/test-python-client.yml
@@ -7,8 +7,27 @@
   - kubectl config set clusters.kind-kind.server https://docker:6443
   # Build nightly KubeRay operator image
   - pushd ray-operator
-  - source ../.buildkite/build-start-operator.sh
-  - kubectl wait --timeout=90s --for=condition=Available=true deployment kuberay-operator
+  - |
+    for attempt in 1 2; do
+      if source ../.buildkite/build-start-operator.sh; then
+        break
+      fi
+      if [ "$attempt" -eq 2 ]; then
+        echo "--- ERROR: build-start-operator.sh failed twice"
+        exit 1
+      fi
+      echo "--- WARN: build-start-operator.sh failed; retrying in 20s"
+      sleep 20
+    done
+  - |
+    if ! kubectl wait --timeout=180s --for=condition=Available=true deployment kuberay-operator; then
+      echo "--- kuberay-operator not ready within 180s; collecting diagnostics"
+      kubectl get pods -A -o wide || true
+      kubectl describe deployment kuberay-operator || true
+      kubectl logs --tail=200 -l app.kubernetes.io/name=kuberay || true
+      echo "--- retrying kuberay-operator readiness wait once"
+      kubectl wait --timeout=180s --for=condition=Available=true deployment kuberay-operator
+    fi
   - popd
   # Setup Python environment and install Python client
   - echo "--- START:Setting up Python environment"
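
The commit writes the retry-once loop inline in the pipeline step. As a rough sketch only, the same retry-with-backoff pattern could be factored into a small shell helper. The function name `retry` and its argument order (attempt count, delay in seconds, then the command) are hypothetical and not part of this commit.

```shell
# Hypothetical helper, not part of the commit: run a command up to
# max_attempts times, sleeping delay_seconds between failed attempts.
retry() {
  local max_attempts="$1"
  local delay_seconds="$2"
  shift 2
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # Run the command in the current shell; stop on the first success.
    if "$@"; then
      return 0
    fi
    # Only back off when another attempt is still available.
    if [ "$attempt" -lt "$max_attempts" ]; then
      echo "--- WARN: '$*' failed (attempt ${attempt}/${max_attempts}); retrying in ${delay_seconds}s" >&2
      sleep "$delay_seconds"
    fi
    attempt=$((attempt + 1))
  done
  echo "--- ERROR: '$*' failed ${max_attempts} times" >&2
  return 1
}
```

With such a helper, the readiness wait might read `retry 2 0 kubectl wait --timeout=180s --for=condition=Available=true deployment kuberay-operator`, although that form would skip the diagnostics dump between attempts. For the build step, note that in bash a script sourced from inside a function runs in that function's scope, so anything the script sets with `declare` becomes local to the function; the inline loop used in the commit avoids that difference.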
