[chore] fix cronjob crd inconsistent #4292
Merged
Signed-off-by: Rueian <rueiancsie@gmail.com>
win5923 approved these changes on Dec 20, 2025
pipo02mix added a commit to giantswarm/kuberay that referenced this pull request on May 19, 2026
* [APIServer][Docs] Add user guide for retry behavior & configuration (#4144) * [Docs] Add the draft description about feature intro, configurations, and usecases Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] Update the retry walk-through Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Doc] rewrite the first 2 sections Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Doc] Revise documentation wording and add Observing Retry Behavior section Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] fix linting issue by running pre-commit run berfore commiting Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] fix linting errors in the Markdown linting Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] Clean up the math equation Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * Update the math formula of Backoff calculation. Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> * [Fix] Explicitly mentioned exponential backoff and removed the customization parts Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Docs] Clarify naming by replacing “APIServer” with “KubeRay APIServer” Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> * [Docs] Rename retry-configuration.md to retry-behavior.md for accuracy Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * Update Title to KubeRay APIServer Retry Behavior Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> * [Docs] Add a note about the limitation of retry configuration Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> --------- Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> 
* Support X-Ray-Authorization fallback header for accepting auth token via proxy (#4213) * Support X-Ray-Authorization fallback header for accepting auth token in dashboard Signed-off-by: Future-Outlier <eric901201@gmail.com> * remove todo comment Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> * [RayCluster] make auth token secret name consistency (#4216) Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayCluster] Status includes head containter status message (#4196) * [RayCluster] Status includes head containter status message Signed-off-by: Spencer Peterson <spencerjp@google.com> * lint Signed-off-by: Spencer Peterson <spencerjp@google.com> * [RayCluster] Containers not ready status reflects structured reason Signed-off-by: Spencer Peterson <spencerjp@google.com> * nit Signed-off-by: Spencer Peterson <spencerjp@google.com> --------- Signed-off-by: Spencer Peterson <spencerjp@google.com> * Remove erroneous call in applyServeTargetCapacity (#4212) Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> * [RayJob] Add token authentication support for light weight job submitter (#4215) * [RayJob] light weight job submitter auth token support Signed-off-by: Future-Outlier <eric901201@gmail.com> * X-Ray-Authorization Signed-off-by: Rueian <rueiancsie@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> * feat: kubectl ray get token command (#4218) * feat: kubectl ray get token command Signed-off-by: Rueian <rueiancsie@gmail.com> * Update kubectl-plugin/pkg/cmd/get/get_token_test.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Rueian <rueiancsie@gmail.com> * Update kubectl-plugin/pkg/cmd/get/get_token.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Rueian <rueiancsie@gmail.com> * make sure the raycluster exists 
before getting the secret Signed-off-by: Rueian <rueiancsie@gmail.com> * better ux Signed-off-by: Rueian <rueiancsie@gmail.com> * Update kubectl-plugin/pkg/cmd/get/get_token.go Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> --------- Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> * feat: upgrade to Ray 2.52.0 to support token auth mode (#4152) * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> * andrew's comment Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * Revert ray ml image Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> * [Chore] Remove unused variable in volcano scheduler (#4223) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> * [e2e] RayJob Auth Mode E2E (#4229) * [E2E] RayJob Auth Mode E2E Signed-off-by: seanlaii <qazwsx0939059006@gmail.com> * refactor * refactor --------- Signed-off-by: seanlaii <qazwsx0939059006@gmail.com> * Update README with additional resource links (#4230) * Update README with additional resource links Added links for KubeRay APIServer and Dashboard for more details. 
Signed-off-by: Jun-Hao Wan <ken89@kimo.com> * Update README.md Signed-off-by: Jun-Hao Wan <ken89@kimo.com> --------- Signed-off-by: Jun-Hao Wan <ken89@kimo.com> * introduce historyserver directory and project structure (#4232) Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * [RayJob] light weight job submitter upgrade to 1.5.1 to support auth token mode (#4235) Signed-off-by: Future-Outlier <eric901201@gmail.com> * Add example in GKE to enable Ray resource isolation using cgroupsv2 and writable cgroup containers (#4236) * Add example in GKE to enable Ray resource isolation using cgroupsv2 and writable cgroup containers Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * use lower resource requests Signed-off-by: Andrew Sy Kim <andrewsy@google.com> --------- Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * add sample that uses --system-reserved-cpu and --system-reserved-memory (#4237) Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * [e2e] Enhance RayCluster Auth E2E (#4231) * [e2e] Enhance RayCluster Auth E2E Signed-off-by: seanlaii <qazwsx0939059006@gmail.com> * fix test * fix test * fix test * fix test * fix test --------- Signed-off-by: seanlaii <qazwsx0939059006@gmail.com> * enhancement: Update docker base image. (#4193) * [RayService] Directly fail CR if is invalid (#4228) * [RayService] Directly fail CR if is invalid Signed-off-by: win5923 <ken89@kimo.com> * nit: set the name with strings.Repeat(a, 48) Signed-off-by: win5923 <ken89@kimo.com> --------- Signed-off-by: win5923 <ken89@kimo.com> * [Chore] Upgrade operator version in test-sample-yamls (#4248) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> * feat: add allow method to api server when allow cors (#4259) Signed-off-by: Cheyu Wu <cheyu1220@gmail.com> * Bump urllib3 from 2.5.0 to 2.6.0 in /clients/python-client (#4260) Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.5.0 to 2.6.0. 
- [Release notes](https://github.com/urllib3/urllib3/releases) - [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst) - [Commits](https://github.com/urllib3/urllib3/compare/2.5.0...2.6.0) --- updated-dependencies: - dependency-name: urllib3 dependency-version: 2.6.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Bump next from 15.2.4 to 15.4.8 in /dashboard (#4254) * Bump next from 15.2.4 to 15.4.8 in /dashboard Bumps [next](https://github.com/vercel/next.js) from 15.2.4 to 15.4.8. - [Release notes](https://github.com/vercel/next.js/releases) - [Changelog](https://github.com/vercel/next.js/blob/canary/release.js) - [Commits](https://github.com/vercel/next.js/compare/v15.2.4...v15.4.8) --- updated-dependencies: - dependency-name: next dependency-version: 15.4.8 dependency-type: direct:production ... Signed-off-by: dependabot[bot] <support@github.com> * fix dep issue Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [Docs] Upgrade kind base image to v1.26.0 (#4252) * docs: Upgrade kind base image to 1.26 for dev Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * chore: remove unuse file (#4247) Signed-off-by: Cheyu Wu <cheyu1220@gmail.com> * feat: Add runtimeClassName support for head and worker Pods (#4184) * feat: Add runtimeClassName support for head and worker Pods * fix: pre-commit linting errors * chore: Update values.yaml * 
[RayService] Migrate from Endpoints API to EndpointSlice API for RayService (#4245) * Migrate from Endpoints API to EndpointSlice API for RayService Signed-off-by: seanlaii <qazwsx0939059006@gmail.com> * trigger test * add back endpoints rule for backward compatibility * add comment * fix comment * de-duplicate endpoint based on pod uid * address comment * change TODO message * trigger test * remove endpoint RBAC * move comment * change logging level --------- Signed-off-by: seanlaii <qazwsx0939059006@gmail.com> * fix: hardening kuberay operator security context (#4243) Signed-off-by: lilylinh <lhacaoth@redhat.com> * [CI] Upgrade operator version from v1.4.2 to v1.5.1 (#4261) * chore: Bump operator ver to v1.5.1 Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Modify prev version to v1.4.2 Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * Apply suggestions from code review Signed-off-by: Rueian <rueiancsie@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> * [RayService] auth token mode e2e test (#4225) * add ray service auth test Signed-off-by: Ryan <ryan980053@gmail.com> * try to fix the error Signed-off-by: Ryan <ryan980053@gmail.com> * add worker group Signed-off-by: Ryan <ryan980053@gmail.com> * Adjust resource requests and limits in tests Signed-off-by: Ryan Huang <ryankert01@gmail.com> * Simplify RayService auth test by removing worker group Removed worker group spec and related verification for auth token propagation in RayService tests. Signed-off-by: Ryan Huang <ryankert01@gmail.com> * Update rayservice_auth_test.go Signed-off-by: Ryan Huang <ryankert01@gmail.com> * Refactor TestRayServiceAuthToken for clarity Refactor test for RayService authentication to improve clarity and maintainability. 
Signed-off-by: Ryan Huang <ryankert01@gmail.com> * Update ray-operator/test/e2erayservice/rayservice_auth_test.go Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Signed-off-by: Ryan Huang <ryankert01@gmail.com> * Update ray-operator/test/e2erayservice/rayservice_auth_test.go Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Ryan Huang <ryankert01@gmail.com> * address comments Signed-off-by: ryankert01 <ryan980053@gmail.com> * pre-commit check Signed-off-by: ryankert01 <ryan980053@gmail.com> * update test Signed-off-by: Future-Outlier <eric901201@gmail.com> * revert my update Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Ryan <ryan980053@gmail.com> Signed-off-by: Ryan Huang <ryankert01@gmail.com> Signed-off-by: ryankert01 <ryan980053@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Co-authored-by: Jun-Hao Wan <ken89@kimo.com> * [PodPool-VK] add podpool vk README (#4250) (#4251) * [PodPool-VK] add podpool vk README (#4250) * Fix lint Signed-off-by: Rueian <rueiancsie@gmail.com> * Update podpool-vk/README.md Signed-off-by: Rueian <rueiancsie@gmail.com> * fix lint Signed-off-by: Rueian <rueiancsie@gmail.com> * rename podpool-vk to podpool --------- Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> * Bump next from 15.4.8 to 15.4.9 in /dashboard (#4264) Bumps [next](https://github.com/vercel/next.js) from 15.4.8 to 15.4.9. - [Release notes](https://github.com/vercel/next.js/releases) - [Changelog](https://github.com/vercel/next.js/blob/canary/release.js) - [Commits](https://github.com/vercel/next.js/compare/v15.4.8...v15.4.9) --- updated-dependencies: - dependency-name: next dependency-version: 15.4.9 dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [Autoscaler] Add validation to require RayCluster v2 when using idleTimeoutSeconds (#4162) * add validation for idleTimeoutSeconds config per worker groups Signed-off-by: alimaazamat <alima.azamat2003@gmail.com> * Check spec version then fall back to env var Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Alima Azamat <92766804+alimaazamat@users.noreply.github.com> Signed-off-by: alimaazamat <alima.azamat2003@gmail.com> --------- Signed-off-by: alimaazamat <alima.azamat2003@gmail.com> Signed-off-by: Alima Azamat <92766804+alimaazamat@users.noreply.github.com> Co-authored-by: Jun-Hao Wan <ken89@kimo.com> * Bump next from 15.4.9 to 15.4.10 in /dashboard (#4266) Bumps [next](https://github.com/vercel/next.js) from 15.4.9 to 15.4.10. - [Release notes](https://github.com/vercel/next.js/releases) - [Changelog](https://github.com/vercel/next.js/blob/canary/release.js) - [Commits](https://github.com/vercel/next.js/compare/v15.4.9...v15.4.10) --- updated-dependencies: - dependency-name: next dependency-version: 15.4.10 dependency-type: direct:production ... 
Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Feature/kubectl plugin/improve support for autoscaling clusters 3832 (#4146) * [kubectl-plugin] Support scaling min/max replicas in scale cluster command (#3832) Signed-off-by: AndySung320 <andysung0320@gmail.com> * add examples Signed-off-by: AndySung320 <andysung0320@gmail.com> * test(e2e): add scale and get workergroups e2e tests - Add getWorkerGroupValues helper function to support.go - Add e2e tests for 'kubectl ray scale cluster' command - Add e2e tests for 'kubectl ray get workergroups' command Signed-off-by: AndySung320 <andysung0320@gmail.com> * refactor(scale): simplify update logic for min/max/replica Refactor the update logic for minReplicas, maxReplicas, and replicas to use the final* variables directly within their respective blocks. Signed-off-by: AndySung320 <andysung0320@gmail.com> * ci: retry Signed-off-by: AndySung320 <andysung0320@gmail.com> * document default minReplicas value and use explicit numeric maxReplicas Signed-off-by: AndySung320 <andysung0320@gmail.com> * cmd/scale: improve wording and extend test coverage Signed-off-by: AndySung320 <andysung0320@gmail.com> * e2e: add error check with Expect for kubectl commands in support.go Signed-off-by: AndySung320 <andysung0320@gmail.com> --------- Signed-off-by: AndySung320 <andysung0320@gmail.com> * Revert "[Test][Autoscaler] deflaky unexpected dead actors in tests by setting max_restarts=-1 (#3700)" (#4271) This reverts commit c75997ac83b5f04669f98af2bdbb7b932f9e9a1a. 
* [Autoscaler] validate idleTimeoutSeconds for AutoscalerOptions (#4267) * validate idleTimeoutSeconds for workergroup spec and autoscaler options Signed-off-by: alimaazamat <alima.azamat2003@gmail.com> * remove Autoscaler Options requiring V2 autoscaler Signed-off-by: alimaazamat <alima.azamat2003@gmail.com> --------- Signed-off-by: alimaazamat <alima.azamat2003@gmail.com> * Bump glob from 10.4.5 to 10.5.0 in /dashboard (#4207) Bumps [glob](https://github.com/isaacs/node-glob) from 10.4.5 to 10.5.0. - [Changelog](https://github.com/isaacs/node-glob/blob/main/changelog.md) - [Commits](https://github.com/isaacs/node-glob/compare/v10.4.5...v10.5.0) --- updated-dependencies: - dependency-name: glob dependency-version: 10.5.0 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Add Helm values for ResourceClaims to RayCluster (#4290) * [Feature] Support JobDeploymentStatus as the deletion condition (#4262) * feat: Support JobDeploymentStatus as the deletion condition Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * chore: Regenerate utility codes Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Update api docs Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix(test): Change JobStatus of the deletion condition from val to ptr Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Add JobDeploymentStatus-based e2e tests with four deletion policies Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Add validation tests for JobDeploymentStatus-based deletion rules Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Sync CRD yaml files into helm chart Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Support JobDeploymentStatus as deletion condition Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Add a helper to check rule match Signed-off-by: 
JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Complete TTLSeconds description Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Keep validation logic aligned with kubebuilder Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com> * refactor: Write helper for validating deletion condition Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Simplify logic for assigning an empty map to track TTL by policy Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Simplify deletion condition matching logic Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Make deletion rule uniqueness check comment more clear Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Use explicit string type to handle both conditions Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Shorten TTL to speed up e2e test Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * Revert "test: Shorten TTL to speed up e2e test" This reverts commit 0588f356bd7479b1d66eb4e53a57b525f747b12e. We need to pass consistency checks for resource preservation. 
Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> Signed-off-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com> Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> * [Feat] Add Ray Cron Job (#4159) * feat: init RayCronJob controller and add CRs * feat: update status * feat: add validate * feat: add update status function * refactor: udpate LastScheduleTime type * feat: implement logic for diff ScheduleStatus * build: regen CRD * feat: correctly create reconciler and add loggings * feat: check if it's time for schedule new rayjob * feat: add raycronjob example yaml * fix: remove StatusScheduled * build: make sync * refactor: move validate to validation.go * test: add validation and raycronjob unit test * feat: add feature gate * test: update test * fix: update helm chart rules * fix: add OwnerReference from cronjob to rayjob * fix: field order for RayCronJob struct Signed-off-by: machichima <nary12321@gmail.com> * fix: remove schedule status Signed-off-by: machichima <nary12321@gmail.com> * build: make generate Signed-off-by: machichima <nary12321@gmail.com> * fix: update example yaml to use ray 2.52.0 image Signed-off-by: machichima <nary12321@gmail.com> * fix: no need to update status for validate fail Signed-off-by: machichima <nary12321@gmail.com> * test: use rayCronJobTemplate func to create rayCronJob in test Signed-off-by: machichima <nary12321@gmail.com> * build: add kubebuilder rbac config Signed-off-by: machichima <nary12321@gmail.com> * feat: extract ray cron job name to constant Signed-off-by: machichima <nary12321@gmail.com> * fix: update SetupWithManager Signed-off-by: machichima <nary12321@gmail.com> * fix: JobTemplate to normal object Signed-off-by: machichima <nary12321@gmail.com> * feat: add cron job origin expected timestamp annotation Signed-off-by: machichima <nary12321@gmail.com> * fix: set LastScheduleTime only when job is created 
Signed-off-by: machichima <nary12321@gmail.com> * docs: update comment in example Signed-off-by: machichima <nary12321@gmail.com> --------- Signed-off-by: machichima <nary12321@gmail.com> * [Chore] Upgrade golangci-lint to v2.7.2 and adjust linting configurations (#4007) * Upgrade golangci-lint to v2.4.0 and adjust linting configurations Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> * disable linters and formatters * fix lint * fix makefile * fix makefile * fix config * update install link * add comment --------- Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> * [chore] fix cronjob crd inconsistent (#4292) Signed-off-by: Rueian <rueiancsie@gmail.com> * docs: Show missing phony targets and align styles (#4295) Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * [RayCluster] Improved the efficiency when checking rayclusters' expectations (#4209) * add the implementation of historyserver collector (#4241) * add the implementation of historyserver collector update go.work go.mod Signed-off-by: KunWuLuan <kunwuluan@gmail.com> * update the func judging if the event is releated to the Nodes. 
Signed-off-by: KunWuLuan <kunwuluan@gmail.com> * S3FORCE_PATH_STYPE -> S3FORCE_PATH_STYLE Signed-off-by: Future-Outlier <eric901201@gmail.com> * S3DISABLE_SSL -> s3DisableSSL (camel case) Signed-off-by: Future-Outlier <eric901201@gmail.com> * Add comments to explain WatchSessionLatestLoops Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: KunWuLuan <kunwuluan@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [history server] Remove go.work and go.work.sum to follow Go's best practices (#4301) Signed-off-by: Future-Outlier <eric901201@gmail.com> * Clean up unused label for volcano scheduler (#4305) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> * fix: Return upon update error for active and pending clusters (#4273) * fix: Propagate pending cluster update err back to caller Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Return on err logic Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * Update head and worker pod resources in sample manifests (#4288) * Update head and worker pod resources in sample manifests Signed-off-by: Yi Chen <github@chenyicn.net> * Update kubectl plugin e2e tests Signed-off-by: Yi Chen <github@chenyicn.net> --------- Signed-off-by: Yi Chen <github@chenyicn.net> * [Chore] Upgrade Golang version to v1.25 (#4269) * chore: Bump golang version to v1.25 Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * chore: Bump crd-ref-docs to v0.2.0 for Go v1.25 Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * chore: Bump Go to v1.25.5 Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Remove patch ver for flexibility Signed-off-by: JiangJiaWei1103 
<waynechuang97@gmail.com> * chore: Switch to floating tag for building images Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: upgrade Ray image in ray-cluster.auth.yaml to 2.53.0 to resolve dashboard 'Failed to load' error (#4310) Signed-off-by: win5923 <ken89@kimo.com> * Fix testifylint and gci lint issues (#4293) Signed-off-by: seanlaii <qazwsx0939059006@gmail.com> * [Chore] Fix gosec, govet and errcheck lint issues (#4309) Signed-off-by: win5923 <ken89@kimo.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> * [Docs] Add history server collector setup doc (#4303) * docs: Add history server log collector setup guide Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Update fig links Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Add minio and raycluster yamls Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Remove eventserver dependencies and correct s3 env var Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Udpate PR target Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Remove platform options for local build Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * revert: Make this PR focused on Collector setup guide only Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Support collector-only setup Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Verify events are uploaded to the blob storage Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Fix fig link Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> Signed-off-by: Future-Outlier 
<eric901201@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [Chore] Enable modernize linter (#4317) Signed-off-by: seanlaii <qazwsx0939059006@gmail.com> * [chore] Fix errorlint lint issues (#4306) Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> Signed-off-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Co-authored-by: Rueian <rueiancsie@gmail.com> * [Feat] Cron job add suspend (#4313) * feat: add Suspend to raycronjob type Signed-off-by: machichima <nary12321@gmail.com> * feat: add suspend Signed-off-by: machichima <nary12321@gmail.com> * docs: update example Signed-off-by: machichima <nary12321@gmail.com> * build: make sync Signed-off-by: machichima <nary12321@gmail.com> * refactor: event type name to SuspendedRayCronJob Signed-off-by: machichima <nary12321@gmail.com> * refactor: add back omitempty Signed-off-by: machichima <nary12321@gmail.com> * fix: precommit Signed-off-by: machichima <nary12321@gmail.com> * better test Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix: log suspend log once only Signed-off-by: machichima <nary12321@gmail.com> --------- Signed-off-by: machichima <nary12321@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [Feature] Support recreate pods for RayCluster using RayClusterSpec.upgradeStrategy (#4185) * [Feature] Support recreate pods for RayCluster using RayClusterSpec Signed-off-by: win5923 <ken89@kimo.com> * Add test Signed-off-by: win5923 <ken89@kimo.com> * improve readability Signed-off-by: win5923 <ken89@kimo.com> * Remove deepcopy in GeneratePodTemplateHash Signed-off-by: win5923 <ken89@kimo.com> * Refactor ValidateRayClusterUpgradeOptions Signed-off-by: win5923 <ken89@kimo.com> * add kubebuilder:validation Signed-off-by: win5923 <ken89@kimo.com> * Rename the RayServiceUpgradeType and RayClusterUpgradeType constants Signed-off-by: win5923 
<ken89@kimo.com> * add ray.io/kuberay-version annotations for head pod and worker pods Signed-off-by: win5923 <ken89@kimo.com> * Update ray-operator/controllers/ray/common/pod.go Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: Jun-Hao Wan <ken89@kimo.com> * Revert "add ray.io/kuberay-version annotations for head pod and worker pods" This reverts commit 5f3afb37724896ee2ae13399ab3d48d26fb6719f. * add rayClusterScaleExpectation.Delete for deleteAllPods Signed-off-by: win5923 <ken89@kimo.com> * Apply suggestions Signed-off-by: win5923 <ken89@kimo.com> * better logic Signed-off-by: Future-Outlier <eric901201@gmail.com> * solve ci err Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * better yaml file Signed-off-by: Future-Outlier <eric901201@gmail.com> * Commented out upgradeStrategy for sample yaml Signed-off-by: win5923 <ken89@kimo.com> * Update container image for TestRayClusterUpgradeStrategy test Signed-off-by: win5923 <ken89@kimo.com> * Compare RayClusterSpec Signed-off-by: win5923 <ken89@kimo.com> * Remove WorkerGroupSpecs.IdleTimeoutSeconds and Suspend to follow RayService's solution Signed-off-by: win5923 <ken89@kimo.com> * Follow RayService's solution Signed-off-by: win5923 <ken89@kimo.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> * update the head pod to get the cluster hash and new KubeRay version when KubeRay version changed Signed-off-by: win5923 <ken89@kimo.com> * Use UpgradeStrategyRecreateHashKey annotations for RayCluster upgradeStrategy Signed-off-by: win5923 <ken89@kimo.com> --------- Signed-off-by: win5923 <ken89@kimo.com> Signed-off-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [Bug] Fix health probes to use custom ports from rayStartParams (#4041) * fix 
* add a new "test" for gofumpt * init * add pod test * add it test * checkstyle * Update ray-operator/controllers/ray/common/pod_test.go Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Itami Sho <42286868+MiniSho@users.noreply.github.com> * remove unnceccary code Signed-off-by: Future-Outlier <eric901201@gmail.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Itami Sho <42286868+MiniSho@users.noreply.github.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [Refactor] Remove duplicate function in e2eautoscaler/support.go and e2erayservice/support.go by reusing test/support/support.go implementations to improve maintainability and reduce redundancy. Related to #3932 (#4038) Signed-off-by: HSIU-CHI LIU (Tomlord) <aa123593465@gmail.com> Signed-off-by: Hsiu-Chi Liu (Tomlord) <79390871+Tomlord1122@users.noreply.github.com> * [Test] [history server] [collector] Add collector e2e tests (#4308) * docs: Add history server log collector setup guide Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Update fig links Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Add minio and raycluster yamls Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Remove eventserver dependencies and correct s3 env var Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Udpate PR target Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Add the log collector happy path e2e Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Integrate history server log collector to CI Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Add script comment Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Remove hardcoded consts and add a helper for s3 client Signed-off-by: 
JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Align function name Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Ensure test isolation by deleting S3 bucket Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Upload logs during runtime Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Increase ray job timeout to avoid CI flakiness Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Increase timeout for readiness of MinIO and Ray cluster\ to avoid CI flakiness Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Refetch head pod to avoid CI flakiness Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * revert: Add eventserver back and recover Dockerfile and Makefile Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Extract apply Ray job to cluster as helper Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Check logs and node_events are uploaded on del Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * revert: Add back blank test to prevent conflicts Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Refine comments to clarify intention Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Add prepareTestEnv helper fn Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Use eventually to avoid CI flakiness Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Enable multi assertions Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Cleanup test assertion logic and reuse existing utils Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Complete all func doc string Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Test logs key exists during runtime Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Extract s3 session dir check as a helper for reusability Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * revert: 
Toleration is sufficient for programmatic data movement Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Tolerate pod exec err Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * revert: Debug Pod or container restarts Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Add missing return Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Debug residual state from the first test Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * Revert "test: Debug residual state from the first test" This reverts commit 5b820fffea7c7ec3d2aca946ec4d433f2846c914. * test: Separate ns for subtest Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Use existing utils Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Cleanup debug legacy Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Verify logs and node_events have contents Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Exec pod cmd before cluster is deleted Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * add TODO Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix: Keep port-forward for accessing S3 outside the cluster Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Simplify get session ID logic Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Move logs by ray-head container startup cmd Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Complete comments Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Check logs and events must exist Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Clarify test logic Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * fix: Seperate Test obj Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * refactor: Wrap subtests in a loop Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Clarify the 
intention of kill 1 command Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Further clarify the end goal of forcing OOMKilled Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Remove redundant cleanup Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * refactor: Clear S3 session verification logic flow Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Check old session dir exists in prev-logs and persit-complete-logs Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Make test case description clear Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * better name for test 2 Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [history server][collector] Remove unused function processAllLogs (#4316) * remove also * remove processPrevLogsOnShutdown Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [Feat][kubectl-plugin] Add shell completion for for kubectl ray get [workergroups|nodes] (#4291) * [kubectl-plugin][WIP] Add shell completion for and Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] Skip resource fetching for shell completion with --all-namespaces Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] Remove redundant namespace check in WorkerGroupCompletionFunc Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] Improve comments in shell completion functions Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Refactor][Test] Refactor WorkerGroupCompletionFunc and NodeCompletionFunc to accept client.Client parameter for testability. Add initial unit tests. 
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Test] Add unit tests for completion function edge cases Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix][kubectl-plugin] Remove the redundant namespace check Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> * [kubectl-plugin] Improve completion with FieldSelector filtering and workergroup deduplication Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [kubectl-plugin][Refactor] Use labels.Set for label selector formatting Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Chore] fix typo in the help text for all-namesapces flag Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Chore][kubectl-plugin] Fix struct field alignment in tests Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> --------- Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * [Chore] Fix staticcheck lint errors (#4326) Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com> * chore: Bump up KuberayUpgradeVersion default version for e2e test (#4331) Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * [Config] Change all RayCluster headGroupSpec limit memory to 5Gi (#4328) * chore: adjust memory < 5Gi limit in sample files Signed-off-by: Cheyu Wu <cheyu1220@gmail.com> * chore: align config format Signed-off-by: CheyuWu <cheyu1220@gmail.com> * chore: set memory to 5Gi Signed-off-by: CheyuWu <cheyu1220@gmail.com> --------- Signed-off-by: Cheyu Wu <cheyu1220@gmail.com> Signed-off-by: CheyuWu <cheyu1220@gmail.com> * [Chore] Fix noctx, revive lint issues (#4333) * [Chore] Fix noctx linter violations across codebase Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Chore][revive][1/N] Fix revive linter violations across codebase, fix var-naming don't use underscores in Go names & avoid meaningless 
package names Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Chore][revive][2/N] Fix revive linter violations across codebase, fix unexported-return issues Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Chore] Trigger CI Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> --------- Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [historyserver] Ensure at least one worker in sample RayCluster (#4330) * [historyserver] Ensure at least one worker in sample RayCluster Signed-off-by: Future-Outlier <eric901201@gmail.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> * Update historyserver/config/raycluster.yaml Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> Signed-off-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Co-authored-by: Jun-Hao Wan <ken89@kimo.com> * docs: Clarify multi-arch phony comments (#4311) Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * historyserver: remove unused function in RayLogHandler (#4336) Signed-off-by: AndySung320 <andysung0320@gmail.com> * [history server][collector] Fix getJobID for job event collection (#4342) * [historyserver] Fix getJobID for job event collection Signed-off-by: Future-Outlier <eric901201@gmail.com> * add jia-wei as co-author, since he debug with me together Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Jia-Wei Jiang <waynechuang97@gmail.com> * remove unused code Signed-off-by: Future-Outlier <eric901201@gmail.com> * update rueian's advice Signed-off-by: Future-Outlier <eric901201@gmail.com> * add task profile event example Signed-off-by: Future-Outlier <eric901201@gmail.com> * revert back oneof solution Signed-off-by: Future-Outlier <eric901201@gmail.com> * add task profile event Signed-off-by: Future-Outlier <eric901201@gmail.com> * update rueian's advice Signed-off-by: Future-Outlier 
<eric901201@gmail.com> * a worked version in ray 2.52.0 Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Jia-Wei Jiang <waynechuang97@gmail.com> * chore: Use double quoted resource values in sample manifest files. (#4339) * [history server] move storage interface (#4302) * historyserver: move storage and update imports Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * vet & fmt. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * chroe: update code structure and move storage to interface Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> --------- Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * background goroutine get job info (#4160) * [RayJob] background job info poc * [RayJob] add implement some methods * [RayJob] encapsulate the worker pool * [RayJob] replace concurrency map with lru cache * [RayJob] remove cache on stop and config flag * [RayJob] expiry cache cleanup goroutine Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] code and comment minor fix Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] task check contain or not befor add Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove delete cache from deleteClusterResources and add lock for cache Signed-off-by: fscnick <fscnick.dev@gmail.com> * [Helm] add argument for useBackgroundGoroutine Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] repeated error did not update Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove unused function and background goroutine observability Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] cache client support graceful shutdown Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] rename useBackgroundGoroutine to asyncJobInfoQuery Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] use ray job info in logger Signed-off-by: fscnick <fscnick.dev@gmail.com> * 
[RayJob] remove cacheStorage nil check Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] bg goroutine uses operator context instead Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] bg goroutine handle task queue full Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] correct the comment Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] refactor initialize dashboard client for background goroutine Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] worker handle ctx.Done correctly Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove unnecessary putting task into queue * [RayJob] if queue is full, retry again * [RayJob] make cache immutable to avoid data race Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove unused function Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove cacheStorage lock Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] update cache error Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] If error on fetching job info, it removes from loop Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] task queue is extendable Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] change slice to ring buffer Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] rename PutTask to AddTask * [RayJob] extendable channel use open source library Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] async job info query use feature gate instead Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] add comment for task Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] rename function signature of worker pool init function Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] change ErrAgain error message Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] fix lint error Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] change back to EAGAIN Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove queue size from todo 
comment Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] rename queue full error Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] add lock to avoid data race Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] requeue check context has canceled or not Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] add cluster name on the cache key Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] check raycluster is nil or not when initializing the dashboard client Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] avoid send to a block channel when graceful shutdown Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] use contain to check the placeholder at the beginning of task Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] graceful shutdown avoid panic from a nil task * [RayJob] fix channel receive condition Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] fix nil rayCluster in dashboard cache client Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove with name from log for sharing purpose Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove checkname to avoid collision Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] add task with blocking send Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] remove unused error Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayJob] provide raycluster name if it is absent for removing cache Signed-off-by: fscnick <fscnick.dev@gmail.com> --------- Signed-off-by: fscnick <fscnick.dev@gmail.com> * [kubectl-plugin][Test] Use client-go reactors for FieldSelector filtering in fake client tests (#4361) * feat(kubectl-plugin): add FieldSelector reactor helper for fake client tests Add AddRayClusterFieldSelectorReactor helper that simulates server-side FieldSelector filtering in fake client tests. This addresses issue #4337 by allowing tests to verify filtering behavior without manual name checks. 
Refs: #4337 * test(kubectl-plugin): apply FieldSelector reactor to completion tests Use the new AddRayClusterFieldSelectorReactor in workergroup completion tests to properly simulate server-side filtering behavior. Refs: #4337 * refactor(kubectl-plugin): remove manual name filtering in completion Remove the workaround that manually filtered clusters by name since the fake client now properly supports FieldSelector filtering via reactors. Refs: #4337 * chore(kubectl-plugin): migrate from NewSimpleClientset to NewClientset NewSimpleClientset is deprecated in favor of NewClientset for better server-side apply testing support. Note: scale_cluster_test.go is not migrated because it uses Update operations that require schema definitions missing from the generated applyconfiguration internal schema. Refs: #4337 * refactor(kubectl-plugin): add NewRayClientset wrapper for simpler test setup Add a convenience wrapper that creates a fake Ray clientset with FieldSelector reactor pre-configured. This simplifies test setup and ensures consistent behavior across tests. Also applies reactor to get_cluster_test.go and fixes test data to match actual cluster names now that FieldSelector properly filters. Refs: #4337 * Update kubectl-plugin/pkg/util/client/testing/reactor.go Co-authored-by: JustinYeh <justinyeh1995@gmail.com> Signed-off-by: Ikenna <ikennachifo@gmail.com> * fix(kubectl-plugin): update references to renamed reactor function Update clientset.go and reactor_test.go to use the renamed function AddRayClusterListFieldSelectorReactor. --------- Signed-off-by: Ikenna <ikennachifo@gmail.com> Co-authored-by: JustinYeh <justinyeh1995@gmail.com> * Support Multi-Arch Image in CI (#4348) * Add KuberayTestArch environment variable for architecture override in tests This commit introduces a new environment variable KuberayTestArch that allows overriding the detected architecture in test environments. 
Previously, the system only relied on runtime.GOARCH to determine if ARM64 architecture was being used, but this change enables explicit architecture specification through the environment variable. This is particularly useful for testing scenarios where you want to force a specific architecture regardless of the actual runtime environment, improving test flexibility and consistency across different platforms. Signed-off-by: KunWuLuan <kunwuluan@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: KunWuLuan <kunwuluan@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * generate clientset with 1.35 code-generator (#4347) * generate clientset with 1.35 code-generator Signed-off-by: KunWuLuan <kunwuluan@gmail.com> * Update codegen script to use go list for k8s.io/code-generator path resolution Signed-off-by: KunWuLuan <kunwuluan@gmail.com> * Run make sync --------- Signed-off-by: KunWuLuan <kunwuluan@gmail.com> * [master] Fix Ray CI integration for release automation (#4370) * push Signed-off-by: Future-Outlier <eric901201@gmail.com> * tesst Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * test Signed-off-by: Future-Outlier <eric901201@gmail.com> * finally all good Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> * [history server] Web Server + Event Processor (#4329) * Add event server for history server. 
Co-authored-by: chiayi chiayiliang327@gmail.com Co-authored-by: KunWuLuan kunwuluan@gmail.com * Update test * [history server] Web Server Signed-off-by: Future-Outlier <eric901201@gmail.com> * add Kun Wu's setting Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: KunWuLuan <kunwuluan@gmail.com> * a worked version Signed-off-by: Future-Outlier <eric901201@gmail.com> * a worked version, will revise it Signed-off-by: Future-Outlier <eric901201@gmail.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> * merge master Signed-off-by: Future-Outlier <eric901201@gmail.com> * turn chinese comments to english Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix bugs and make dead cluster endpoint work or return not yet supported Signed-off-by: Future-Outlier <eric901201@gmail.com> * support task summarize, not yet test live cluster Signed-off-by: Future-Outlier <eric901201@gmail.com> * support predicate Signed-off-by: Future-Outlier <eric901201@gmail.com> * remove license Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix Stop signal ignored during hour-long sleep period Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix Main exits without waiting for graceful shutdown Signed-off-by: Future-Outlier <eric901201@gmail.com> * remove log key info Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix Graceful shutdown incorrectly treated as fatal error Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix Event processor failure causes event processing to block Signed-off-by: Future-Outlier <eric901201@gmail.com> * Fix Task update discards all fields except attempt number, but this is short term fix, we should use list Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix max clusters default 0 problem, and add todo Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix Missing cookie path causes repeated Kubernetes API calls Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix task list problems 
Signed-off-by: Future-Outlier <eric901201@gmail.com> * add actor json tag Signed-off-by: Future-Outlier <eric901201@gmail.com> * handle task lifecycle event, need to update to binary search Signed-off-by: Future-Outlier <eric901201@gmail.com> * change upsert to merge Signed-off-by: Future-Outlier <eric901201@gmail.com> * handle task and actor endpoint better, make them complete Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix SSRF via user-controlled service name cookie Signed-off-by: Future-Outlier <eric901201@gmail.com> * actor and task need to solve Duplicate events appended on each hourly reprocessing cycle Signed-off-by: Future-Outlier <eric901201@gmail.com> * solve Duplicate events appended on each hourly reprocessing cycle Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix Unchecked type assertions can cause panics Signed-off-by: Future-Outlier <eric901201@gmail.com> * HTTP proxy requests lack timeout causing potential hangs Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix Nil map panic when processing null event entries Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix Environment variable bypasses SSRF protection for live cluster proxying Signed-off-by: Future-Outlier <eric901201@gmail.com> * support required resources and server timeout error Signed-off-by: Future-Outlier <eric901201@gmail.com> * better serviceaccount Signed-off-by: Future-Outlier <eric901201@gmail.com> * Add Readme Signed-off-by: Future-Outlier <eric901201@gmail.com> * better comments for log dir path Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix race condition Signed-off-by: Future-Outlier <eric901201@gmail.com> * better const explaination for seperator connector Signed-off-by: Future-Outlier <eric901201@gmail.com> * 1 better actor response; 2 cleanup dead code Signed-off-by: Future-Outlier <eric901201@gmail.com> * remove dead code Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier 
<eric901201@gmail.com> * fix comments Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Aaron Liang <aaronliang@google.com> Co-authored-by: KunWuLuan <kunwuluan@gmail.com> * [Bug][RayJob] Sidecar mode shouldn't restart head pod when head pod is deleted (#4234) * [Bug][RayJob] Sidecar mode shouldn't restart head pod when head pod is deleted Signed-off-by: 400Ping <fourhundredping@gmail.com> * [fix] fix CI error Signed-off-by: 400Ping <fourhundredping@gmail.com> * update Signed-off-by: 400Ping <fourhundredping@gmail.com> * reunite if statement Signed-off-by: 400Ping <fourhundredping@gmail.com> * update Signed-off-by: 400Ping <fourhundredping@gmail.com> * fix ci error Signed-off-by: 400Ping <fourhundredping@gmail.com> * fix Signed-off-by: 400Ping <fourhundredping@gmail.com> * put back unnecessary comment deletion Signed-off-by: 400Ping <fourhundredping@gmail.com> * Better rayjob logic Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * Update ray-operator/test/e2erayjob/rayjob_test.go Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Signed-off-by: Ping <fourhundredping@gmail.com> * Update ray-operator/test/e2erayjob/rayjob_test.go Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: Ping <fourhundredping@gmail.com> * update rayjob test Signed-off-by: 400Ping <fourhundredping@gmail.com> * fix merge conflict error Signed-off-by: 400Ping <fourhundredping@gmail.com> * Update ray-operator/test/e2erayjob/rayjob_sidecar_mode_test.go Co-authored-by: fscnick <6858627+fscnick@users.noreply.github.com> Signed-off-by: Ping <fourhundredping@gmail.com> * update Signed-off-by: 400Ping <fourhundredping@gmail.com> * revert reason assertion Signed-off-by: 400Ping <fourhundredping@gmail.com> 
* [chore] retrigger ci * update Signed-off-by: 400Ping <fourhundredping@gmail.com> * [chore] change from HeadPod to GetHeadPod Signed-off-by: 400Ping <fourhundredping@gmail.com> * add submission mode label key label Signed-off-by: Future-Outlier <eric901201@gmail.com> * Update ray-operator/controllers/ray/utils/constant.go Co-authored-by: Rueian <rueiancsie@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> * Update ray-operator/controllers/ray/raycluster_controller.go Co-authored-by: Rueian <rueiancsie@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> * Update ray-operator/controllers/ray/raycluster_controller.go Co-authored-by: Rueian <rueiancsie@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> * Update ray-operator/controllers/ray/rayjob_controller.go Co-authored-by: Rueian <rueiancsie@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> * Update ray-operator/controllers/ray/rayjob_controller.go Co-authored-by: Rueian <rueiancsie@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> * Update ray-operator/controllers/ray/utils/constant.go Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: Ping <fourhundredping@gmail.com> * Update ray-operator/controllers/ray/rayjob_controller.go Co-authored-by: Rueian <rueiancsie@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> * update Signed-off-by: 400Ping <fourhundredping@gmail.com> * Add missing label Signed-off-by: 400Ping <fourhundredping@gmail.com> * update Signed-off-by: 400Ping <fourhundredping@gmail.com> * update Signed-off-by: 400Ping <fourhundredping@gmail.com> --------- Signed-off-by: 400Ping <fourhundredping@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Co-authored-by: fscnick 
<6858627+fscnick@users.noreply.github.com> Co-authored-by: Rueian <rueiancsie@gmail.com> * [Refactor] [Test] Add helpers and use auto cleanup for testing the RayJob deletion strategy (#4363) * refactor: Extract helpers and separate ns for auto cleanup Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Add logs Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * docs: Add logs to make it easier to track test flow Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * test: Check Ray job is running Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * Change Ray/Kuberay Google Calendar and Kuberay Sync link (#4401) Signed-off-by: Future-Outlier <eric901201@gmail.com> * [historyserver][collector] Add file-level idempotency check for prev-logs processing on container restart (#4321) * feat(historyserver):re-push prev-logs on pod restart Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * chroe(historyserver): replace hard code path Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * fmt. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * test(historyserver): add test for logcollector restart Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * add e2e test for repush Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * fix e2e test. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * add Troubleshooting. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * rm redundant cleanup. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * reuse WatchPrevLogsLoops to scan existing logs. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * simulate partial upload in e2e test. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * fix unit test. 
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * fix lint Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * fix mv race condition in e2e test. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * Apply suggestion from @JiangJiaWei1103 Co-authored-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com> Signed-off-by: yi wang <48236141+my-vegetable-has-exploded@users.noreply.github.com> * address comments. Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * e2e test: add assertions and update description Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> * Better test Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> Signed-off-by: yi wang <48236141+my-vegetable-has-exploded@users.noreply.github.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [Docs] [history server] Create service account for history server deployment (#4396) * docs: Add creating sa Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> * better readme Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: JiangJiaWei1103 <waynechuang97@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Future-Outlier <eric901201@gmail.com> * [Docs][History Server] update instructions for live cluster section (#4408) * docs: update for live cluster section Signed-off-by: machichima <nary12321@gmail.com> * docs: clearer description Co-authored-by: 江家瑋 <36886416+JiangJiaWei1103@users.noreply.github.com> Signed-off-by: Nary Yeh <60069744+machichima@users.noreply.github.com> --------- Signed-off-by: machichima <nary12321@gmail.com> Signed-off-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Co-authored-by: 江家瑋 
<36886416+JiangJiaWei1103@users.noreply.github.com> * [Test] [history server] [collector] Ensure event type coverage (#4343) * [historyserver] Fix getJobID for job event collection Signed-off-by: Future-Outlier <eric901201@gmail.com> * add jia-wei as co-author, since he debug with me together Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Jia-Wei Jiang <waynechuang97@gmail.com> * remove unused code Signed-off-by: Future-Outlier <eric901201@gmail.com> * update rueian's advice Signed-off-by: Future-Outlier <eric901201@gmail.com> * add task profile event ex…
Why are these changes needed?
Fix CI failures.
Related issue number
Checks