[Serve] Fix sort bug and optimize ClusterNodeInfoCache.update() by abrarsheikh · Pull Request #60843 · ray-project/ray

abrarsheikh · 2026-02-08T03:14:49Z

ClusterNodeInfoCache.update() had three issues:

Sort bug (correctness): sorted(alive_nodes) returns a new sorted list that was immediately discarded. The cached alive_nodes were stored in arbitrary dict-iteration order, violating the documented invariant of deterministic NodeID ordering. Fixed by using alive_nodes.sort().
Redundant GCS RPC: ray._private.state.available_resources_per_node() goes through the legacy GlobalStateAccessor path, which opens a second connection to GCS and performs a full protobuf serialize/deserialize round-trip per node — every tick. Replaced with self._gcs_client.get_all_resource_usage() which reuses the existing GcsClient and extracts resources_available from the ResourcesData batch.
Unconditional cache rebuild: Node labels and resources_total are static per-node properties that don't change while a node stays alive, yet they were rebuilt from scratch every tick. Added change detection via a frozenset of alive node IDs — labels and total resources are only rebuilt when cluster membership actually changes.

Related to #60680

Signed-off-by: abrar <abrar@anyscale.com>

gemini-code-assist

Code Review

This pull request introduces several valuable improvements to ClusterNodeInfoCache.update(). It correctly fixes a sorting bug, and significantly optimizes performance by reducing redundant GCS RPCs and avoiding unconditional cache rebuilds. The implementation is clear and robust, with good error handling for GCS communication failures. I have one minor suggestion to make a part of the new logic more concise and Pythonic.

Signed-off-by: abrar <abrar@anyscale.com>

…project#60843) `ClusterNodeInfoCache.update()` had three issues: 1. **Sort bug (correctness):** `sorted(alive_nodes)` returns a new sorted list that was immediately discarded. The cached `alive_nodes` were stored in arbitrary dict-iteration order, violating the documented invariant of deterministic NodeID ordering. Fixed by using `alive_nodes.sort()`. 2. **Redundant GCS RPC:** `ray._private.state.available_resources_per_node()` goes through the legacy `GlobalStateAccessor` path, which opens a second connection to GCS and performs a full protobuf serialize/deserialize round-trip per node — every tick. Replaced with `self._gcs_client.get_all_resource_usage()` which reuses the existing `GcsClient` and extracts `resources_available` from the `ResourcesData` batch. 3. **Unconditional cache rebuild:** Node labels and `resources_total` are static per-node properties that don't change while a node stays alive, yet they were rebuilt from scratch every tick. Added change detection via a `frozenset` of alive node IDs — labels and total resources are only rebuilt when cluster membership actually changes. Related to ray-project#60680 --------- Signed-off-by: abrar <abrar@anyscale.com> Signed-off-by: Adel Nour <ans9868@nyu.edu>

…project#60843) `ClusterNodeInfoCache.update()` had three issues: 1. **Sort bug (correctness):** `sorted(alive_nodes)` returns a new sorted list that was immediately discarded. The cached `alive_nodes` were stored in arbitrary dict-iteration order, violating the documented invariant of deterministic NodeID ordering. Fixed by using `alive_nodes.sort()`. 2. **Redundant GCS RPC:** `ray._private.state.available_resources_per_node()` goes through the legacy `GlobalStateAccessor` path, which opens a second connection to GCS and performs a full protobuf serialize/deserialize round-trip per node — every tick. Replaced with `self._gcs_client.get_all_resource_usage()` which reuses the existing `GcsClient` and extracts `resources_available` from the `ResourcesData` batch. 3. **Unconditional cache rebuild:** Node labels and `resources_total` are static per-node properties that don't change while a node stays alive, yet they were rebuilt from scratch every tick. Added change detection via a `frozenset` of alive node IDs — labels and total resources are only rebuilt when cluster membership actually changes. Related to ray-project#60680 --------- Signed-off-by: abrar <abrar@anyscale.com>

[Serve] Fix sort bug and optimize ClusterNodeInfoCache.update()

8eca934

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh requested a review from a team as a code owner February 8, 2026 03:14

gemini-code-assist Bot reviewed Feb 8, 2026

View reviewed changes

Comment thread python/ray/serve/_private/cluster_node_info_cache.py Outdated

remove doc

00c1e17

Signed-off-by: abrar <abrar@anyscale.com>

abrarsheikh added the go add ONLY when ready to merge, run all tests label Feb 8, 2026

use valrus op

06e8118

Signed-off-by: abrar <abrar@anyscale.com>

harshit-anyscale approved these changes Feb 9, 2026

View reviewed changes

abrarsheikh merged commit 641d4e5 into master Feb 9, 2026
6 checks passed

abrarsheikh deleted the 60680-abrar-gcs branch February 9, 2026 17:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Serve] Fix sort bug and optimize ClusterNodeInfoCache.update()#60843

[Serve] Fix sort bug and optimize ClusterNodeInfoCache.update()#60843
abrarsheikh merged 3 commits into
masterfrom
60680-abrar-gcs

abrarsheikh commented Feb 8, 2026

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Labels

2 participants

Uh oh!

Conversation

abrarsheikh commented Feb 8, 2026

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Labels

2 participants