Skip to content

[Serve] Fix sort bug and optimize ClusterNodeInfoCache.update()#60843

Merged
abrarsheikh merged 3 commits into
masterfrom
60680-abrar-gcs
Feb 9, 2026
Merged

[Serve] Fix sort bug and optimize ClusterNodeInfoCache.update()#60843
abrarsheikh merged 3 commits into
masterfrom
60680-abrar-gcs

Conversation

@abrarsheikh

Copy link
Copy Markdown
Contributor

ClusterNodeInfoCache.update() had three issues:

  1. Sort bug (correctness): sorted(alive_nodes) returns a new sorted list that was immediately discarded. The cached alive_nodes were stored in arbitrary dict-iteration order, violating the documented invariant of deterministic NodeID ordering. Fixed by using alive_nodes.sort().

  2. Redundant GCS RPC: ray._private.state.available_resources_per_node() goes through the legacy GlobalStateAccessor path, which opens a second connection to GCS and performs a full protobuf serialize/deserialize round-trip per node — every tick. Replaced with self._gcs_client.get_all_resource_usage() which reuses the existing GcsClient and extracts resources_available from the ResourcesData batch.

  3. Unconditional cache rebuild: Node labels and resources_total are static per-node properties that don't change while a node stays alive, yet they were rebuilt from scratch every tick. Added change detection via a frozenset of alive node IDs — labels and total resources are only rebuilt when cluster membership actually changes.

Related to #60680

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner February 8, 2026 03:14

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces several valuable improvements to ClusterNodeInfoCache.update(). It correctly fixes a sorting bug, and significantly optimizes performance by reducing redundant GCS RPCs and avoiding unconditional cache rebuilds. The implementation is clear and robust, with good error handling for GCS communication failures. I have one minor suggestion to make a part of the new logic more concise and Pythonic.

Comment thread python/ray/serve/_private/cluster_node_info_cache.py Outdated
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Feb 8, 2026
Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh merged commit 641d4e5 into master Feb 9, 2026
6 checks passed
@abrarsheikh abrarsheikh deleted the 60680-abrar-gcs branch February 9, 2026 17:55
ans9868 pushed a commit to ans9868/ray that referenced this pull request Feb 18, 2026
…project#60843)

`ClusterNodeInfoCache.update()` had three issues:

1. **Sort bug (correctness):** `sorted(alive_nodes)` returns a new
sorted list that was immediately discarded. The cached `alive_nodes`
were stored in arbitrary dict-iteration order, violating the documented
invariant of deterministic NodeID ordering. Fixed by using
`alive_nodes.sort()`.

2. **Redundant GCS RPC:**
`ray._private.state.available_resources_per_node()` goes through the
legacy `GlobalStateAccessor` path, which opens a second connection to
GCS and performs a full protobuf serialize/deserialize round-trip per
node — every tick. Replaced with
`self._gcs_client.get_all_resource_usage()` which reuses the existing
`GcsClient` and extracts `resources_available` from the `ResourcesData`
batch.

3. **Unconditional cache rebuild:** Node labels and `resources_total`
are static per-node properties that don't change while a node stays
alive, yet they were rebuilt from scratch every tick. Added change
detection via a `frozenset` of alive node IDs — labels and total
resources are only rebuilt when cluster membership actually changes.

Related to ray-project#60680

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: Adel Nour <ans9868@nyu.edu>
Aydin-ab pushed a commit to kunling-anyscale/ray that referenced this pull request Feb 20, 2026
…project#60843)

`ClusterNodeInfoCache.update()` had three issues:

1. **Sort bug (correctness):** `sorted(alive_nodes)` returns a new
sorted list that was immediately discarded. The cached `alive_nodes`
were stored in arbitrary dict-iteration order, violating the documented
invariant of deterministic NodeID ordering. Fixed by using
`alive_nodes.sort()`.

2. **Redundant GCS RPC:**
`ray._private.state.available_resources_per_node()` goes through the
legacy `GlobalStateAccessor` path, which opens a second connection to
GCS and performs a full protobuf serialize/deserialize round-trip per
node — every tick. Replaced with
`self._gcs_client.get_all_resource_usage()` which reuses the existing
`GcsClient` and extracts `resources_available` from the `ResourcesData`
batch.

3. **Unconditional cache rebuild:** Node labels and `resources_total`
are static per-node properties that don't change while a node stays
alive, yet they were rebuilt from scratch every tick. Added change
detection via a `frozenset` of alive node IDs — labels and total
resources are only rebuilt when cluster membership actually changes.

Related to ray-project#60680

---------

Signed-off-by: abrar <abrar@anyscale.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

2 participants