[data] Support multiple datasets in a cluster (1/2): Pipe DataContext.ExecutionOptions.label_selector to task submissions by TimothySeah · Pull Request #63331 · ray-project/ray

TimothySeah · 2026-05-14T00:31:14Z

Summary

The end goal is to support running two Ray Data datasets in one cluster with subcluster label scheduling. To that end, I will create the following PR stack:

1) This PR: Allow users to set dataset-level label_selector through DataContext.execution_options.label_selector.
2) Followup PR: AutoscalingCoordinator _tick loop should respect subcluster labels when scaling up the cluster and allocating resources to datasets.

In the same way that the executor pulls scheduling_strategy from the DataContext before submitting certain Ray Data tasks (ray/python/ray/data/_internal/execution/operators/map_operator.py, line 470 in 1fd5697: if "scheduling_strategy" not in ray_remote_args:), after this PR, the executor will do the same for label_selector. Note that label_selector is in INHERITABLE_REMOTE_ARGS so operator fusion will respect it.
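
A minimal sketch of the defaulting pattern described above, assuming the remote args are a plain dict and the context values are passed in directly. The helper name apply_context_defaults is hypothetical and is not part of Ray Data; it only illustrates that a value set on the operator wins over the context-level default.

```python
from typing import Any, Dict, Optional


def apply_context_defaults(
    ray_remote_args: Dict[str, Any],
    scheduling_strategy: Optional[str],
    label_selector: Optional[Dict[str, str]],
) -> Dict[str, Any]:
    """Return a copy of ray_remote_args with context-level defaults filled in.

    Hypothetical helper: values set explicitly on the operator win, and an
    empty or missing context selector leaves the args untouched.
    """
    merged = dict(ray_remote_args)
    if "scheduling_strategy" not in merged and scheduling_strategy is not None:
        merged["scheduling_strategy"] = scheduling_strategy
    if "label_selector" not in merged and label_selector:
        # Copy so later edits to the context do not leak into submitted args.
        merged["label_selector"] = dict(label_selector)
    return merged
```

Returning a copy rather than mutating the input keeps the caller's dict reusable across task submissions.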

I piped the labels through the following types of callsites. Let me know if I'm missing any and/or if any are unnecessary/problematic.

Map operators. These already merge other values (like scheduling_strategy) from the Dataset or Operator to ray_remote_args.
Various operators (like LimitOperator and ZipOperator) that call .remote(). These places now pipe label_selector through.
Exchange schedulers (sort, repartition, aggregation, random_shuffle). These places now pipe label_selector through.
Planning-time tasks (file metadata fetch, parquet sampling)
Construction-time tasks (from_pandas/numpy/arrow)
Conversion-time tasks (to_pandas/numpy/arrow, block-num-rows)
Niche features (RandomAccess, checkpoint)

Note that I may have missed piping label_selector through other paths, but this is no worse than today's behavior and we can close these gaps when we encounter them. I intentionally did not close the following gaps in this PR:

_PushBasedShuffleStage uses NodeAffinitySchedulingStrategy instead of the dataset's label_selector. In practice this is fine since the blocks should already be in-subcluster from the preceding steps.
Multiple datasinks fan out nested .remote() calls from within an already-placed write task. Fixing this may require reading the DataContext from inside the task.

Testing

Unit tests. I will do more e2e testing in the followup PR. I unit tested some but not all callsites since many of them are deeply entangled and would require a lot of mocking; at the very least existing unit tests should verify that piping an empty label_selector through doesn't break anything.

I cherry-picked this PR and #63375 into a different PR and ran an async checkpointing and validation benchmark. I dove into the training and validation datasets and confirmed that all of their tasks were placed on the correct subcluster.

Alternative Considered

With this PR, the API for interleaved validation is simple - the user can simply set training and validation dataset execution options in the dataset_config:

import ray.train
from ray.data import ExecutionOptions

dataset_config = ray.train.DataConfig(
    datasets_to_split=["train", "test"],
    execution_options={
        "train": ExecutionOptions(label_selector={"subcluster": "train"}),
        "test": ExecutionOptions(label_selector={"subcluster": "validation"}),
    },
)

However, the API for async validation is a bit more confusing because we do not split the validation dataset in the overall trainer and therefore need to set it within the validation function itself:

dataset_config = ray.train.DataConfig(
    datasets_to_split=["train"],
    execution_options={
        "train": ExecutionOptions(label_selector={"subcluster": "train"}),
    },
)

...
def validate_fn(checkpoint):
    validation_dataset = ...
    validation_dataset.context.execution_options.label_selector = {
        "subcluster": "validation"
    }
    ...

One alternative @justinvyu and I discussed is to make DataConfig the sole public API even in the async validation case. However, this would mean that 1) we need to pass the dataset around from the driver to the trainer to the validator, as opposed to just creating it in the validator; 2) we need to remember not to split the validation dataset up front; and 3) we need a way for the validation function to get the un-split Dataset (as opposed to the DataIterator):

dataset_config = ray.train.DataConfig(
    datasets_to_split=["train"],
    execution_options={
        "train": ExecutionOptions(label_selector={"subcluster": "train"}),
        "test": ExecutionOptions(label_selector={"subcluster": "validation"}),
    },
)

def validate_fn(checkpoint, validation_dataset):
    validation_dataset.map_batches(...)


def train_fn():
    ...
    validation_dataset = ray.train.get_dataset_shard("train", split=False)
    ray.train.report(..., validate_fn=validate_fn, validation=ValidationTaskConfig(validation_dataset=validation_dataset))

Ultimately we decided that this PR's approach is fine for now, since modifying execution_options is already recommended as a sort of public API. Though the alternative approach is more "public facing," it moves more data around and might actually be even more confusing.

gemini-code-assist

Code Review

This pull request introduces a label_selector to ExecutionOptions, allowing users to constrain Ray Data tasks and actors to specific nodes within a cluster. The implementation propagates this selector across various components, including data sources, execution operators (Map, Shuffle, Limit, Zip), and planners (Aggregate, Sort, Repartition). A new utility, merge_label_selector, is added to handle merging context-level selectors with operator-level arguments while ensuring that existing node-pin selectors are preserved. Review feedback identified critical issues where method signatures in SplitRepartitionTaskScheduler were not updated to accept new arguments, which would lead to TypeError and NameError at runtime. Additionally, a suggestion was made to improve the robustness of the merging utility by handling null inputs for remote arguments.
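
The merging behavior summarized above might look roughly like the following. This is a sketch under stated assumptions, not the PR's merge_label_selector: the function name, its signature, and the use of the ray.io/node-id key to detect a node pin are all illustrative guesses.

```python
from typing import Any, Dict, Optional

# Assumption: a selector that pins a task to one node is keyed by this label.
NODE_ID_LABEL = "ray.io/node-id"


def merge_label_selector_sketch(
    ray_remote_args: Optional[Dict[str, Any]],
    context_selector: Optional[Dict[str, str]],
) -> Dict[str, Any]:
    """Merge a context-level selector into remote args without mutating inputs."""
    merged = dict(ray_remote_args or {})  # tolerate None remote args
    existing = dict(merged.get("label_selector") or {})
    if not context_selector or NODE_ID_LABEL in existing:
        # Nothing to add, or the task is already pinned to a node: keep as-is.
        return merged
    # Operator-level keys take precedence over context-level keys.
    merged["label_selector"] = {**context_selector, **existing}
    return merged
```

Treating None and an empty dict the same way means callers do not need to special-case datasets that never set a selector.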

justinvyu

The changes make sense. Thanks for being so thorough and catching all the cases where operators spawn tasks outside of the typical maps.

One concern I have is that any future modifications to operators / new operators will need to remember to add these label options. One idea I had was to automatically merge the label in cached_remote_fn. But then you'd need to fetch the global DataContext.get_current(), which might not be consistent with the current dataset's context (e.g. if you set it differently per dataset). So we can just go with the current way where we explicitly pass the arg into every remote call.

Also, could you make a note of the places where we pull the label selector from DataContext.get_current()? Can this cause issues?

DataContext.get_current().label_selector = "A"
ds = ray.data.read_parquet().map()

DataContext.get_current().label_selector = "B"
ds.materialize()  # <-- some places will use "B" and some will use "A"?

TimothySeah · 2026-05-22T02:04:34Z

One concern I have is that any future modifications to operators / new operators will need to remember to add these label options.

I agree this isn't ideal, but forgetting to do this isn't that bad since we would just have a bit of resource leakage, which is no worse than what would happen today. Is there a good data test I can run to test this? I was thinking of running an "interesting" data workload and verifying that all the tasks that were created landed on the right subcluster.
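
One way to express the check described above, sketched under the assumption that the test can collect (task name, node labels) pairs by some means. The function name and the input shape are hypothetical, and only exact-match selector values are handled.

```python
from typing import Dict, Iterable, List, Tuple


def find_misplaced_tasks(
    placements: Iterable[Tuple[str, Dict[str, str]]],
    expected_selector: Dict[str, str],
) -> List[str]:
    """Return names of tasks whose node labels do not satisfy the selector.

    placements holds (task_name, labels_of_the_node_it_ran_on) pairs; how
    they are gathered is left to the test harness (an assumption here).
    """
    return [
        name
        for name, node_labels in placements
        if any(node_labels.get(k) != v for k, v in expected_selector.items())
    ]
```

A test could then assert that the returned list is empty for each dataset's expected subcluster.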

Also, could you make a note of the places where we pull the label selector from DataContext.get_current()? Can this cause issues?
DataContext.get_current().label_selector = "A"
ds = ray.data.read_parquet().map()

DataContext.get_current().label_selector = "B"
ds.materialize()  # <-- some places will use "B" and some will use "A"?

Good callout. This PR adds 5 places that pull the label selector from DataContext.get_current(). 4 of them happen at dataset construction time (parquet_datasource.py, file_meta_provider.py, random_access_dataset.py, read_api.py) so all of them would use A in your example. The Ray Train DataConfig.execution_options path that we agreed to recommend to users unfortunately won't be respected here. It's not great but I think we can either:

Tell users to set the DataContext.get_current().execution_options before constructing each dataset.
Do nothing. IIUC these tasks are relatively lightweight so they won't mess up backpressure.
Pipe label_selector through these "construction" paths as well.

Open to suggestions here.

The 5th place is CheckpointManager.load_checkpoint/CheckpointManager._clean_pending_checkpoints. I changed these to use the DataContext passed from the Dataset that created them instead of DataContext.get_current(), which should close the mismatch here.
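
To illustrate why holding the passed-in context avoids the mismatch, here is a toy model. ToyContext and ToyCheckpointManager are stand-ins invented for this sketch and do not mirror the real DataContext or CheckpointManager interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ToyContext:
    """Stand-in for a per-dataset context; not Ray's DataContext."""

    label_selector: Dict[str, str] = field(default_factory=dict)


# Plays the role of the process-wide "current" context.
_GLOBAL = ToyContext()


class ToyCheckpointManager:
    def __init__(self, ctx: ToyContext):
        # Keep the context handed over by the owning dataset.
        self._ctx = ctx

    def selector_from_passed_context(self) -> Dict[str, str]:
        return self._ctx.label_selector

    def selector_from_global(self) -> Dict[str, str]:
        # What a global lookup would see at call time.
        return _GLOBAL.label_selector


dataset_ctx = ToyContext(label_selector={"subcluster": "A"})
manager = ToyCheckpointManager(dataset_ctx)
# The global context is changed later, e.g. while building another dataset.
_GLOBAL.label_selector = {"subcluster": "B"}
```

In this toy setup the passed-in context still reports subcluster A, while the global lookup reports whatever was set most recently.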

justinvyu · 2026-05-27T16:10:17Z

Is there a good data test I can run to test this?

heterogeneous_memory_batch_inference would be good to test (multiple node types).

Tell users to set the DataContext.get_current().execution_options before constructing each dataset.

Gotcha, let's just make sure to publicly document these caveats.

Kind of related question: how will DataConfig.execution_options configure async validation datasets?

Ex:

train_ds = ray.data.read_parquet(...)
valid_ds = ray.data.read_parquet(...)

# We don't actually pass `valid_ds` through `TorchTrainer`
TorchTrainer(train_loop_config={"valid_ds": valid_ds}, datasets={"train": train_ds}, dataset_config=DataConfig(execution_options={"train": ...}))

def valid_fn(valid_ds):
    # Does the user need to actually use the DataContext API?
    valid_ds.context.label_selector = ...
    valid_ds.materialize()

def train_fn_per_worker(config):
    ray.train.report(..., valid_fn, config["valid_ds"])

TimothySeah · 2026-05-27T20:06:53Z

Kind of related question: how will DataConfig.execution_options configure async validation datasets?

Yeah, the user would need to set valid_ds.context.execution_options in the driver or in the validation function.

justinvyu

Can you call out the alternative API proposal we discussed? Ok to go with modifying the context since we already recommend that as a sort of public API. Let's label this somehow as an Alpha configuration.

TimothySeah · 2026-05-29T00:23:39Z

1) Try to centralize task options rather than piping them everywhere. This can be a followup for other global settings that we use, e.g. scheduling_strategy.
2) Raise an error if the label selector conflicts with execution options, e.g. if map_batches sets something different from what is on the base dataset. This might also be easier if we do 1) first.

I already filed a bug for the above.
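
A rough sketch of what the conflict check in the second follow-up could look like, assuming both selectors are plain string-to-string dicts. The function name check_selector_conflict is hypothetical and nothing like it is confirmed to exist in Ray Data.

```python
from typing import Dict, Optional


def check_selector_conflict(
    dataset_selector: Optional[Dict[str, str]],
    operator_selector: Optional[Dict[str, str]],
) -> None:
    """Raise ValueError if both selectors set the same key to different values."""
    base = dataset_selector or {}
    override = operator_selector or {}
    conflicts = {
        key: (base[key], override[key])
        for key in base.keys() & override.keys()
        if base[key] != override[key]
    }
    if conflicts:
        raise ValueError(f"Conflicting label_selector values: {conflicts}")
```

Keys that appear in only one of the two selectors are not treated as conflicts in this sketch, since they can be satisfied together.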

TimothySeah · 2026-05-29T19:14:36Z

Can you call out the alternative API proposal we discussed? Ok to go with modifying the context since we already recommend that as a sort of public API. Let's label this somehow as an Alpha configuration.

Added to PR description.

…r resources by subcluster label (#63375) The end goal is to support 2 ray data datasets in 1 cluster with subcluster label scheduling. In such a setup, we have 2 datasets sharing the same AutoscalingCoordinator. The previous PR in this stack (#63331) made sure that each dataset's tasks ended up in the correct subcluster. This PR ensures that all requesters, whether they are trainers or datasets, only request and receive resources in their subcluster. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Co-authored-by: Justin Yu <justin.v.yu@gmail.com>

[data] Pipe DataContext.ExecutionOptions.label_selector to task submi…

d934d15

…ssions Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist Bot reviewed May 14, 2026

Comment thread python/ray/data/_internal/planner/exchange/split_repartition_task_scheduler.py

Comment thread python/ray/data/_internal/planner/repartition.py

Comment thread python/ray/data/_internal/execution/util.py

pipe label_selector through other ray data tasks

5edded7

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah marked this pull request as ready for review May 14, 2026 21:02

TimothySeah requested a review from a team as a code owner May 14, 2026 21:02

ray-gardener Bot added the data Ray Data-related issues label May 15, 2026

TimothySeah mentioned this pull request May 15, 2026

[data] Support multiple datasets in a cluster (2/2): partition cluster resources by subcluster label #63375

Merged

justinvyu reviewed May 20, 2026

CheckpointManager should use passed DataContext instead of DataContex…

30cf68f

…t.get_current() Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from justinvyu May 27, 2026 00:35

justinvyu approved these changes May 28, 2026

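TimothySeah changed the title from "[data] Pipe DataContext.ExecutionOptions.label_selector to task submissions" to "[data] Support multiple datasets in a cluster (1/2): Pipe DataContext.ExecutionOptions.label_selector to task submissions" May 29, 2026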

justinvyu enabled auto-merge (squash) May 29, 2026 22:13

github-actions Bot added the go add ONLY when ready to merge, run all tests label May 29, 2026

Merge remote-tracking branch 'upstream/master' into tseah/2-datasets-…

414d081

…prototype

github-actions Bot disabled auto-merge May 29, 2026 23:00

justinvyu enabled auto-merge (squash) May 29, 2026 23:10

justinvyu merged commit 750ef4e into ray-project:master May 29, 2026
7 checks passed

justinvyu merged 4 commits into ray-project:master from TimothySeah:tseah/2-datasets-prototype
