[Data] Log DataContext at the beginning of execution for configuration traceability by rayhhome · Pull Request #61150 · ray-project/ray

rayhhome · 2026-02-18T22:13:55Z

Description

Print out DataContext at the beginning of each dataset execution to make the entire execution configuration traceable.

Additional information

DataContext contains the originally logged ExecutionOptions.

Example log (ellipses are for readability in this PR; the actual log contains the full DataContext):
DEBUG streaming_executor.py:191 -- Data Context:
{'_checkpoint_config': None,
'_enable_actor_pool_on_exit_hook': False,
'_execution_idx': 0,
'_kv_configs': {},
'_max_num_blocks_in_streaming_gen_buffer': 2,
'_shuffle_strategy': <ShuffleStrategy.HASH_SHUFFLE: 'hash_shuffle'>,
'_task_pool_data_task_remote_args': {},
...
'write_file_retry_on_errors': ('AWS Error INTERNAL_FAILURE',
'AWS Error NETWORK_CONNECTION',
'AWS Error SLOW_DOWN',
'AWS Error UNKNOWN (HTTP status 503)')}

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

gemini-code-assist

Code Review

This pull request improves the traceability of dataset executions by logging the ExecutionOptions and DataContext at the beginning of execution. The use of pprint enhances the readability of the logged configurations. My review includes a suggestion to combine the logging statements for better conciseness and to group related information.

Copilot

Pull request overview

This PR enhances execution traceability by logging both the DataContext and ExecutionOptions at the beginning of dataset execution. The logging uses pprint.pformat() to improve readability of the logged configuration objects.

Changes:

Added pprint import for formatted output
Modified logging to use pprint.pformat(vars(...)) for both ExecutionOptions and DataContext
Added new logging statement for DataContext configuration
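
As a rough illustration of the approach the reviews describe, the sketch below formats an object's attributes with `pprint.pformat(vars(...))` and logs them at DEBUG level. `ExampleContext`, `format_context`, `log_context`, and the logger name are hypothetical stand-ins, not Ray's classes or the code merged in this PR.

```python
import logging
import pprint

logger = logging.getLogger("example.executor")  # hypothetical logger name


class ExampleContext:
    """Hypothetical stand-in for a configuration object such as DataContext."""

    def __init__(self):
        self.target_max_block_size = 128 * 1024 * 1024
        self.write_file_retry_on_errors = (
            "AWS Error INTERNAL_FAILURE",
            "AWS Error SLOW_DOWN",
        )
        self._kv_configs = {}


def format_context(ctx) -> str:
    # vars() returns the instance attribute dict; pformat sorts the keys
    # and wraps long values across lines.
    return pprint.pformat(vars(ctx))


def log_context(ctx) -> None:
    # Only pay the formatting cost when DEBUG logging is enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Data Context:\n%s", format_context(ctx))
```

Sorting the keys is what places underscore-prefixed fields such as `_kv_configs` ahead of the public ones, which matches the ordering in the example log above.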

goutamvenkat-anyscale

Just one comment

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

cursor · 2026-02-24T21:46:28Z

-            logger.debug("Execution config: %s", self._options)
+            # Log the full DataContext for traceability
+            if logger.isEnabledFor(logging.DEBUG) and log_once(
+                f"ray_data_log_context_{self._dataset_id}"


log_once key includes execution index causing repeat logging

Medium Severity

The log_once key uses self._dataset_id, which includes the execution index (_run_index) that increments with each execution. This causes the DataContext to be logged on every execution of the same dataset (e.g., in multiple training epochs), rather than once per dataset as intended per the PR discussion. The PR author stated "Went with log once for each dataset based on id", but the implementation logs once per execution instead.

bveeramani · Feb 25, 2026

I think logging once per execution is reasonable. Don't want to bikeshed on this.
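
To make the behavior under discussion concrete, here is a minimal sketch of a log-once guard keyed by dataset ID. The `log_once` function below is a hypothetical stand-in written for this example (it is not Ray's implementation), and the reading of the trailing number in an ID like `dataset_2_0` as the execution index is an assumption based on the review comment.

```python
import logging

logger = logging.getLogger("example.executor")  # hypothetical logger name
_seen_keys = set()


def log_once(key: str) -> bool:
    """Hypothetical stand-in: True the first time a key is seen, else False."""
    if key in _seen_keys:
        return False
    _seen_keys.add(key)
    return True


def maybe_log_context(dataset_id: str, context_repr: str) -> bool:
    """Log the context once per distinct dataset_id; return whether it logged."""
    # `and` short-circuits, so the key is only recorded when DEBUG is enabled.
    if logger.isEnabledFor(logging.DEBUG) and log_once(
        f"ray_data_log_context_{dataset_id}"
    ):
        logger.debug("Data Context for dataset %s:\n%s", dataset_id, context_repr)
        return True
    return False
```

With DEBUG enabled, calling `maybe_log_context` with `dataset_2_0` and then `dataset_2_1` logs both times because the keys differ, while a second call with `dataset_2_1` is suppressed. That is the once-per-execution behavior the thread settles on.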

## Description

Log DataContext in JSON (dictionary) format with `sanitize_for_struct` instead of with `pprint`.

## Related issues

Improves upon #61150.

## Additional information

**Using `sanitize_for_struct` instead of `json.dump` directly because the following fields in `DataContext` are not JSON serializable:**

- `execution_options: ExecutionOptions`, which is a regular class;
- `checkpoint_config: Optional[CheckpointConfig]`, which is a regular class;
- `issue_detectors_config: IssueDetectorsConfiguration`, which contains a list of class type objects: `List[Type[IssueDetector]]`;
- `scheduling_strategy: SchedulingStrategyT`, which is a `Union` involving regular classes;
- `custom_execution_callback_classes: List[Type["ExecutionCallback"]]`, which is a list of class types.

Example log (ellipses are for readability in this PR; the actual log contains the full DataContext):

DEBUG streaming_executor.py:197 -- Data Context for dataset dataset_2_0:
{'target_max_block_size': 134217728, 'target_min_block_size': 1048576, 'streaming_read_buffer_size': 33554432, ...}

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
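
The serialization problem described above can be reproduced with plain Python. The sketch below shows `json.dumps` rejecting a regular class instance and a class type, then a recursive converter that makes the same structure serializable. `to_jsonable`, `ExampleOptions`, and `ExampleCallback` are hypothetical names for illustration; this does not reproduce the behavior or signature of `sanitize_for_struct`.

```python
import json
from enum import Enum


class ExampleOptions:
    """Hypothetical regular class, standing in for something like ExecutionOptions."""

    def __init__(self):
        self.preserve_order = False


class ExampleCallback:
    """Hypothetical class whose type (not an instance) is stored in the config."""


def to_jsonable(value):
    """Hypothetical sanitizer: recursively convert a value to JSON-friendly types."""
    if isinstance(value, Enum):
        return to_jsonable(value.value)
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set, frozenset)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, type):
        # Class objects become their dotted name.
        return f"{value.__module__}.{value.__qualname__}"
    if hasattr(value, "__dict__"):
        # Regular class instances become a dict of their attributes.
        return to_jsonable(vars(value))
    return repr(value)  # last resort for anything else


config = {
    "execution_options": ExampleOptions(),
    "custom_execution_callback_classes": [ExampleCallback],
    "target_max_block_size": 134217728,
}

try:
    json.dumps(config)
except TypeError as exc:
    print(f"json.dumps failed: {exc}")

print(json.dumps(to_jsonable(config)))
```

Converting before serializing keeps the log line machine-parseable, at the cost of flattening class types to strings.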

rayhhome added 6 commits February 17, 2026 15:57

Preliminary change to print data context

adea3a9

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge remote-tracking branch 'origin/master' into print-data-context

800a0ff

Prettier printed message

fbab811

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

6121e4c

Change message level from info to debug

5c4bc77

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Fix message level mistake

24d5479

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Copilot AI review requested due to automatic review settings February 18, 2026 22:13

rayhhome requested a review from a team as a code owner February 18, 2026 22:13

Copilot started reviewing on behalf of rayhhome February 18, 2026 22:14

gemini-code-assist Bot reviewed Feb 18, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

cursor Bot reviewed Feb 18, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

Copilot AI reviewed Feb 18, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

rayhhome changed the title ~~Log DataContext at the beginning of execution for configuration traceability~~ Feb 18, 2026

rayhhome added 2 commits February 18, 2026 16:29

Clearer and more efficient log message

08da0ee

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

2b26e6e

cursor Bot reviewed Feb 19, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

rayhhome added 2 commits February 18, 2026 16:52

Remove redundant ExecutionOptions message + use dataContext repr

9ef9cc1

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge remote-tracking branch 'refs/remotes/origin/print-data-context'…

9a0ceca

… into print-data-context

cursor Bot reviewed Feb 19, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

Change log level

bb62e3b

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

ray-gardener Bot added the community-contribution (Contributed by the community) label Feb 19, 2026

Merge branch 'master' into print-data-context

c459741

goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Feb 19, 2026

goutamvenkat-anyscale reviewed Feb 19, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

goutamvenkat-anyscale approved these changes Feb 19, 2026

rayhhome added 3 commits February 19, 2026 16:29

Use log_once for Data Context

b17c1fc

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Log once no longer trigger on each dataset

dfc7202

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

a839653

cursor Bot reviewed Feb 20, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py

rayhhome added 3 commits February 23, 2026 10:00

Merge branch 'master' into print-data-context

1cbdc7a

Merge branch 'master' into print-data-context

a574e5c

Merge branch 'master' into print-data-context

10ff713

bveeramani reviewed Feb 23, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

rayhhome added 4 commits February 23, 2026 15:00

Log once for each dataset based on id

f12da12

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

677b5d3

Merge branch 'master' into print-data-context

ca832e8

Merge branch 'master' into print-data-context

efa06cf

cursor Bot reviewed Feb 24, 2026

bveeramani approved these changes Feb 25, 2026

bveeramani merged commit 3956d0d into ray-project:master Feb 25, 2026
6 checks passed

rayhhome mentioned this pull request Mar 2, 2026

[Data] Print data context in JSON (Dictionary) Format #61428

Merged

bveeramani merged 22 commits into ray-project:master from rayhhome:print-data-context

4 participants
