[Data] Log DataContext at the beginning of execution for configuration traceability #61150

Merged
bveeramani merged 22 commits into ray-project:master from rayhhome:print-data-context on Feb 25, 2026
@rayhhome rayhhome commented Feb 18, 2026

Description

Print out DataContext at the beginning of each dataset execution to make the entire execution configuration traceable.

Additional information

DataContext contains the originally logged ExecutionOptions.

Example log (ellipses are for readability in this PR; the actual log contains the full DataContext):
DEBUG streaming_executor.py:191 -- Data Context:
{'_checkpoint_config': None,
 '_enable_actor_pool_on_exit_hook': False,
 '_execution_idx': 0,
 '_kv_configs': {},
 '_max_num_blocks_in_streaming_gen_buffer': 2,
 '_shuffle_strategy': <ShuffleStrategy.HASH_SHUFFLE: 'hash_shuffle'>,
 '_task_pool_data_task_remote_args': {},
 ...
 'write_file_retry_on_errors': ('AWS Error INTERNAL_FAILURE',
                                'AWS Error NETWORK_CONNECTION',
                                'AWS Error SLOW_DOWN',
                                'AWS Error UNKNOWN (HTTP status 503)')}

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Copilot AI review requested due to automatic review settings February 18, 2026 22:13
@rayhhome rayhhome requested a review from a team as a code owner February 18, 2026 22:13

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request improves the traceability of dataset executions by logging the ExecutionOptions and DataContext at the beginning of execution. The use of pprint enhances the readability of the logged configurations. My review includes a suggestion to combine the logging statements for better conciseness and to group related information.


Copilot AI left a comment


Pull request overview

This PR enhances execution traceability by logging both the DataContext and ExecutionOptions at the beginning of dataset execution. The logging uses pprint.pformat() to improve readability of the logged configuration objects.

Changes:

  • Added pprint import for formatted output
  • Modified logging to use pprint.pformat(vars(...)) for both ExecutionOptions and DataContext
  • Added new logging statement for DataContext configuration
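The `pprint.pformat(vars(...))` pattern named in the list above can be sketched as follows. This is a minimal illustration only: `DemoContext`, `format_config`, `log_config`, and the logger name are hypothetical stand-ins, not the code in `streaming_executor.py`.

```python
import logging
import pprint
from dataclasses import dataclass, field

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("demo.streaming_executor")


@dataclass
class DemoContext:
    """Hypothetical stand-in for a configuration object such as DataContext."""

    target_max_block_size: int = 128 * 1024 * 1024
    retry_errors: tuple = ("AWS Error SLOW_DOWN", "AWS Error NETWORK_CONNECTION")
    kv_configs: dict = field(default_factory=dict)


def format_config(obj) -> str:
    """Render the instance attributes of `obj` as a pretty-printed dict."""
    # vars(obj) returns the instance __dict__; pformat sorts the keys by default.
    return pprint.pformat(vars(obj))


def log_config(obj) -> None:
    """Log the configuration at DEBUG level, skipping the formatting otherwise."""
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Data Context:\n%s", format_config(obj))


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    log_config(DemoContext())
```

Guarding with `isEnabledFor` avoids paying for `pformat` on a large configuration object when DEBUG logging is off.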

@rayhhome rayhhome changed the title Log DataContext at the beginning of execution for configuration traceability Feb 18, 2026
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
@ray-gardener ray-gardener Bot added the community-contribution (Contributed by the community) label Feb 19, 2026
@goutamvenkat-anyscale goutamvenkat-anyscale added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Feb 19, 2026

@goutamvenkat-anyscale goutamvenkat-anyscale left a comment


Just one comment

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


logger.debug("Execution config: %s", self._options)
# Log the full DataContext for traceability
if logger.isEnabledFor(logging.DEBUG) and log_once(
f"ray_data_log_context_{self._dataset_id}"


log_once key includes execution index causing repeat logging

Medium Severity

The log_once key uses self._dataset_id, which includes the execution index (_run_index) that increments with each execution. This causes the DataContext to be logged on every execution of the same dataset (e.g., in multiple training epochs), rather than once per dataset as intended per the PR discussion. The PR author stated "Went with log once for each dataset based on id", but the implementation logs once per execution instead.



I think logging once per execution is reasonable. Don't want to bikeshed on this
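The two keying choices discussed in this thread can be sketched as below. The `log_once` here is a hypothetical stand-in written for this example, not Ray's helper, and the key strings only assume that a dataset ID ends with an execution index (as in `dataset_2_0`).

```python
_seen_keys = set()


def log_once(key: str) -> bool:
    """Return True the first time `key` is seen and False afterwards.

    Hypothetical stand-in for a log-once helper, written for this sketch.
    """
    if key in _seen_keys:
        return False
    _seen_keys.add(key)
    return True


def count_logs(keys) -> int:
    """Count how many of the given keys would actually trigger a log line."""
    return sum(1 for key in keys if log_once(key))


# Key includes the execution index: each of the 3 executions gets a fresh key.
per_execution = count_logs(f"ctx_dataset_2_{run}" for run in range(3))

# Key omits the execution index: only the first execution logs.
per_dataset = count_logs("ctx_dataset_2" for _ in range(3))

print(per_execution, per_dataset)
```

With the execution index in the key, the context is logged on every run of the same dataset. That is the behavior the reviewer flagged and the maintainer accepted.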

@bveeramani bveeramani merged commit 3956d0d into ray-project:master Feb 25, 2026
6 checks passed
bveeramani pushed a commit that referenced this pull request Mar 7, 2026
## Description
Log DataContext in JSON (dictionary) format with
`sanitize_for_struct` instead of with `pprint`.

## Related issues
Improves upon #61150.

## Additional information
**Using `sanitize_for_struct` instead of `json.dump` directly because
the following fields in `DataContext` are not JSON serializable:**
- `execution_options: ExecutionOptions`, which is a regular class;
- `checkpoint_config: Optional[CheckpointConfig]`, which is a regular
class;
- `issue_detectors_config: IssueDetectorsConfiguration`, which contains
a list of class type objects: `List[Type[IssueDetector]]`;
- `scheduling_strategy: SchedulingStrategyT`, which is a `Union`
involving regular classes;
- `custom_execution_callback_classes: List[Type["ExecutionCallback"]]`,
which is a list of class types.
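To illustrate why a plain `json` call fails on fields like these, here is a minimal sketch. `to_jsonable`, `DemoOptions`, and `DemoDetector` are hypothetical names for this example; the helper is not `sanitize_for_struct` and does not claim to match its behavior.

```python
import json


class DemoOptions:
    """Hypothetical regular class, standing in for a field like ExecutionOptions."""

    def __init__(self):
        self.verbose_progress = True  # hypothetical attribute


class DemoDetector:
    """Hypothetical class that is stored as a class type, not an instance."""


def to_jsonable(value):
    """Recursively convert a value to JSON-friendly types (illustrative only)."""
    if value is None or isinstance(value, (bool, int, float, str)):
        return value
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, type):
        # Class objects are reduced to their names.
        return value.__name__
    if hasattr(value, "__dict__"):
        # Regular class instances are reduced to their attribute dicts.
        return to_jsonable(vars(value))
    return repr(value)


config = {
    "target_max_block_size": 134217728,
    "execution_options": DemoOptions(),
    "detectors": [DemoDetector],
}

try:
    json.dumps(config)
    direct_ok = True
except TypeError:
    # json cannot serialize arbitrary class instances or class objects.
    direct_ok = False

sanitized = json.dumps(to_jsonable(config), sort_keys=True)
print(direct_ok, sanitized)
```

The `isinstance(value, type)` check comes before the `__dict__` check because class objects also expose a `__dict__`.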

Example log (ellipses are for readability in this PR; the actual log
contains the full DataContext):
DEBUG streaming_executor.py:197 -- Data Context for dataset dataset_2_0:
{'target_max_block_size': 134217728, 'target_min_block_size': 1048576,
'streaming_read_buffer_size': 33554432, ...}

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>