
[Data] Print data context in JSON (Dictionary) Format#61428

Merged
bveeramani merged 36 commits into
ray-project:masterfrom
rayhhome:print-data-context
Mar 7, 2026

@rayhhome rayhhome commented Mar 2, 2026

Contributor

Description

Log the DataContext in JSON (dictionary) format using sanitize_for_struct instead of pprint.

Related issues

Improves upon #61150.

Additional information

This PR uses sanitize_for_struct instead of calling json.dump directly because the following fields in DataContext are not JSON serializable:

  • execution_options: ExecutionOptions, which is a regular class;
  • checkpoint_config: Optional[CheckpointConfig], which is a regular class;
  • issue_detectors_config: IssueDetectorsConfiguration, which contains a list of class type objects: List[Type[IssueDetector]];
  • scheduling_strategy: SchedulingStrategyT, which is a Union involving regular classes;
  • custom_execution_callback_classes: List[Type["ExecutionCallback"]], which is a list of class types.
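
The serialization problem described in the list above can be reproduced outside Ray. The following is a minimal sketch, assuming hypothetical stand-in classes (`Options`, `Callback`, `Context`) that are not Ray's actual types; it only illustrates that `json.dumps` rejects regular class instances and class objects while accepting plain values.

```python
import json
from dataclasses import dataclass, field
from typing import List, Optional, Type


class Options:
    """Hypothetical stand-in for a regular (non-dataclass) options class."""

    def __init__(self, cpu: float = 1.0):
        self.cpu = cpu


class Callback:
    """Hypothetical stand-in for a callback base class."""


@dataclass
class Context:
    """Hypothetical stand-in for a context with mixed field types."""

    target_max_block_size: int = 134217728
    options: Options = field(default_factory=Options)
    callback_classes: List[Type[Callback]] = field(
        default_factory=lambda: [Callback]
    )


def try_dump(value) -> Optional[str]:
    """Return the TypeError message from json.dumps, or None on success."""
    try:
        json.dumps(value)
        return None
    except TypeError as exc:
        return str(exc)


ctx = Context()
int_error = try_dump(ctx.target_max_block_size)  # plain int serializes
instance_error = try_dump(ctx.options)           # regular class instance fails
class_error = try_dump(ctx.callback_classes)     # list of class objects fails
```

Any field of either failing kind is enough to make a direct `json.dumps` call on the whole structure raise, which is why some conversion step is needed first.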

Example log (ellipses are for readability in this PR; the actual log contains the full DataContext):
DEBUG streaming_executor.py:197 -- Data Context for dataset dataset_2_0:
{'target_max_block_size': 134217728, 'target_min_block_size': 1048576, 'streaming_read_buffer_size': 33554432, ...}

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
@rayhhome rayhhome self-assigned this Mar 2, 2026
Copilot AI review requested due to automatic review settings March 2, 2026 21:52
@rayhhome rayhhome requested a review from a team as a code owner March 2, 2026 21:52
@rayhhome rayhhome added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Mar 2, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR improves upon #61150, which introduced logging of DataContext at the start of dataset execution. The change replaces pprint.pformat(self._data_context) with asdict(self._data_context) when formatting the debug log message, with the goal of producing more structured (dictionary-based) output.

Changes:

  • Removes import pprint and adds from dataclasses import asdict
  • Replaces the pprint.pformat(...) call with asdict(...) when logging DataContext

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request changes the logging of DataContext from using pprint.pformat to dataclasses.asdict. While this is a good simplification for a dataclass, the resulting log output will be a single-line string representation of a dictionary, which can be hard to read for large contexts and isn't valid JSON as stated in the PR description. My feedback suggests using the json module to produce a properly formatted and readable JSON string, which aligns better with the PR's intent and improves log readability.
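
The distinction this review draws, between a dictionary's string representation and valid JSON, can be checked directly. This is a small sketch using a hypothetical `Ctx` dataclass (not Ray's `DataContext`): formatting the result of `asdict` with `str` produces Python literal syntax, which a JSON parser rejects, while `json.dumps` produces parseable, optionally indented output.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class Ctx:
    """Hypothetical stand-in containing only JSON-friendly field types."""

    name: str = "dataset_2_0"
    eager_free: bool = True
    checkpoint: Optional[str] = None


def is_valid_json(text: str) -> bool:
    """Report whether the text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


as_dict = asdict(Ctx())
# What "%s" formatting of a dict yields: single quotes, True, None.
repr_text = str(as_dict)
# What json.dumps yields: double quotes, true, null, one key per line.
json_text = json.dumps(as_dict, indent=2)
```

Here `is_valid_json(repr_text)` is false and `is_valid_json(json_text)` is true, so a log consumer that expects JSON can only parse the second form.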

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated
@rayhhome rayhhome changed the title Print data context Mar 2, 2026

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated
  logger.debug(
      f"Data Context for dataset {self._dataset_id}:\n%s",
-     pprint.pformat(self._data_context),
+     sanitize_for_struct(

Contributor

Will this truncate nested values inside the DataContext, or does it truncate the whole context? I believe we want to log the full context and not lose keys.

Contributor Author

This might truncate nesting inside the datacontext, e.g.

'execution_options': 'ExecutionOptions(resource_limits=ExecutionResources(cpu=inf, gpu=inf, object_store_memory=inf, memor...'

I tried to mitigate this issue by setting truncate_length to DATA_CONTEXT_LOG_TRUNCATE_LENGTH. My testing shows that none of the fields are truncated in the debug output with the current configuration (log.txt).

Member

+1 @rayhhome I have the same question

Contributor Author

Based on the current truncate_length passed in (DATA_CONTEXT_LOG_TRUNCATE_LENGTH, i.e. 10000), the DataContext will not be truncated unless it contains a string with more than 10000 characters or a list with more than 10000 elements, which I believe is unlikely.
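
To illustrate the per-value truncation behavior described in this reply, here is a minimal sketch. `sanitize_sketch` is a hypothetical function written for this example and is not Ray's `sanitize_for_struct`; it assumes that truncation applies to individual strings and lists rather than to the structure as a whole, and that unknown objects fall back to their `repr`.

```python
from typing import Any


def sanitize_sketch(obj: Any, truncate_length: int) -> Any:
    """Hypothetical sanitizer; an assumption, not Ray's implementation."""
    if obj is None or isinstance(obj, (bool, int, float)):
        return obj
    if isinstance(obj, str):
        # Only strings longer than the limit are cut and marked with "...".
        if len(obj) > truncate_length:
            return obj[:truncate_length] + "..."
        return obj
    if isinstance(obj, dict):
        # Every key is kept; only the values are sanitized.
        return {str(k): sanitize_sketch(v, truncate_length) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Only the first truncate_length elements are kept.
        return [sanitize_sketch(v, truncate_length) for v in obj[:truncate_length]]
    # Anything else (regular class instances, class objects) becomes its repr.
    return sanitize_sketch(repr(obj), truncate_length)


class Opts:
    """Hypothetical options class with a long repr."""

    def __repr__(self) -> str:
        return "Opts(" + "x" * 50 + ")"


context = {"block_size": 134217728, "opts": Opts(), "items": list(range(5))}
short = sanitize_sketch(context, truncate_length=20)     # cuts the long repr
full = sanitize_sketch(context, truncate_length=10000)   # nothing is cut
```

Under these assumptions, a small limit shortens the `opts` value but never drops a key, and a limit of 10000 leaves this example untouched.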

Currently, calling json.dumps directly on DataContext raises an exception because a few fields in DataContext are not JSON serializable (I've documented these fields in the PR description). I can use _json_default to get around this, though. Do we want to switch to json.dumps instead of sanitize_for_struct?
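
The alternative raised here, passing a fallback hook to `json.dumps`, can be sketched with the standard library's `default` parameter. The hook below, `fallback_default`, is a hypothetical illustration and is not the `_json_default` helper mentioned in this comment, whose behavior is not shown in this thread; `Options` and `Callback` are likewise stand-ins.

```python
import json


class Options:
    """Hypothetical stand-in for a regular options class."""

    def __init__(self, cpu: float = 1.0):
        self.cpu = cpu

    def __repr__(self) -> str:
        return f"Options(cpu={self.cpu})"


class Callback:
    """Hypothetical stand-in for a callback class."""


def fallback_default(obj):
    """Hypothetical hook: json.dumps calls it for values it cannot encode."""
    if isinstance(obj, type):
        # Class objects are rendered as a dotted name.
        return f"{obj.__module__}.{obj.__qualname__}"
    # Other objects are rendered as their repr string.
    return repr(obj)


context = {
    "target_max_block_size": 134217728,
    "options": Options(),
    "callbacks": [Callback],
}
text = json.dumps(context, default=fallback_default, indent=2)
decoded = json.loads(text)
```

With this approach the output is valid JSON and no value is truncated, at the cost of flattening non-serializable objects into strings.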

@bveeramani bveeramani merged commit 598ca8d into ray-project:master Mar 7, 2026
5 of 6 checks passed
ParagEkbote pushed a commit to ParagEkbote/ray that referenced this pull request Mar 10, 2026
@rayhhome rayhhome deleted the print-data-context branch March 10, 2026 19:36
