[Data] Print data context in JSON (Dictionary) Format by rayhhome · Pull Request #61428 · ray-project/ray

rayhhome · 2026-03-02T21:52:19Z

Description

Log DataContext in JSON (dictionary) format using sanitize_for_struct instead of pprint.

Related issues

Improves upon #61150.

Additional information

sanitize_for_struct is used instead of calling json.dump directly because the following fields in DataContext are not JSON serializable:

execution_options: ExecutionOptions, which is a regular class;
checkpoint_config: Optional[CheckpointConfig], which is a regular class;
issue_detectors_config: IssueDetectorsConfiguration, which contains a list of class type objects: List[Type[IssueDetector]];
scheduling_strategy: SchedulingStrategyT, which is a Union involving regular classes;
custom_execution_callback_classes: List[Type["ExecutionCallback"]], which is a list of class types.
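As a minimal illustration of the failure mode listed above, the sketch below uses hypothetical stand-in classes (`FakeExecutionOptions` and `FakeIssueDetector` are not Ray classes) to show that the standard `json` encoder rejects both instances of regular classes and class objects themselves:

```python
import json


class FakeExecutionOptions:
    """Hypothetical stand-in for a regular (non-dataclass) config class."""

    def __init__(self, preserve_order=False):
        self.preserve_order = preserve_order


class FakeIssueDetector:
    """Hypothetical stand-in for a class that is stored by type, not by instance."""


def is_json_serializable(value):
    """Return True if json.dumps accepts the value, False if it raises TypeError."""
    try:
        json.dumps(value)
        return True
    except TypeError:
        return False


# Hypothetical context dict; only the first key name comes from the example log.
context = {
    "target_max_block_size": 134217728,           # plain int: accepted
    "execution_options": FakeExecutionOptions(),  # instance of a regular class
    "detectors": [FakeIssueDetector],             # list of class objects
}

unserializable = sorted(
    key for key, value in context.items() if not is_json_serializable(value)
)
print(unserializable)
```

Only the plain integer field passes; the other two raise `TypeError` inside `json.dumps`, which is why some conversion step is needed before the context can be logged as a dictionary.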

Example log (the ellipsis is for readability in this PR; the actual log contains the full DataContext):
DEBUG streaming_executor.py:197 -- Data Context for dataset dataset_2_0:
{'target_max_block_size': 134217728, 'target_min_block_size': 1048576, 'streaming_read_buffer_size': 33554432, ...}

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>


Copilot

Pull request overview

This PR improves upon #61150, which introduced logging of DataContext at the start of dataset execution. The change replaces pprint.pformat(self._data_context) with asdict(self._data_context) when formatting the debug log message, with the goal of producing more structured (dictionary-based) output.

Changes:

Removes import pprint and adds from dataclasses import asdict
Replaces the pprint.pformat(...) call with asdict(...) when logging DataContext
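To make the difference between the two calls concrete, here is a small sketch with hypothetical dataclasses (`FakeContext` and `FakeResources` are illustrative, not Ray's `DataContext`). `pprint.pformat` returns a string built from the object's repr, while `dataclasses.asdict` returns a plain dictionary and recurses into nested dataclasses:

```python
import pprint
from dataclasses import asdict, dataclass, field


@dataclass
class FakeResources:
    """Hypothetical nested dataclass."""

    cpu: float = 4.0


@dataclass
class FakeContext:
    """Hypothetical stand-in for a dataclass-based context object."""

    target_max_block_size: int = 134217728
    resources: FakeResources = field(default_factory=FakeResources)


ctx = FakeContext()

as_text = pprint.pformat(ctx)  # a str derived from the dataclass repr
as_dict = asdict(ctx)          # a dict; nested dataclasses become dicts too

print(type(as_text).__name__, type(as_dict).__name__)
print(as_dict)
```

Note that `asdict` only converts dataclasses, dicts, lists, and tuples; any other object is copied as-is, so the resulting dictionary can still hold values that are not JSON serializable.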


gemini-code-assist

Code Review

This pull request changes the logging of DataContext from using pprint.pformat to dataclasses.asdict. While this is a good simplification for a dataclass, the resulting log output will be a single-line string representation of a dictionary, which can be hard to read for large contexts and isn't valid JSON as stated in the PR description. My feedback suggests using the json module to produce a properly formatted and readable JSON string, which aligns better with the PR's intent and improves log readability.
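The point about the dictionary string not being valid JSON can be checked with the standard library alone. In this sketch the keys other than `target_max_block_size` are hypothetical: the `str()` of a Python dict uses single quotes plus `True` and `None`, so a JSON parser rejects it, whereas `json.dumps` output round-trips.

```python
import json

# Hypothetical context values; only the first key name comes from the example log.
context = {
    "target_max_block_size": 134217728,
    "example_flag": True,
    "example_optional": None,
}

as_repr = str(context)                   # Python dict repr, single-quoted keys
as_json = json.dumps(context, indent=2)  # multi-line, valid JSON


def parses_as_json(text):
    """Return True if the text is accepted by json.loads."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_as_json(as_repr), parses_as_json(as_json))
```

Passing `indent=2` also addresses the readability concern, since each key is printed on its own line.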


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


goutamvenkat-anyscale · 2026-03-03T01:30:58Z

                logger.debug(
                    f"Data Context for dataset {self._dataset_id}:\n%s",
-                    pprint.pformat(self._data_context),
+                    sanitize_for_struct(


Will this truncate nested values inside the DataContext, or does it truncate the whole context? I believe we want to log the full context and not lose keys.

This might truncate nesting inside the datacontext, e.g.

'execution_options': 'ExecutionOptions(resource_limits=ExecutionResources(cpu=inf, gpu=inf, object_store_memory=inf, memor...'

I tried to mitigate this issue by setting truncate_length to DATA_CONTEXT_LOG_TRUNCATE_LENGTH. In my testing, none of the fields are truncated in the debug output with the current configuration (log.txt).

+1 @rayhhome I have the same question

Based on the current truncate_length passed in (DATA_CONTEXT_LOG_TRUNCATE_LENGTH, i.e. 10000), the DataContext will not be truncated unless it contains a string longer than 10000 characters or a list with more than 10000 elements, which I believe is unlikely.
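The truncation rule described above can be sketched as follows. This is a hypothetical illustration, not the actual `sanitize_for_struct` implementation: the function name `truncate_value`, the `"..."` marker, and the recursion details are all assumptions. Only the limit of 10000 and the "long strings or long lists" rule come from the discussion.

```python
TRUNCATE_LENGTH = 10000  # value quoted in the discussion above


def truncate_value(value, limit=TRUNCATE_LENGTH):
    """Hypothetical sketch: cap strings and lists at `limit`, recursing into containers."""
    if isinstance(value, str):
        # Strings at or under the limit pass through unchanged.
        return value if len(value) <= limit else value[:limit] + "..."
    if isinstance(value, dict):
        # Keys are always kept; only the values are examined.
        return {key: truncate_value(item, limit) for key, item in value.items()}
    if isinstance(value, (list, tuple)):
        items = [truncate_value(item, limit) for item in value[:limit]]
        if len(value) > limit:
            items.append("...")  # marker showing that elements were dropped
        return items
    return value


# With a deliberately tiny limit, both kinds of truncation become visible.
print(truncate_value({"name": "abcdefghij", "ids": list(range(8))}, limit=5))
```

Under this sketch, a context whose strings and lists all stay below the limit comes back with every key and value intact, which matches the behavior reported in the thread.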

Currently, calling json.dumps directly on DataContext raises an exception because a few fields in DataContext are not JSON serializable (I've documented these fields in the PR description). I can use _json_default to get around this, though. Do we want to switch to json.dumps instead of sanitize_for_struct?
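For reference, the `json.dumps` alternative mentioned here relies on the encoder's `default=` hook, which is called only for objects the encoder cannot handle. The sketch below uses a hypothetical fallback function and a hypothetical `FakeCheckpointConfig` class; it is not Ray's `_json_default`.

```python
import json


class FakeCheckpointConfig:
    """Hypothetical stand-in for a regular class stored on the context."""

    def __init__(self, path):
        self.path = path

    def __repr__(self):
        return f"FakeCheckpointConfig(path={self.path!r})"


def fallback_to_repr(obj):
    """Hypothetical `default=` hook: encode otherwise unsupported objects as their repr."""
    return repr(obj)


context = {
    "target_min_block_size": 1048576,
    "checkpoint_config": FakeCheckpointConfig("/tmp/ckpt"),
}

# Without `default=`, this call would raise TypeError on the config object.
encoded = json.dumps(context, default=fallback_to_repr)
print(encoded)
```

One caveat with this route: by default `json.dumps` writes `float("inf")` as `Infinity`, which is not strict JSON, so values such as `cpu=inf` would still need separate handling if strict output matters.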


rayhhome added 24 commits February 17, 2026 15:57

Preliminary change to print data context

adea3a9

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge remote-tracking branch 'origin/master' into print-data-context

800a0ff

Prettier printed message

fbab811

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

6121e4c

Change message level from info to debug

5c4bc77

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Fix message level mistake

24d5479

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Clearer and more efficient log message

08da0ee

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

2b26e6e

Remove redundant ExecutionOptions message + use dataContext repr

9ef9cc1

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge remote-tracking branch 'refs/remotes/origin/print-data-context' into print-data-context

9a0ceca

Change log level

bb62e3b

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

c459741

Use log_once for Data Context

b17c1fc

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Log once no longer trigger on each dataset

dfc7202

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

a839653

Merge branch 'master' into print-data-context

1cbdc7a

Merge branch 'master' into print-data-context

a574e5c

Merge branch 'master' into print-data-context

10ff713

Log once for each dataset based on id

f12da12

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

677b5d3

Merge branch 'master' into print-data-context

ca832e8

Merge branch 'master' into print-data-context

efa06cf

Merge remote-tracking branch 'origin' into print-data-context

c1e4300

Log DataContext in json format

3efde5f

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

rayhhome self-assigned this Mar 2, 2026

Copilot AI review requested due to automatic review settings March 2, 2026 21:52

rayhhome requested a review from a team as a code owner March 2, 2026 21:52

rayhhome added the data (Ray Data-related issues) and go (add ONLY when ready to merge, run all tests) labels Mar 2, 2026

Merge branch 'master' into print-data-context

0e8ce54

Copilot started reviewing on behalf of rayhhome March 2, 2026 21:53

Copilot AI reviewed Mar 2, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

gemini-code-assist Bot reviewed Mar 2, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

rayhhome added 2 commits March 2, 2026 14:10

Use sanitize_for_struct for formatting instead

39bf7ca

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

d861083

rayhhome changed the title ~~Print data context~~ Mar 2, 2026

cursor Bot reviewed Mar 2, 2026

Comment thread python/ray/data/_internal/execution/streaming_executor.py Outdated

rayhhome added 2 commits March 2, 2026 17:02

Increase truncate length to avoid truncation

6fb4c51

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'print-data-context' of github.com:rayhhome/ray into print-data-context

6566152

goutamvenkat-anyscale reviewed Mar 3, 2026

rayhhome added 5 commits March 3, 2026 10:58

Merge branch 'master' into print-data-context

0c4194e

Merge branch 'master' into print-data-context

b6481ac

Merge branch 'master' into print-data-context

01c072c

Adding to the truncation length macro to ensure fully logged datacontext

d08d1d5

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>

Merge branch 'master' into print-data-context

4a9c381

goutamvenkat-anyscale approved these changes Mar 5, 2026

rayhhome added 2 commits March 6, 2026 10:59

Merge branch 'master' into print-data-context

3335991

Merge branch 'master' into print-data-context

c794cd3

bveeramani merged commit 598ca8d into ray-project:master Mar 7, 2026
5 of 6 checks passed

rayhhome deleted the print-data-context branch March 10, 2026 19:36

bveeramani merged 36 commits into ray-project:master from rayhhome:print-data-context
