[Data] Print data context in JSON (Dictionary) Format#61428
Conversation
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
… into print-data-context
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
There was a problem hiding this comment.
Pull request overview
This PR improves upon #61150, which introduced logging of DataContext at the start of dataset execution. The change replaces pprint.pformat(self._data_context) with asdict(self._data_context) when formatting the debug log message, with the goal of producing more structured (dictionary-based) output.
Changes:
- Removes
import pprintand addsfrom dataclasses import asdict - Replaces the
pprint.pformat(...)call withasdict(...)when loggingDataContext
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Code Review
This pull request changes the logging of DataContext from using pprint.pformat to dataclasses.asdict. While this is a good simplification for a dataclass, the resulting log output will be a single-line string representation of a dictionary, which can be hard to read for large contexts and isn't valid JSON as stated in the PR description. My feedback suggests using the json module to produce a properly formatted and readable JSON string, which aligns better with the PR's intent and improves log readability.
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
| logger.debug( | ||
| f"Data Context for dataset {self._dataset_id}:\n%s", | ||
| pprint.pformat(self._data_context), | ||
| sanitize_for_struct( |
There was a problem hiding this comment.
will this truncate nesting inside the datacontext, or does it truncate the whole context? I believe we want to log the full context and not lose keys
There was a problem hiding this comment.
This might truncate nesting inside the datacontext, e.g.
'execution_options': 'ExecutionOptions(resource_limits=ExecutionResources(cpu=inf, gpu=inf, object_store_memory=inf, memor...'
I tried to mitigate this issue by setting truncate_length to DATA_CONTEXT_LOG_TRUNCATE_LENGTH. My testing shows that none of the fields would be truncated in the debug output with the current configuration (log.txt)
There was a problem hiding this comment.
Based on the current truncate_length passed into (which is DATA_CONTEXT_LOG_TRUNCATE_LENGTH, i.e. 10000), the DataContext will not be truncated unless there's any string with more than 10000 characters or any list with more than 10000 elements, which I believe is unlikely.
Currently, calling json.dumps directly on DataContext raises a exception because there are a few fields in DataContext that are not JSON serializable (I've documented such fields in the PR description). I can use _json_default to get around this though, do we want to switch to using json.dumps instead of sanitize_for_struct?
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
) ## Description Print Log DataContext in JSON (dictionary) format with `sanitize_for_struct` instead of with `pprint`. ## Related issues Improves upon ray-project#61150. ## Additional information **Using `sanitize_for_struct` instead of `json.dump` directly because the following fields in `DataContext` are not JSON serializable:** - `execution_options: ExecutionOptions`, which is a regular class; - `checkpoint_config: Optional[CheckpointConfig]`, which is a regular class; - `issue_detectors_config: IssueDetectorsConfiguration`, which contains a list of class type objects: `List[Type[IssueDetector]]`; - `scheduling_strategy: SchedulingStrategyT`, which is a `Union` involving regular classes; - `custom_execution_callback_classes: List[Type["ExecutionCallback"]]`, which is a list of class types. Example log (Ellipses are for readability of this PR, the log will contain full DataContext in actual execution): DEBUG streaming_executor.py:197 -- Data Context for dataset dataset_2_0: {'target_max_block_size': 134217728, 'target_min_block_size': 1048576, 'streaming_read_buffer_size': 33554432, ...} --------- Signed-off-by: Sirui Huang <ray.huang@anyscale.com> Signed-off-by: Parag Ekbote <thecoolekbote189@gmail.com>
Description
Print Log DataContext in JSON (dictionary) format with
sanitize_for_structinstead of withpprint.Related issues
Improves upon #61150.
Additional information
Using
sanitize_for_structinstead ofjson.dumpdirectly because the following fields inDataContextare not JSON serializable:execution_options: ExecutionOptions, which is a regular class;checkpoint_config: Optional[CheckpointConfig], which is a regular class;issue_detectors_config: IssueDetectorsConfiguration, which contains a list of class type objects:List[Type[IssueDetector]];scheduling_strategy: SchedulingStrategyT, which is aUnioninvolving regular classes;custom_execution_callback_classes: List[Type["ExecutionCallback"]], which is a list of class types.Example log (Ellipses are for readability of this PR, the log will contain full DataContext in actual execution):
DEBUG streaming_executor.py:197 -- Data Context for dataset dataset_2_0:
{'target_max_block_size': 134217728, 'target_min_block_size': 1048576, 'streaming_read_buffer_size': 33554432, ...}