[Data] Log DataContext at the beginning of execution for configuration traceability #61150
Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
Code Review
This pull request improves the traceability of dataset executions by logging the ExecutionOptions and DataContext at the beginning of execution. The use of pprint enhances the readability of the logged configurations. My review includes a suggestion to combine the logging statements for better conciseness and to group related information.
Pull request overview
This PR enhances execution traceability by logging both the DataContext and ExecutionOptions at the beginning of dataset execution. The logging uses pprint.pformat() to improve readability of the logged configuration objects.
Changes:
- Added pprint import for formatted output
- Modified logging to use pprint.pformat(vars(...)) for both ExecutionOptions and DataContext
- Added new logging statement for DataContext configuration
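The logging pattern summarized above can be sketched roughly as follows. `FakeContext`, `format_config`, and `log_context` are hypothetical names made up for this illustration, not Ray's `DataContext` or the PR's actual code; only `pprint.pformat`, `vars`, and the standard `logging` calls are real.

```python
import logging
import pprint

logger = logging.getLogger("example.streaming_executor")


class FakeContext:
    """Hypothetical stand-in for a config object such as DataContext."""

    def __init__(self):
        self.target_max_block_size = 128 * 1024 * 1024
        self.target_min_block_size = 1024 * 1024
        self.retry_errors = ("AWS Error SLOW_DOWN", "AWS Error INTERNAL_FAILURE")


def format_config(obj) -> str:
    # vars() returns the instance __dict__; pformat sorts the keys and
    # wraps long values across lines.
    return pprint.pformat(vars(obj))


def log_context(ctx) -> None:
    # Guard the call so the potentially large string is only built
    # when DEBUG logging is actually enabled.
    if logger.isEnabledFor(logging.DEBUG):
        logger.debug("Data Context:\n%s", format_config(ctx))
```

Note that `vars()` only exposes instance attributes, so private fields (names starting with an underscore) show up in the output as well.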
… into print-data-context
goutamvenkat-anyscale left a comment
Just one comment
logger.debug("Execution config: %s", self._options)
# Log the full DataContext for traceability
if logger.isEnabledFor(logging.DEBUG) and log_once(
    f"ray_data_log_context_{self._dataset_id}"
log_once key includes execution index causing repeat logging
Medium Severity
The log_once key uses self._dataset_id, which includes the execution index (_run_index) that increments with each execution. This causes the DataContext to be logged on every execution of the same dataset (e.g., in multiple training epochs), rather than once per dataset as intended per the PR discussion. The PR author stated "Went with log once for each dataset based on id", but the implementation logs once per execution instead.
I think logging once per execution is reasonable. Don't want to bikeshed on this
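To illustrate the behavior discussed in this thread, here is a minimal sketch. The `log_once` below is a simplified local stand-in (a set of seen keys), not Ray's implementation, and the id format `dataset_<n>_<run_index>` is an assumption based on the `dataset_2_0` id that appears in the example log on this page.

```python
_seen_keys = set()


def log_once(key: str) -> bool:
    """Simplified stand-in: return True only the first time a key is seen."""
    if key in _seen_keys:
        return False
    _seen_keys.add(key)
    return True


def should_log_context(dataset_id: str) -> bool:
    # Mirrors the key shape in the diff: one key per distinct dataset_id value.
    return log_once(f"ray_data_log_context_{dataset_id}")


# Assumed id format: the trailing number is the execution (run) index,
# so every execution produces a fresh key and is logged again.
per_execution = [should_log_context(f"dataset_2_{run}") for run in range(3)]

# The same id seen twice: the second call is suppressed.
repeated = [should_log_context("dataset_7_0") for _ in range(2)]
```

Under these assumptions the key deduplicates only within a single execution, which matches the "once per execution" outcome accepted in the reply above.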
## Description
Log `DataContext` in JSON (dictionary) format with `sanitize_for_struct` instead of with `pprint`.

## Related issues
Improves upon #61150.

## Additional information
**Using `sanitize_for_struct` instead of `json.dump` directly because the following fields in `DataContext` are not JSON serializable:**
- `execution_options: ExecutionOptions`, which is a regular class;
- `checkpoint_config: Optional[CheckpointConfig]`, which is a regular class;
- `issue_detectors_config: IssueDetectorsConfiguration`, which contains a list of class type objects: `List[Type[IssueDetector]]`;
- `scheduling_strategy: SchedulingStrategyT`, which is a `Union` involving regular classes;
- `custom_execution_callback_classes: List[Type["ExecutionCallback"]]`, which is a list of class types.

Example log (ellipses are for readability of this PR; the log will contain the full DataContext in actual execution):
DEBUG streaming_executor.py:197 -- Data Context for dataset dataset_2_0: {'target_max_block_size': 134217728, 'target_min_block_size': 1048576, 'streaming_read_buffer_size': 33554432, ...}

---------

Signed-off-by: Sirui Huang <ray.huang@anyscale.com>
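A rough illustration of why `json.dumps` fails on fields like those listed in the commit message. `to_jsonable` is a hypothetical helper written for this example; it is not `sanitize_for_struct` and makes no claim about how that function behaves. `Options` and `Callback` are likewise made-up stand-ins.

```python
import json


class Options:
    """Hypothetical regular class, standing in for a nested config object."""

    def __init__(self):
        self.verbose = False


class Callback:
    """Hypothetical class that is referenced only as a type."""


def to_jsonable(value):
    # Recursively convert values json.dumps cannot handle into plain types.
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    if isinstance(value, dict):
        return {str(k): to_jsonable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple, set)):
        return [to_jsonable(v) for v in value]
    if isinstance(value, type):
        return value.__name__  # class objects become their names
    if hasattr(value, "__dict__"):
        return to_jsonable(vars(value))  # regular instances become dicts
    return repr(value)  # last resort for anything else


config = {"execution_options": Options(), "callbacks": [Callback], "size": 1024}

# json.dumps raises TypeError on the Options instance and the class object.
try:
    json.dumps(config)
    raw_ok = True
except TypeError:
    raw_ok = False

sanitized = json.dumps(to_jsonable(config), sort_keys=True)
```

The type check for class objects comes before the `__dict__` check because classes also carry a `__dict__`, and expanding it would dump their internals rather than a readable name.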


Description
Print out DataContext at the beginning of each dataset execution to make the entire execution configuration traceable.

Additional information
DataContext contains the originally logged ExecutionOptions.

Example log (ellipses are for readability of this PR; the log will contain the full DataContext in actual execution):
DEBUG streaming_executor.py:191 -- Data Context:
{'_checkpoint_config': None,
'_enable_actor_pool_on_exit_hook': False,
'_execution_idx': 0,
'_kv_configs': {},
'_max_num_blocks_in_streaming_gen_buffer': 2,
'_shuffle_strategy': <ShuffleStrategy.HASH_SHUFFLE: 'hash_shuffle'>,
'_task_pool_data_task_remote_args': {},
...
'write_file_retry_on_errors': ('AWS Error INTERNAL_FAILURE',
'AWS Error NETWORK_CONNECTION',
'AWS Error SLOW_DOWN',
'AWS Error UNKNOWN (HTTP status 503)')}