[Data][LLM] Add should_continue_on_error for graceful error handling in batch inference#59212
Conversation
…h inference Add `continue_on_error` parameter to vLLM batch processor config. When enabled, inference failures yield error rows instead of crashing the job. Scoped to Ray Data LLM batch inference only; no changes to Ray Data core. Addresses: ray-project#52449 Related: ray-project#52457 When running LLM batch inference at scale, a single bad row (e.g., prompt exceeding max_model_len) can crash the whole batch. Add optional `continue_on_error` parameter to processor config. The parameter defaults to False, preserving existing fail-fast behavior. When set to True: - Catch exceptions in vLLMEngineStageUDF - Failed rows yield with `__inference_error__` column set to error message - Successful rows have `__inference_error__: None` - Error rows bypass postprocess (avoids crashes when expected fields missing) - Job completes with mixed success/failure outputs - Users filter downstream: ds.filter(lambda r: r["__inference_error__"] is None) 1. **Default behavior unchanged**: `continue_on_error=False` preserves existing fail-fast semantics. This is opt-in only. 2. **Error rows bypass postprocess**: User's postprocess function likely expects `generated_text` and other output fields. Error rows won't have these, so we skip postprocess to avoid secondary crashes. 3. **Error as Optional[str]**: The `__inference_error__` column is None on success, or contains the error message (with type) on failure. This provides debuggability while keeping schema simple. 4. **LLM operator only**: Per feedback, this is scoped to the LLM processor implementation. No changes to Ray Data core primitives. - **Silent vs visible failures**: Choose visible failures (error column) over silent dropping for observability. - **Schema addition**: All outputs now include `__inference_error__` column. This is necessary for users to distinguish success from failure and debug. - **No retry mechanism**: Retrying and auto-tuning is outside the scope of this PR. --- - `python/ray/data/llm.py` - Document new parameter in public API - `python/ray/llm/_internal/batch/processor/base.py` - Add `continue_on_error` to config - `python/ray/llm/_internal/batch/processor/vllm_engine_proc.py` - Pass config to stage - `python/ray/llm/_internal/batch/stages/base.py` - Skip postprocess for error rows - `python/ray/llm/_internal/batch/stages/vllm_engine_stage.py` - Catch errors and yield error rows ```python from ray.data.llm import vLLMEngineProcessorConfig, build_llm_processor config = vLLMEngineProcessorConfig( model_source="meta-llama/Meta-Llama-3.1-8B-Instruct", continue_on_error=True, # Enable graceful error handling ) processor = build_llm_processor( config, preprocess=lambda row: dict( messages=[{"role": "user", "content": row["prompt"]}], sampling_params=dict(temperature=0.3, max_tokens=100), ), postprocess=lambda row: dict( response=row["generated_text"], ), ) ds = ray.data.read_json("prompts.json") result = processor(ds) successful = result.filter(lambda r: r["__inference_error__"] is None) successful.write_json("outputs/") failed = result.filter(lambda r: r["__inference_error__"] is not None) print(f"Failed: {failed.count()} rows") failed.show(5) ``` `python/ray/llm/tests/batch/cpu/stages/test_stage_base.py`: - `test_wrap_postprocess_bypasses_error_rows` - `test_wrap_postprocess_success_rows_run_postprocess` `python/ray/llm/tests/batch/gpu/stages/test_vllm_engine_stage.py`: - `test_vllm_udf_default_raises_on_error` - `test_vllm_udf_continue_on_error_yields_error_row` - `test_vllm_udf_mixed_success_and_error` Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a continue_on_error parameter for graceful error handling during batch inference, which is a great addition for robustness at scale. The implementation correctly isolates the error handling within the vLLM engine stage and provides a mechanism to flag failed rows using the __inference_error__ column.
My review focuses on improving the debuggability of failed requests and ensuring schema consistency in the output. Specifically, I've suggested:
- Preserving more context in error rows to make it easier to identify the problematic input.
- Ensuring the
__inference_error__column is present in all output rows (both success and failure) for a consistent schema, as described in the PR description. - Including request parameters in the error output for more complete debugging information.
Overall, the changes are well-structured and the addition of tests is thorough. Addressing these points will make the feature even more user-friendly and robust.
- Distinguish fatal (EngineDeadError) vs recoverable errors; fatal errors always propagate even when continue_on_error=True - Error rows bypass downstream stage UDFs to prevent crashes when expected fields (e.g., generated_tokens) are missing - Include original prompt in error rows for debuggability - Add __inference_error__ column to success rows when continue_on_error=True for consistent output schema; no schema change when False (backwards compatible) - Add tests for fatal error propagation, error row bypass, and schema consistency Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a continue_on_error flag for graceful error handling in vLLM batch inference, which is an excellent addition for improving the robustness of large-scale inference jobs. The implementation is well-structured, creating error rows with an __inference_error__ column instead of crashing. My review identified a high-severity issue where debugging information added to error rows is later stripped out during post-processing. I've provided a suggestion to fix this, along with corresponding updates to the tests to ensure the debugging information is preserved as intended.
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces a continue_on_error parameter for vLLM batch inference in Ray Data. This allows inference jobs to continue even if some rows fail, by yielding error rows with an __inference_error__ column instead of crashing. The changes are well-structured, with error handling logic localized to the vLLM stage and a generic mechanism for bypassing error rows in subsequent stages. The implementation correctly distinguishes between recoverable and fatal vLLM errors. The changes are also well-covered by new unit tests. My main feedback is a minor suggestion to replace a magic number with a constant for better code maintainability.
| if len(prompt) > 500: | ||
| prompt = prompt[:500] + "...[truncated]" |
kouroshHakha
left a comment
There was a problem hiding this comment.
LGTM. just some minor comments
| if len(prompt) > 500: | ||
| prompt = prompt[:500] + "...[truncated]" |
| # Assign the index of the row in the batch to the idx_in_batch_column. | ||
| # This is beacuse the UDF output may be out-of-order (if asyncio.as_completed | ||
| # is used interanlly for example), and we need to carry over unused input | ||
| # This is because the UDF output may be out-of-order (if asyncio.as_completed |
There was a problem hiding this comment.
Bug: Validation fails for error rows before bypass logic
The validate_inputs call at line 171 runs on all rows before error rows are separated (lines 181-190). Error rows from vLLMEngineStage lack required keys like generated_tokens that downstream stages such as DetokenizeStage expect. This causes validation to fail with "Required input keys not found" before error rows can be bypassed. The tests avoid this by passing expected_input_keys=None, but in production the DetokenizeStage has required keys. The validation needs to skip or be called after error row separation.
Additional Locations (1)
| # to avoid bloating the output. | ||
| prompt = row.get("prompt", "") | ||
| if len(prompt) > 500: | ||
| prompt = prompt[:500] + "...[truncated]" |
There was a problem hiding this comment.
Bug: Error handler crashes if prompt is None
The error handler at line 619 uses row.get("prompt", "") to retrieve the prompt for debugging. However, Python's dict.get() only returns the default when the key is missing - if the key exists with value None, it returns None. Then len(prompt) at line 620 raises TypeError: object of type 'NoneType' has no len(). This crashes inside the except Exception block, causing the job to fail despite continue_on_error=True, which defeats the purpose of graceful error handling.
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
kouroshHakha
left a comment
There was a problem hiding this comment.
LGTM just a few non-blockers?
| """ | ||
| super().__init__(data_column, expected_input_keys) | ||
| self.model = model | ||
| self.continue_on_error = continue_on_error |
There was a problem hiding this comment.
can we renamed this parameter to should _continue_on_error globally?
| "time_taken_llm": time_taken_llm, | ||
| "params": str(request.params), | ||
| } | ||
| output = await resp |
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
| # Include snippet of failed prompt | ||
| prompt = row.get("prompt", "") | ||
| if len(prompt) > _MAX_PROMPT_LENGTH_IN_ERROR: | ||
| prompt = prompt[:_MAX_PROMPT_LENGTH_IN_ERROR] + "...[truncated]" |
There was a problem hiding this comment.
Bug: Prompt lost in error rows after early pop
The error handling in _generate_with_error_handling tries to capture the prompt via row.get("prompt", "") on line 629. However, generate_async internally calls _prepare_llm_request which pops the prompt from the row (prompt = row.pop("prompt") on line 326) before the actual LLM call. When an error occurs after the prompt is popped (e.g., during vLLM engine processing), the error row will have an empty string for prompt instead of the actual prompt content, defeating the purpose of including it for debugging. The prompt needs to be captured before calling generate_async.
nrghosh
left a comment
There was a problem hiding this comment.
@kouroshHakha addressed comments, thanks
will add support for serve handles <> data in a separate PR
cc @richardliaw
Update test_vllm_udf_fatal_error_always_raises to verify that fatal errors (EngineDeadError) now trigger ray.actor.exit_actor() for recovery instead of simply re-raising. The original intent (PR ray-project#59212) was that fatal errors should NOT be swallowed by should_continue_on_error. This is preserved - fatal errors still don't yield error rows. The change is that instead of re-raising (which caused infinite retry loops on the same broken actor), we now exit the actor to enable recovery. Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Description
Add
should_continue_on_errorparameter to vLLM batch processor config. Whenenabled, even nonfatal inference failures yield error rows instead of crashing the job.
Scoped to Ray Data LLM batch inference only; no changes to Ray Data core.
Addresses: #52449
Related: #52457
Problem
When running LLM batch inference at scale, a single bad row (e.g., prompt
exceeding max_model_len) can crash the whole batch.
vLLM AsyncLLM engine distinguishes two types of errors: EngineGenerateError and EngineDeadError.
Solution
Add optional
should_continue_on_errorparameter to processor config. Theparameter defaults to False, preserving existing fail-fast behavior.
When set to True:
__inference_error__column set to error message__inference_error__: Noneds.filter(lambda r: r["__inference_error__"] is None)Design
Default behavior unchanged:
should_continue_on_error=Falsepreservesexisting fail-fast semantics. This is opt-in only.
Error rows bypass postprocess: User's postprocess function likely
expects
generated_textand other output fields. Error rows won't havethese, so we skip postprocess to avoid secondary crashes.
Error as Optional[str]: The
__inference_error__column is None onsuccess, or contains the error message (with type) on failure. This
provides debuggability while keeping schema simple.
LLM operator only: Per feedback, this is scoped to the LLM processor
implementation. No changes to Ray Data core primitives.
Questions
Tradeoffs
Files Changed
python/ray/data/llm.py- Document new parameter in public APIpython/ray/llm/_internal/batch/processor/base.py- Addshould_continue_on_errorto configpython/ray/llm/_internal/batch/processor/vllm_engine_proc.py- Pass config to stagepython/ray/llm/_internal/batch/stages/base.py- Skip postprocess for error rowspython/ray/llm/_internal/batch/stages/vllm_engine_stage.py- Catch errors and yield error rowsExample Usage
Tests
python/ray/llm/tests/batch/cpu/stages/test_stage_base.py:test_wrap_postprocess_bypasses_error_rowstest_wrap_postprocess_success_rows_run_postprocesspython/ray/llm/tests/batch/gpu/stages/test_vllm_engine_stage.py:test_vllm_udf_default_raises_on_errortest_vllm_udf_should_continue_on_error_yields_error_rowtest_vllm_udf_mixed_success_and_error