
[Data] Gate unsafe deserialization in WebDataset default decoder #63469

Merged
bveeramani merged 4 commits into master from data-2359-webdataset-unsafe-pickle on May 19, 2026

@bveeramani (Member) commented May 18, 2026

Description

The default decoder in read_webdataset runs pickle.loads on .pkl/.pickle files and torch.load(weights_only=False) on .pt/.pth files from attacker-controlled TAR archives, enabling arbitrary code execution with no opt-in. This is the same class of bug as GHSA-mw35-8rx3-xf9r (patched in 2.55.0 for Parquet), but in a different code path that was not addressed.

This PR gates the unsafe .pkl/.pickle and .pt/.pth branches in _default_decoder behind the RAY_DATA_WEBDATASET_ALLOW_UNSAFE_DESERIALIZATION=1 environment variable. The error message points users to the existing decoder parameter on ray.data.read_webdataset() as the safe escape hatch for custom deserialization.
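
A minimal sketch of the gate described above, not the actual Ray code. The helper names `unsafe_deserialization_allowed` and `check_extension`, the `ValueError` exception type, and the error wording are assumptions for illustration; only the environment variable name and the four extensions come from this PR.

```python
import os

# The variable name comes from the PR description; everything else is hypothetical.
ENV_VAR = "RAY_DATA_WEBDATASET_ALLOW_UNSAFE_DESERIALIZATION"
UNSAFE_EXTENSIONS = {"pkl", "pickle", "pt", "pth"}


def unsafe_deserialization_allowed(environ=None):
    """Return True only when the opt-in variable is set to exactly "1"."""
    environ = os.environ if environ is None else environ
    return environ.get(ENV_VAR) == "1"


def check_extension(key, allow_unsafe):
    """Raise if `key` names an unsafe format and the caller has not opted in."""
    extension = key.rsplit(".", 1)[-1].lower()
    if extension in UNSAFE_EXTENSIONS and not allow_unsafe:
        # Assumed exception type and message; the real error may differ.
        raise ValueError(
            f"Refusing to deserialize '{key}': .{extension} files can execute "
            f"arbitrary code. Set {ENV_VAR}=1 to opt in, or pass a custom "
            "`decoder` to ray.data.read_webdataset()."
        )
```

Safe extensions pass through untouched, so only samples that would reach `pickle.loads` or `torch.load(weights_only=False)` are affected by the gate.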

Related issues

Fixes GHSA-hhrp-gw25-jr43

Additional information

  • Existing tests that rely on .pt round-trip encoding/decoding now use a monkeypatch.setenv fixture to opt in
  • New tests verify: rejection of all four unsafe extensions by default, opt-in via env var, and bypass via custom decoder callable
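
For readers who need pickled fields without opting in globally, a custom decoder could look roughly like the sketch below. `NoGlobalsUnpickler` and `restricted_decoder` are hypothetical names, and the sketch assumes that a decoder callable receives one sample dict keyed by extension without a leading dot (`pkl`, not `.pkl`) and returns the decoded dict. Forbidding every global lookup limits the unpickler to plain built-in containers and scalars; treat it as a way to reduce the attack surface, not as a guarantee.

```python
import io
import json
import pickle


class NoGlobalsUnpickler(pickle.Unpickler):
    """Unpickler that refuses to import any class or function."""

    def find_class(self, module, name):
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_decoder(sample):
    """Hypothetical custom decoder: decode only the fields it understands."""
    decoded = dict(sample)
    for key, value in sample.items():
        # Leave metadata fields (such as __key__) and non-bytes values alone.
        if key.startswith("__") or not isinstance(value, bytes):
            continue
        extension = key.rsplit(".", 1)[-1]
        if extension == "json":
            decoded[key] = json.loads(value)
        elif extension in ("pkl", "pickle"):
            # Plain dicts, lists, strings, and numbers load; anything that
            # needs an imported callable raises UnpicklingError.
            decoded[key] = NoGlobalsUnpickler(io.BytesIO(value)).load()
    return decoded
```

Such a function would be passed as the `decoder` argument of `ray.data.read_webdataset()`, which replaces the default decoder and therefore the gated branches.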
…nd env var

The default decoder in `read_webdataset` runs `pickle.loads` on `.pkl/.pickle`
files and `torch.load(weights_only=False)` on `.pt/.pth` files from
attacker-controlled TAR archives, enabling arbitrary code execution. Gate
these branches behind `RAY_DATA_WEBDATASET_ALLOW_UNSAFE_DESERIALIZATION=1`
and point users to the existing `decoder` parameter as the safe escape hatch.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani requested a review from a team as a code owner May 18, 2026 20:39
@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit cc6a31b.

Comment thread python/ray/data/tests/datasource/test_webdataset.py
@bveeramani bveeramani enabled auto-merge (squash) May 18, 2026 21:34
@github-actions (bot) added the "go" label (add ONLY when ready to merge, run all tests) May 18, 2026
_allow_unsafe_deserialization() checked os.environ on workers, which
never saw monkeypatch.setenv from the driver. Instead, read the env
var in WebDatasetDatasource.__init__ (driver-side), store the result
as self._allow_unsafe_deserialization, and pass it to _default_decoder
via functools.partial. The datasource is pickled to workers, so the
flag propagates automatically.

Also fix custom decoder test key match (pkl, not .pkl).

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
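
The driver-side approach in the commit message above can be sketched as follows. `SketchDatasource` and `decode_sample` are hypothetical stand-ins for `WebDatasetDatasource` and `_default_decoder`; the sketch only shows the flag being captured once at construction and bound with `functools.partial`, so the decoder never consults `os.environ` in the process where it runs.

```python
import functools
import os

ENV_VAR = "RAY_DATA_WEBDATASET_ALLOW_UNSAFE_DESERIALIZATION"


def decode_sample(sample, allow_unsafe_deserialization=False):
    """Stand-in decoder: rejects unsafe keys unless the bound flag allows them."""
    for key in sample:
        if key in ("pkl", "pickle", "pt", "pth") and not allow_unsafe_deserialization:
            raise ValueError(f"unsafe extension '{key}' is disabled")
    return sample


class SketchDatasource:
    """Hypothetical stand-in for WebDatasetDatasource."""

    def __init__(self):
        # Read once, in the process that constructs the datasource (the driver).
        self._allow_unsafe_deserialization = os.environ.get(ENV_VAR) == "1"
        # Bind the flag so the decoder carries it wherever the object is sent.
        self.decoder = functools.partial(
            decode_sample,
            allow_unsafe_deserialization=self._allow_unsafe_deserialization,
        )
```

Because the flag lives on the object rather than in the environment, a worker that unpickles the datasource sees the driver's decision even if its own environment never had the variable set.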
@github-actions github-actions Bot disabled auto-merge May 19, 2026 00:49
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
… messages

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani enabled auto-merge (squash) May 19, 2026 01:11
@bveeramani bveeramani merged commit 41443a1 into master May 19, 2026
7 checks passed
@bveeramani bveeramani deleted the data-2359-webdataset-unsafe-pickle branch May 19, 2026 01:37
@ray-gardener (bot) added the "data" label (Ray Data-related issues) May 19, 2026
TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…-project#63469)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
