[runtime_env] Support .tar.gz archives for remote working_dir URIs #62813

Merged
edoakes merged 6 commits into ray-project:master from ankushbbbr:support-tar-gz-working-dir
May 17, 2026

@ankushbbbr

Contributor

Summary

  • Extends working_dir (and py_modules) remote URI support to accept .tar.gz and .tgz archives in addition to .zip
  • Adds untar_package with path traversal protection (skips symlinks, validates resolved paths stay within target)
  • Updates parse_uri to preserve compound extensions (.tar.gz, .tar.bz2) so that local directory naming and suffix detection work correctly
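
To illustrate the third bullet, here is a minimal sketch of compound-extension handling. It is not the actual `parse_uri` code: the helper name `split_archive_suffix` and the `COMPOUND_SUFFIXES` tuple are hypothetical. The point is that `pathlib.Path.suffix` alone reports only `.gz` for `pkg.tar.gz`, so directory naming based on it would be wrong.

```python
from pathlib import Path
from typing import Tuple

# Hypothetical list for illustration; the PR names .tar.gz and .tar.bz2.
COMPOUND_SUFFIXES = (".tar.gz", ".tar.bz2")


def split_archive_suffix(package_name: str) -> Tuple[str, str]:
    """Split a file name into (stem, suffix), keeping compound suffixes whole."""
    lowered = package_name.lower()
    for compound in COMPOUND_SUFFIXES:
        if lowered.endswith(compound):
            cut = len(package_name) - len(compound)
            return package_name[:cut], package_name[cut:]
    # Single extensions (.zip, .tgz, .whl) are handled by pathlib directly.
    path = Path(package_name)
    return path.stem, path.suffix
```

With this helper, `split_archive_suffix("pkg.tar.gz")` yields `("pkg", ".tar.gz")`, while plain `Path("pkg.tar.gz").suffix` yields `".gz"`.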

Why is this change needed?

Many CI/build systems (Bazel, pip, conda) produce .tar.gz archives as their primary artifact format. The previous .zip-only restriction forced users to add a costly conversion step (download tar.gz → repackage as zip → re-upload), adding latency, storage overhead, and complexity — especially in KubeRay/RayJob workflows.

Closes #62811

Changes

| File | Change |
|------|--------|
| packaging.py | Add import tarfile, is_tar_gz_uri, untar_package, get_top_level_dir_from_tar_package; update parse_uri for compound extensions; update download_and_unpack_package to handle tar; fix get_local_dir_from_uri and delete_package for double extensions |
| working_dir.py | Accept .tar.gz/.tgz in remote URI validation and local archive detection |
| py_modules.py | Accept .tar.gz/.tgz in remote URI validation |
| validation.py | Accept .tar.gz/.tgz in generic URI validation |
| protocol.py | Update comment to reflect new supported formats |
| Tests | Add unit tests for untar_package, parse_uri with tar.gz, is_tar_gz_uri, get_local_dir_from_uri with tar.gz, download_and_unpack_package with file:// tar.gz URI, path traversal protection; update error message assertions |

Test plan

  • test_parse_uri_tar_gz — verifies compound extension preservation
  • test_is_tar_gz_uri — verifies URI detection
  • test_get_local_dir_from_uri_tar_gz — verifies directory naming
  • test_untar_package_without_top_level_dir — basic extraction
  • test_untar_package_with_top_level_dir — top-level directory stripping
  • test_untar_package_path_traversal — security: blocks ../ attacks
  • test_get_top_level_dir_from_tar_package — top-level detection
  • test_download_and_unpack_package_with_file_uri_tar_gz — end-to-end with file:// protocol
  • Updated validation tests pass with new error messages and .tar.gz/.tgz as valid inputs
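
The path traversal behaviour that `test_untar_package_path_traversal` exercises can be sketched roughly as below. This is an illustration under stated assumptions, not the `untar_package` implementation from the PR: the function name `safe_untar` is hypothetical, and the policy (silently skip links, non-regular members, and anything resolving outside the target) is one reading of "skips symlinks, validates resolved paths stay within target".

```python
import os
import tarfile
from typing import List


def safe_untar(package_path: str, target_dir: str) -> List[str]:
    """Extract a tar archive, skipping links and members that escape target_dir.

    Returns the names of the members that were extracted.
    """
    target_root = os.path.realpath(target_dir)
    os.makedirs(target_root, exist_ok=True)
    extracted = []
    with tarfile.open(package_path, "r:*") as tar:
        for member in tar.getmembers():
            # Links can point outside target_dir, so they are skipped.
            if member.issym() or member.islnk():
                continue
            # Only regular files and directories are extracted.
            if not (member.isfile() or member.isdir()):
                continue
            # Resolve the final location and require it to stay under the root.
            destination = os.path.realpath(os.path.join(target_root, member.name))
            if os.path.commonpath([target_root, destination]) != target_root:
                continue
            tar.extract(member, target_root)
            extracted.append(member.name)
    return extracted
```

A member named `../evil.txt` resolves to a path outside `target_root` and is skipped, as is an absolute name such as `/etc/passwd`.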

Made with Cursor

@ankushbbbr ankushbbbr requested a review from a team as a code owner April 21, 2026 03:19
Comment thread python/ray/_private/runtime_env/working_dir.py
@ankushbbbr ankushbbbr force-pushed the support-tar-gz-working-dir branch from 5e7e2f0 to 42b18b7 Compare April 21, 2026 03:27
Comment thread python/ray/_private/runtime_env/packaging.py
Comment thread python/ray/_private/runtime_env/packaging.py
Comment thread python/ray/_private/runtime_env/packaging.py
@ray-gardener ray-gardener Bot added the core and community-contribution labels Apr 21, 2026
@github-actions

github-actions Bot commented May 5, 2026

This pull request has been automatically marked as stale because it has not had
any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions Bot added the stale label May 5, 2026
@edoakes edoakes added the go label May 5, 2026

@edoakes edoakes left a comment

Collaborator

The code changes largely LGTM. I triggered the premerge CI build to run tests.

Can you please update the docstring to indicate that zip files and tar archives are
supported? https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html

@ankushbbbr

Contributor Author

> The code changes largely LGTM. I triggered the premerge CI build to run tests.
>
> Can you please update the docstring to indicate that zip files and tar archives are supported? https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html

Thanks, updated

Comment thread python/ray/_private/runtime_env/packaging.py
@ankushbbbr ankushbbbr force-pushed the support-tar-gz-working-dir branch 2 times, most recently from 1b8d094 to 090313f Compare May 5, 2026 17:51
@edoakes

edoakes commented May 5, 2026

Collaborator

Some relevant python tests and linter are failing: https://buildkite.com/ray-project/premerge/builds/65878

@github-actions github-actions Bot added the unstale label and removed the stale label May 6, 2026
@ankushbbbr ankushbbbr force-pushed the support-tar-gz-working-dir branch from e5878a5 to 6617d7e Compare May 11, 2026 17:25
@ankushbbbr

Contributor Author

> Some relevant python tests and linter are failing: https://buildkite.com/ray-project/premerge/builds/65878

@edoakes Fixed it. Can I trigger those tests myself? They didn't seem to run in the default CI stages

Comment thread python/ray/_private/runtime_env/packaging.py
@ankushbbbr

Contributor Author

> @ankushbbbr they are running in the premerge test pipeline: https://buildkite.com/ray-project/premerge/builds/66196#019e1813-794c-47c2-8b6d-22949d21a92b/L1836
>
> The linter is now failing: https://buildkite.com/ray-project/premerge/builds/66196#019e1819-fea4-47c4-9074-f1f89112b176/L316
>
> You can run it locally following these instructions: https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting

Thanks. Fixed all lint errors and verified with ruff check locally.

@ankushbbbr ankushbbbr force-pushed the support-tar-gz-working-dir branch 4 times, most recently from 9531422 to ddb5f19 Compare May 15, 2026 23:58

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 458cfbec6b4d41b28eb4bdbfe69b8d25afc32e41. Configure here.

    if remove_top_level_directory:
        top_level_directory = get_top_level_dir_from_tar_package(package_path)
        if top_level_directory is not None:
            remove_dir_from_filepaths(target_dir, top_level_directory)

Tar archive fully decompressed twice during extraction

Low Severity

untar_package fully reads and decompresses the tar archive twice when remove_top_level_directory is True: once during the extraction loop via tar.getmembers(), and again in get_top_level_dir_from_tar_package which re-opens the file and calls tar.getmembers() a second time. Unlike zip files where the central directory is a quick O(1) seek, tar requires a full sequential scan and decompression of the entire archive. For large .tar.gz working directories this doubles the I/O and CPU cost. The top-level directory could be determined during the initial extraction pass instead.
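
The single-pass idea suggested above could look roughly like this. It is a sketch, not code from the PR: `TopLevelTracker` and `scan_once` are hypothetical names, and the rule used here (one shared first path component, no regular file at the archive root) is an assumption about what "top-level directory" means. Extraction and safety checks are left out so the sketch stays focused on the bookkeeping.

```python
import tarfile
from typing import Optional


class TopLevelTracker:
    """Tracks whether all observed members share one top-level directory."""

    def __init__(self) -> None:
        self._first_components = set()
        self._root_level_file = False

    def observe(self, member: tarfile.TarInfo) -> None:
        name = member.name
        # GNU tar often writes "./" prefixes; strip them before splitting.
        while name.startswith("./"):
            name = name[2:]
        name = name.strip("/")
        if name in ("", "."):
            return
        parts = name.split("/")
        self._first_components.add(parts[0])
        if len(parts) == 1 and not member.isdir():
            self._root_level_file = True

    def result(self) -> Optional[str]:
        if self._root_level_file or len(self._first_components) != 1:
            return None
        return next(iter(self._first_components))


def scan_once(package_path: str) -> Optional[str]:
    """Walk the archive a single time and report its top-level directory."""
    tracker = TopLevelTracker()
    with tarfile.open(package_path, "r:*") as tar:
        for member in tar:  # one sequential pass over the archive
            # A real implementation would validate and extract `member` here.
            tracker.observe(member)
    return tracker.result()
```

Because the tracker is updated inside the same loop that would perform extraction, the archive does not need to be reopened and decompressed a second time.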


Many CI/build systems (Bazel, pip, conda) produce .tar.gz archives as
their primary artifact format. The previous .zip-only restriction forced
users to add costly conversion steps. This extends remote URI support to
accept .tar.gz and .tgz archives in addition to .zip.

Changes:
- Update parse_uri to preserve compound extensions (.tar.gz, .tar.bz2)
- Add untar_package with path traversal protection
- Update validation in working_dir, py_modules, and validation modules
- Update get_local_dir_from_uri and delete_package for compound extensions
- Add comprehensive tests for tar.gz support

Closes ray-project#62811

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Made-with: Cursor
Committed-By-Agent: cursor
ababbar-stripe and others added 5 commits May 16, 2026 12:21
get_uri_for_package unconditionally hardcoded .zip for GCS URIs, so
local .tar.gz/.tgz working_dir archives were uploaded under a .zip URI.
Workers then called unzip_package on tar content, crashing with
BadZipFile. Preserve the original archive extension and add tar.gz
handling to the GCS download path in download_and_unpack_package.

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Made-with: Cursor
Committed-By-Agent: cursor
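
The failure mode in the commit above can be reproduced in isolation. The sketch below is illustrative only: `list_archive_members` is a hypothetical helper, not a function in `packaging.py`. It picks the archive reader from the file extension, so a tar payload stored under a `.zip` name fails in the zip reader.

```python
import io
import tarfile
import zipfile
from typing import List


def list_archive_members(data: bytes, package_name: str) -> List[str]:
    """List member names, choosing the reader from the file extension."""
    lowered = package_name.lower()
    if lowered.endswith((".tar.gz", ".tgz")):
        with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
            return tar.getnames()
    if lowered.endswith(".zip"):
        # Raises zipfile.BadZipFile if the payload is not actually a zip.
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            return zf.namelist()
    raise ValueError(f"Unsupported archive extension: {package_name}")
```

Passing gzip-compressed tar bytes with a name ending in `.zip` raises `zipfile.BadZipFile`, which matches the crash described in the commit message; keeping the original extension in the URI lets the reader choice match the content.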
…members

GNU tar commonly prefixes archive members with "./" (e.g.,
./mydir/file.txt). The split("/")[0] check returned "." as the
top-level directory, causing remove_dir_from_filepaths(target_dir, ".")
to move the entire target directory into a temp dir and destroy the
extracted contents. Normalize member names by stripping leading "./"
and skipping bare "." entries.

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Made-with: Cursor
Committed-By-Agent: cursor
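
A small sketch of the normalization described in the commit above. The helper names `normalize_member_name` and `first_components` are hypothetical and the code is not taken from the PR; it only shows why `split("/")[0]` returns `"."` for `./`-prefixed names and how stripping the prefix avoids that.

```python
from typing import Iterable, Set


def normalize_member_name(name: str) -> str:
    """Strip leading "./" segments; return "" for a bare "." entry."""
    while name.startswith("./"):
        name = name[2:]
    if name == ".":
        return ""
    return name.strip("/")


def first_components(names: Iterable[str]) -> Set[str]:
    """Collect the first path component of every non-empty normalized name."""
    components = set()
    for raw in names:
        name = normalize_member_name(raw)
        if name:
            components.add(name.split("/")[0])
    return components


# The naive check treats "." as the top-level directory.
naive = "./mydir/file.txt".split("/")[0]
# After normalization the real directory name is found.
fixed = normalize_member_name("./mydir/file.txt").split("/")[0]
```

Here `naive` is `"."` and `fixed` is `"mydir"`.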
Update RuntimeEnv class docstring and handling-dependencies.rst to
document that working_dir and py_modules now accept .tar.gz and .tgz
archives in addition to .zip for remote URIs.

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Committed-By-Agent: cursor
…ri message

The test expected "Only .zip, .tar.gz, and .tgz files supported..."
but validate_uri() (which parse_and_validate_working_dir calls) raises
"Only .zip, .whl, .tar.gz, and .tgz files supported...". Update the
test regex to match the actual error message.

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Made-with: Cursor
Committed-By-Agent: cursor
Signed-off-by: Ankush Babbar <ankushbbbr@gmail.com>
Made-with: Cursor
Committed-By-Agent: cursor
@ankushbbbr ankushbbbr force-pushed the support-tar-gz-working-dir branch from 458cfbe to b3aa6a9 Compare May 16, 2026 19:21
@edoakes edoakes merged commit c6e7001 into ray-project:master May 17, 2026
6 checks passed
@ankushbbbr

Contributor Author

@edoakes Thanks for reviewing & merging my PR! QQ: When will this get released? We need to integrate these changes in our Ray platform at Stripe for tar remote working_dir support

@edoakes

edoakes commented May 18, 2026

Collaborator

@ankushbbbr we are targeting a release ~next week. If you want to test earlier, you can use the nightly wheels: https://docs.ray.io/en/latest/ray-overview/installation.html#daily-releases-nightlies

TruongQuangPhat pushed a commit to cyhapun/ray-fix-issue that referenced this pull request May 27, 2026
…ay-project#62813)


Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Signed-off-by: Ankush Babbar <ankushbbbr@gmail.com>
Co-authored-by: Ankush Babbar <ababbar@stripe.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
jpatra72 pushed a commit to jpatra72/ray that referenced this pull request Jun 25, 2026
Replace the duplicated .tar.gz / .tar.bz2 detection inside the length
check with `compound_ext or Path(package_name).suffix`. The compound
extension is already extracted earlier in the function (added in ray-project#62813),
so reusing it removes the duplication and means any new compound type
added upstream is picked up automatically.

Signed-off-by: jpatra72 <jyotirmaya@anyscale.com>
Labels

community-contribution, core, go, unstale

3 participants