[runtime_env] Support .tar.gz archives for remote working_dir URIs#62813
5e7e2f0 to 42b18b7
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public Slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
edoakes left a comment:
The code changes largely LGTM. I triggered the premerge CI build to run tests.
Can you please update the docstring to indicate that zip files and tar archives are supported? https://docs.ray.io/en/latest/ray-core/api/doc/ray.runtime_env.RuntimeEnv.html
Thanks, updated.
1b8d094 to 090313f
Some relevant Python tests and the linter are failing: https://buildkite.com/ray-project/premerge/builds/65878
e5878a5 to 6617d7e
@edoakes Fixed it. Can I trigger those tests myself? They didn't seem to run in the default CI stages.
@ankushbbbr they are running in the premerge test pipeline: https://buildkite.com/ray-project/premerge/builds/66196#019e1813-794c-47c2-8b6d-22949d21a92b/L1836
The linter is now failing: https://buildkite.com/ray-project/premerge/builds/66196#019e1819-fea4-47c4-9074-f1f89112b176/L316
You can run it locally following these instructions: https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting
Thanks. Fixed all lint errors and verified.
9531422 to ddb5f19
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 458cfbec6b4d41b28eb4bdbfe69b8d25afc32e41. Configure here.
    if remove_top_level_directory:
        top_level_directory = get_top_level_dir_from_tar_package(package_path)
        if top_level_directory is not None:
            remove_dir_from_filepaths(target_dir, top_level_directory)
Tar archive fully decompressed twice during extraction
Low Severity
untar_package fully reads and decompresses the tar archive twice when remove_top_level_directory is True: once during the extraction loop via tar.getmembers(), and again in get_top_level_dir_from_tar_package which re-opens the file and calls tar.getmembers() a second time. Unlike zip files where the central directory is a quick O(1) seek, tar requires a full sequential scan and decompression of the entire archive. For large .tar.gz working directories this doubles the I/O and CPU cost. The top-level directory could be determined during the initial extraction pass instead.
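The suggestion above could look roughly like the following sketch. It is not the PR's `untar_package`: the function name `extract_tar_single_pass` is hypothetical, path-safety checks are left out for brevity, and the rule used for "single top-level directory" is an assumption. The point is only that the first path component of each member can be recorded while the archive is iterated once for extraction.

```python
import tarfile
from typing import Optional


def extract_tar_single_pass(package_path: str, target_dir: str) -> Optional[str]:
    """Extract a tar archive and return its single top-level directory, if any.

    Hypothetical helper: members are visited once, in archive order, and the
    top-level directory is derived from the same loop that extracts them.
    Path-safety checks are intentionally omitted from this sketch.
    """
    # Use the "data" extraction filter where this Python version supports it.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    first_components = set()
    has_root_file = False
    with tarfile.open(package_path, "r:*") as tar:
        for member in tar:  # lazy, sequential iteration over the archive
            parts = [p for p in member.name.split("/") if p not in ("", ".")]
            if not parts:
                continue  # bare "." entry
            first_components.add(parts[0])
            if len(parts) == 1 and not member.isdir():
                has_root_file = True  # a file at the root: no wrapper directory
            tar.extract(member, path=target_dir, **extract_kwargs)
    if len(first_components) == 1 and not has_root_file:
        return next(iter(first_components))
    return None
```

The returned name could then be handed to the existing `remove_dir_from_filepaths` step without re-opening the archive.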
Many CI/build systems (Bazel, pip, conda) produce .tar.gz archives as their primary artifact format. The previous .zip-only restriction forced users to add costly conversion steps.

This extends remote URI support to accept .tar.gz and .tgz archives in addition to .zip.

Changes:
- Update parse_uri to preserve compound extensions (.tar.gz, .tar.bz2)
- Add untar_package with path traversal protection
- Update validation in working_dir, py_modules, and validation modules
- Update get_local_dir_from_uri and delete_package for compound extensions
- Add comprehensive tests for tar.gz support

Closes ray-project#62811

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Made-with: Cursor
Committed-By-Agent: cursor
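For readers unfamiliar with path traversal protection during extraction, here is a minimal sketch of the general technique. It is not the PR's `untar_package`: the name `safe_untar` is hypothetical, and silently skipping unsafe members (rather than raising) is an assumption made for the example.

```python
import os
import tarfile
from typing import List


def safe_untar(package_path: str, target_dir: str) -> List[str]:
    """Extract only regular files and directories that stay inside target_dir.

    Hypothetical helper: returns the names of the members that were extracted.
    """
    # Use the "data" extraction filter where this Python version supports it.
    extract_kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
    target_root = os.path.realpath(target_dir)
    extracted = []
    with tarfile.open(package_path, "r:*") as tar:
        for member in tar:
            # Skip symlinks, hard links, devices, and FIFOs.
            if not (member.isfile() or member.isdir()):
                continue
            # Resolve the destination and require it to stay under target_root.
            dest = os.path.realpath(os.path.join(target_root, member.name))
            if os.path.commonpath([target_root, dest]) != target_root:
                continue  # e.g. "../evil.txt" would land outside target_dir
            tar.extract(member, path=target_root, **extract_kwargs)
            extracted.append(member.name)
    return extracted
```

Comparing resolved paths with `os.path.commonpath` avoids the false positives of a plain string-prefix check (for example `/tmp/out` versus `/tmp/out2`).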
get_uri_for_package unconditionally hardcoded .zip for GCS URIs, so local .tar.gz/.tgz working_dir archives were uploaded under a .zip URI. Workers then called unzip_package on tar content, crashing with BadZipFile.

Preserve the original archive extension and add tar.gz handling to the GCS download path in download_and_unpack_package.

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Made-with: Cursor
Committed-By-Agent: cursor
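The failure mode described in this commit can be illustrated with a small sketch. The helpers `archive_extension`, `make_package_uri`, and `unpacker_for` are hypothetical, and the `gcs://_ray_pkg_<hash><ext>` shape is used here only for illustration; the actual `get_uri_for_package` signature is not shown in this thread.

```python
SUPPORTED_EXTENSIONS = (".tar.gz", ".tgz", ".zip")  # assumed set for this sketch


def archive_extension(path: str) -> str:
    """Return the archive extension of a path, keeping compound ones whole."""
    lowered = path.lower()
    for ext in SUPPORTED_EXTENSIONS:
        if lowered.endswith(ext):
            return ext
    raise ValueError(f"Unsupported archive type: {path}")


def make_package_uri(content_hash: str, local_path: str) -> str:
    """Build an upload URI that keeps the original extension (not always .zip)."""
    return f"gcs://_ray_pkg_{content_hash}{archive_extension(local_path)}"


def unpacker_for(uri: str) -> str:
    """Choose the unpacking routine from the URI suffix."""
    return "untar" if uri.lower().endswith((".tar.gz", ".tgz")) else "unzip"
```

With the extension preserved, `unpacker_for(make_package_uri("abc123", "/tmp/wd.tar.gz"))` selects the tar path, whereas a hardcoded `.zip` suffix would have routed tar content to the zip unpacker.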
…members
GNU tar commonly prefixes archive members with "./" (e.g.,
./mydir/file.txt). The split("/")[0] check returned "." as the
top-level directory, causing remove_dir_from_filepaths(target_dir, ".")
to move the entire target directory into a temp dir and destroy the
extracted contents. Normalize member names by stripping leading "./"
and skipping bare "." entries.
Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Made-with: Cursor
Committed-By-Agent: cursor
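A minimal sketch of the normalization described in this commit message, working on member names only. The function names are hypothetical, and the rule that a top-level directory must contain at least one nested entry is an assumption of the sketch, not something stated in the commit.

```python
from typing import Iterable, Optional


def normalize_member_name(name: str) -> str:
    """Strip leading "./" prefixes; map bare "." and "./" entries to ""."""
    while name.startswith("./"):
        name = name[2:]
    return "" if name == "." else name.strip("/")


def top_level_dir(member_names: Iterable[str]) -> Optional[str]:
    """Return the single top-level directory shared by all members, else None."""
    first_components = set()
    has_nested_entry = False
    for raw_name in member_names:
        name = normalize_member_name(raw_name)
        if not name:
            continue  # skip bare "." entries
        first, sep, _rest = name.partition("/")
        first_components.add(first)
        has_nested_entry = has_nested_entry or bool(sep)
    if len(first_components) == 1 and has_nested_entry:
        return next(iter(first_components))
    return None
```

For comparison, `"./mydir/file.txt".split("/")[0]` evaluates to `"."`, which is the value the unnormalized check passed on to `remove_dir_from_filepaths`.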
Update RuntimeEnv class docstring and handling-dependencies.rst to document that working_dir and py_modules now accept .tar.gz and .tgz archives in addition to .zip for remote URIs.

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Committed-By-Agent: cursor

…ri message

The test expected "Only .zip, .tar.gz, and .tgz files supported..." but validate_uri() (which parse_and_validate_working_dir calls) raises "Only .zip, .whl, .tar.gz, and .tgz files supported...". Update the test regex to match the actual error message.

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Made-with: Cursor
Committed-By-Agent: cursor

Signed-off-by: Ankush Babbar <ankushbbbr@gmail.com>
Made-with: Cursor
Committed-By-Agent: cursor
458cfbe to b3aa6a9
@edoakes Thanks for reviewing & merging my PR! QQ: When will this get released? We need to integrate these changes in our Ray platform at Stripe for tar remote working_dir support.
@ankushbbbr we are targeting a release ~next week. If you want to test earlier, you can use the nightly wheels: https://docs.ray.io/en/latest/ray-overview/installation.html#daily-releases-nightlies
…ay-project#62813)

## Summary

- Extends `working_dir` (and `py_modules`) remote URI support to accept `.tar.gz` and `.tgz` archives in addition to `.zip`
- Adds `untar_package` with path traversal protection (skips symlinks, validates resolved paths stay within target)
- Updates `parse_uri` to preserve compound extensions (`.tar.gz`, `.tar.bz2`) so that local directory naming and suffix detection work correctly

## Why is this change needed?

Many CI/build systems (Bazel, pip, conda) produce `.tar.gz` archives as their primary artifact format. The previous `.zip`-only restriction forced users to add a costly conversion step (download tar.gz → repackage as zip → re-upload), adding latency, storage overhead, and complexity, especially in KubeRay/RayJob workflows.

Closes ray-project#62811

## Changes

| File | Change |
|------|--------|
| `packaging.py` | Add `import tarfile`, `is_tar_gz_uri`, `untar_package`, `get_top_level_dir_from_tar_package`; update `parse_uri` for compound extensions; update `download_and_unpack_package` to handle tar; fix `get_local_dir_from_uri` and `delete_package` for double extensions |
| `working_dir.py` | Accept `.tar.gz`/`.tgz` in remote URI validation and local archive detection |
| `py_modules.py` | Accept `.tar.gz`/`.tgz` in remote URI validation |
| `validation.py` | Accept `.tar.gz`/`.tgz` in generic URI validation |
| `protocol.py` | Update comment to reflect new supported formats |
| Tests | Add unit tests for `untar_package`, `parse_uri` with tar.gz, `is_tar_gz_uri`, `get_local_dir_from_uri` with tar.gz, `download_and_unpack_package` with `file://` tar.gz URI, path traversal protection; update error message assertions |

## Test plan

- [x] `test_parse_uri_tar_gz`: verifies compound extension preservation
- [x] `test_is_tar_gz_uri`: verifies URI detection
- [x] `test_get_local_dir_from_uri_tar_gz`: verifies directory naming
- [x] `test_untar_package_without_top_level_dir`: basic extraction
- [x] `test_untar_package_with_top_level_dir`: top-level directory stripping
- [x] `test_untar_package_path_traversal`: security, blocks `../` attacks
- [x] `test_get_top_level_dir_from_tar_package`: top-level detection
- [x] `test_download_and_unpack_package_with_file_uri_tar_gz`: end-to-end with `file://` protocol
- [x] Updated validation tests pass with new error messages and `.tar.gz`/`.tgz` as valid inputs

Made with [Cursor](https://cursor.com)

---------

Signed-off-by: Ankush Babbar <ababbar@stripe.com>
Signed-off-by: Ankush Babbar <ankushbbbr@gmail.com>
Co-authored-by: Ankush Babbar <ababbar@stripe.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Signed-off-by: phattruong <23120318@student.hcmus.edu.vn>
Replace the duplicated .tar.gz / .tar.bz2 detection inside the length check with `compound_ext or Path(package_name).suffix`. The compound extension is already extracted earlier in the function (added in ray-project#62813), so reusing it removes the duplication and means any new compound type added upstream is picked up automatically.

Signed-off-by: jpatra72 <jyotirmaya@anyscale.com>
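The `compound_ext or Path(package_name).suffix` idiom from this follow-up commit can be sketched in isolation. The surrounding function in `packaging.py` is not shown in this thread, so `package_extension` and the `COMPOUND_EXTENSIONS` tuple below are hypothetical stand-ins.

```python
from pathlib import Path

# Assumed list for this sketch; the PR names .tar.gz and .tar.bz2.
COMPOUND_EXTENSIONS = (".tar.gz", ".tar.bz2")


def package_extension(package_name: str) -> str:
    """Return the full extension, preferring a known compound extension."""
    # Empty string when no compound extension matches.
    compound_ext = next(
        (ext for ext in COMPOUND_EXTENSIONS if package_name.endswith(ext)), ""
    )
    # Path.suffix alone only sees the last component, e.g. ".gz" for "pkg.tar.gz".
    return compound_ext or Path(package_name).suffix
```

Because the fallback only runs when `compound_ext` is empty, adding another entry to the compound list changes the result here without touching the fallback.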


Summary

- Extends `working_dir` (and `py_modules`) remote URI support to accept `.tar.gz` and `.tgz` archives in addition to `.zip`
- Adds `untar_package` with path traversal protection (skips symlinks, validates resolved paths stay within target)
- Updates `parse_uri` to preserve compound extensions (`.tar.gz`, `.tar.bz2`) so that local directory naming and suffix detection work correctly

Why is this change needed?

Many CI/build systems (Bazel, pip, conda) produce `.tar.gz` archives as their primary artifact format. The previous `.zip`-only restriction forced users to add a costly conversion step (download tar.gz → repackage as zip → re-upload), adding latency, storage overhead, and complexity, especially in KubeRay/RayJob workflows.

Closes #62811

Changes

| File | Change |
|------|--------|
| `packaging.py` | Add `import tarfile`, `is_tar_gz_uri`, `untar_package`, `get_top_level_dir_from_tar_package`; update `parse_uri` for compound extensions; update `download_and_unpack_package` to handle tar; fix `get_local_dir_from_uri` and `delete_package` for double extensions |
| `working_dir.py` | Accept `.tar.gz`/`.tgz` in remote URI validation and local archive detection |
| `py_modules.py` | Accept `.tar.gz`/`.tgz` in remote URI validation |
| `validation.py` | Accept `.tar.gz`/`.tgz` in generic URI validation |
| `protocol.py` | Update comment to reflect new supported formats |
| Tests | Add unit tests for `untar_package`, `parse_uri` with tar.gz, `is_tar_gz_uri`, `get_local_dir_from_uri` with tar.gz, `download_and_unpack_package` with `file://` tar.gz URI, path traversal protection; update error message assertions |

Test plan

- `test_parse_uri_tar_gz`: verifies compound extension preservation
- `test_is_tar_gz_uri`: verifies URI detection
- `test_get_local_dir_from_uri_tar_gz`: verifies directory naming
- `test_untar_package_without_top_level_dir`: basic extraction
- `test_untar_package_with_top_level_dir`: top-level directory stripping
- `test_untar_package_path_traversal`: security, blocks `../` attacks
- `test_get_top_level_dir_from_tar_package`: top-level detection
- `test_download_and_unpack_package_with_file_uri_tar_gz`: end-to-end with `file://` protocol
- Updated validation tests pass with new error messages and `.tar.gz`/`.tgz` as valid inputs

Made with Cursor