[train][Docs] Document S3-compatible storage #63103
Pull request overview
This PR improves Ray Train’s S3-compatible storage experience (notably Backblaze B2) by ensuring credentials provided via Backblaze’s CLI env var names are made visible to pyarrow’s S3 resolver, and by expanding the docs with a B2-focused example.
Changes:
- Add `_alias_s3_compatible_credentials_to_aws_env_vars()` and call it from `get_fs_and_path()` to map B2 env vars onto `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` when appropriate.
- Add unit tests covering aliasing behavior, no-op behavior, and warnings.
- Update Train persistent-storage docs with a Backblaze B2 example, endpoint override guidance, and a link to an end-to-end notebook.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| python/ray/train/_internal/storage.py | Adds credential env var aliasing logic and invokes it during filesystem resolution. |
| python/ray/train/tests/test_storage.py | Adds tests validating the new env var aliasing behavior and logging. |
| doc/source/train/user-guides/persistent-storage.rst | Updates S3-compatible storage docs to include Backblaze B2 guidance and a runnable example link. |
Thanks for the PR @goanpeca. I've used Claude to improve the documentation, so don't worry about where that random commit came from. I just have a single question about the implementation.
@@ -294,6 +297,35 @@ def _create_directory(fs: pyarrow.fs.FileSystem, fs_path: str) -> None:
     )


 def _alias_s3_compatible_credentials_to_aws_env_vars() -> None:
Could you provide more information about why this function is necessary? Shouldn't this be done on the user side rather than behind the scenes?
Good question! It is not strictly necessary: the functional path is just AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY plus an endpoint override, which the docs now cover.
The helper is really just a convenience for people already on Backblaze B2. B2's docs and CLI use B2_APPLICATION_KEY_ID / B2_APPLICATION_KEY, so if a user has those exported and points Ray at an s3:// path, pyarrow silently ignores them (it only reads the AWS_ names) and they hit a confusing auth error. The alias saves them re-exporting the same secret, and it is structured so other providers can be added later.
That said, I am happy to drop it and keep this docs-only if you would rather credentials stay explicit on the user side. Just let me know! 😄
Thanks for the answer @goanpeca. Yes, could we remove the function from the code and add a note to the documentation describing the changes that users need to make for access?
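The user-side setup discussed in this thread could look roughly like the sketch below. It is only an illustration, not the helper that was removed from the PR: `export_b2_credentials_as_aws` and `B2_TO_AWS` are hypothetical names, and the sketch assumes the AWS-named variables are all that pyarrow reads, as described above. It copies the Backblaze B2 values onto the AWS names without overwriting anything the user already set.

```python
import os
from typing import List, MutableMapping

# Hypothetical mapping from the Backblaze B2 CLI variable names to the
# AWS-style names mentioned in this thread.
B2_TO_AWS = {
    "B2_APPLICATION_KEY_ID": "AWS_ACCESS_KEY_ID",
    "B2_APPLICATION_KEY": "AWS_SECRET_ACCESS_KEY",
}


def export_b2_credentials_as_aws(
    env: MutableMapping[str, str] = os.environ,
) -> List[str]:
    """Copy B2 credentials onto the AWS names; return the names that were set.

    Existing AWS_* values are never overwritten, so explicit user
    configuration always wins.
    """
    copied = []
    for b2_name, aws_name in B2_TO_AWS.items():
        value = env.get(b2_name)
        if value and aws_name not in env:
            env[aws_name] = value
            copied.append(aws_name)
    return copied
```

A user would call this once in their own script before pointing Ray Train at an `s3://` path. The same effect can be had by exporting the two `AWS_*` variables in the shell. Either way the variables presumably need to be visible in every process that resolves the filesystem.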
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 5c29b68791fd328f75db0ae016a4063ccffa38cd.
Signed-off-by: Gonzalo Peña-Castellanos <goanpeca@gmail.com>
pseudo-rnd-thoughts left a comment
Thanks for making the changes
## Why are these changes needed?

Ray Train already works with any S3-compatible object store through pyarrow's `S3FileSystem` (via `endpoint_override` in the `storage_path` URI, or the standard `AWS_*` environment variables). This PR documents that path in the Train persistent-storage guide and adds the Backblaze B2 specifics.

**Docs-only, no code changes.** (An earlier revision added an env-var aliasing helper; per review feedback it was removed in favor of documenting the setup users perform themselves.)

Changes to `doc/source/train/user-guides/persistent-storage.rst`:

- Retitles the section to "S3-compatible storage (Backblaze B2, MinIO, etc.)".
- Shows the `endpoint_override` query-parameter form for Backblaze B2 and MinIO (local).
- Notes that the standard AWS environment variables (`AWS_ENDPOINT_URL_S3`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`) work with a plain `s3://bucket/path`.
- Documents that Backblaze B2 publishes credentials as `B2_APPLICATION_KEY_ID` / `B2_APPLICATION_KEY`; users set `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` to those values, since pyarrow reads only the AWS-named variables.
- Links a complete end-to-end Backblaze B2 notebook example.

## Related issue number

Related to ray-project#63104

## Checks

- [x] Change is contained to `doc/source/train/user-guides/persistent-storage.rst`.
- [x] No code paths changed; existing tests unaffected.

Signed-off-by: Gonzalo Peña-Castellanos <goanpeca@gmail.com>
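As a rough illustration of the `endpoint_override` query-parameter form mentioned above, the sketch below builds such a `storage_path` string. The helper name `s3_compatible_storage_path`, the bucket and prefix names, and the example endpoints (a local MinIO at `localhost:9000` and a Backblaze B2 regional host) are assumptions for illustration and are not taken from the docs change itself. It also assumes the endpoint is given as a bare host, with the protocol passed through a separate `scheme` query parameter.

```python
from urllib.parse import urlencode


def s3_compatible_storage_path(
    bucket: str, prefix: str, endpoint: str, scheme: str = "https"
) -> str:
    """Build an s3:// URI carrying endpoint_override and scheme as query parameters.

    `endpoint` is a bare host, optionally with a port (no protocol prefix).
    """
    # Keep ":" unescaped so a "host:port" endpoint stays readable in the URI.
    query = urlencode({"endpoint_override": endpoint, "scheme": scheme}, safe=":")
    return f"s3://{bucket}/{prefix.strip('/')}?{query}"


# Placeholder values; substitute your own bucket, prefix, and endpoint.
minio_path = s3_compatible_storage_path(
    "my-bucket", "ray-results", "localhost:9000", scheme="http"
)
b2_path = s3_compatible_storage_path(
    "my-bucket", "ray-results", "s3.us-west-004.backblazeb2.com"
)
```

The resulting string would then be passed as `storage_path` in the run configuration. With the standard `AWS_*` environment variables set instead, a plain `s3://bucket/path` is enough and no query parameters are needed.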