[doc][DOC-908] exclude Jupyter notebooks from llms-full.txt #63228
Merged
elliot-barn merged 1 commit on May 8, 2026
sphinx-llms-txt reads each docname's source verbatim from `_sources/`, so for `.ipynb` pages it appends raw notebook JSON (cells, outputs, embedded base64 images) into the corpus.

`llms_txt_exclude` matches docnames (extension stripped) via fnmatch, so a `**/*.ipynb` pattern can't work. Enumerate notebook docnames at conf-load time and append them.

Notebooks remain fully rendered in the HTML build; only the agent corpus drops them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
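The matching problem described above can be reproduced with the standard library alone. This is a minimal sketch, not code from the PR: the docname is a made-up example, and it assumes only what the description states, namely that patterns are compared against extension-stripped docnames with `fnmatch`.

```python
from fnmatch import fnmatch

# Hypothetical docname for illustration: a source-relative path with the
# ".ipynb" suffix already stripped, which is what the pattern is compared to.
docname = "train/examples/demo_notebook"

# An extension-based pattern cannot match, because the docname has no suffix.
extension_match = fnmatch(docname, "**/*.ipynb")

# Patterns written against the docname itself do match.
exact_match = fnmatch(docname, "train/examples/demo_notebook")
prefix_match = fnmatch(docname, "train/examples/*")

print(extension_match, exact_match, prefix_match)  # False True True
```

Because nothing in the docname marks a page as a notebook, the exclusion list has to be built from the files on disk rather than from a glob on the extension.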
Code Review
This pull request updates doc/source/conf.py to dynamically exclude all Jupyter notebooks from the llms-full.txt corpus. This is achieved by scanning the documentation directory for .ipynb files and adding their relative docnames to the llms_txt_exclude list, preventing raw JSON content from being included in the LLM agent corpus. I have no feedback to provide.
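The enumeration summarized above might look roughly like the following. This is a hedged sketch rather than the merged code: `notebook_docnames`, the temporary directory layout, and the pre-existing list entry are all hypothetical, and the real `conf.py` may structure this differently.

```python
import tempfile
from pathlib import Path


def notebook_docnames(source_dir: Path) -> list[str]:
    """Return the docname of every .ipynb file under source_dir.

    A docname here is the path relative to the source directory, with the
    suffix stripped and POSIX separators. (Hypothetical helper name.)
    """
    return sorted(
        path.relative_to(source_dir).with_suffix("").as_posix()
        for path in source_dir.rglob("*.ipynb")
    )


# Demonstrate on a throwaway tree instead of a real doc/source/ checkout.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "serve" / "tutorials").mkdir(parents=True)
    (src / "serve" / "tutorials" / "intro.ipynb").write_text("{}")
    (src / "index.rst").write_text("Title\n")  # non-notebook, must be ignored
    demo_docnames = notebook_docnames(src)

# Stand-in for the existing list in conf.py; the first entry is made up.
llms_txt_exclude = ["hypothetical/existing-entry"]
llms_txt_exclude.extend(demo_docnames)

print(llms_txt_exclude)
# ['hypothetical/existing-entry', 'serve/tutorials/intro']
```

In a real `conf.py` the source directory would presumably be derived from the location of `conf.py` itself, so the list is recomputed on every build and new notebooks are excluded without further edits.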
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request on May 15, 2026
…ect#63228)

## Why

After ray-project#63130 shipped, the generated `llms-full.txt` corpus is polluted by raw Jupyter notebook source. `sphinx-llms-txt` reads each docname's source from `_sources/` verbatim, so for the 127 `.ipynb` pages under `doc/source/` it appends the full notebook JSON (cells, outputs, embedded base64 images, metadata) into the file. That's the largest source of low-signal bytes in the corpus targeted at agents.

## What

Append computed notebook docnames to the existing `llms_txt_exclude` list in `doc/source/conf.py`.

`llms_txt_exclude` matches docnames (extension stripped) via `fnmatch.fnmatch`; see [`sphinx_llms_txt/collector.py`](https://github.com/jdillard/sphinx-llms-txt/blob/main/sphinx_llms_txt/collector.py). A pattern such as `**/*.ipynb` can't match because the docname carries no extension. The change enumerates `*.ipynb` files under the source directory at conf-load time and converts each path to its docname (relative to the source dir, suffix stripped, posix separators).

Scope:

- Affects only `llms.txt` / `llms-full.txt`. The Sphinx HTML build is governed by the separate `exclude_patterns` list (line 351 of `conf.py`), which is untouched. All 127 notebooks remain fully rendered on `docs.ray.io`.
- Notebooks already in `exclude_patterns` (e.g. `serve/tutorials/video-analysis/*.ipynb`) aren't built, so adding their docnames to `llms_txt_exclude` is a harmless no-op.

## Verification

After RtD builds the PR preview:

- Fetch `llms-full.txt` from the PR build and confirm no `"cell_type": "code"` / `"output_type": "stream"` / base64 image data appears in the corpus.
- Confirm `llms.txt` (the summary index) still resolves and looks sane.
- Spot-check that a couple of notebook pages still render normally in the HTML preview.

## Context

Tracked under [DOC-908]. Follow-up to ray-project#63130 (DOC-875). Part of [DOC-844] (Agent Ray docs umbrella).

Tuning this list further (other low-signal page types) is deferred until the rebuilt corpus can be inspected directly.

Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
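The first verification step could be scripted along these lines. This is a rough sketch under stated assumptions: `find_notebook_residue` is a hypothetical helper, the sample strings stand in for a downloaded `llms-full.txt`, and the base64 check is a heuristic with an arbitrary 200-character threshold rather than a real decoder.

```python
import re

# JSON fragments named in the verification steps as signs of raw notebook source.
NOTEBOOK_MARKERS = ('"cell_type": "code"', '"output_type": "stream"')

# Heuristic only: a long unbroken run of base64 alphabet characters.
# The 200-character threshold is an arbitrary assumption.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{200,}={0,2}")


def find_notebook_residue(corpus: str) -> list[str]:
    """Return a label for each kind of notebook residue found in the text."""
    findings = [marker for marker in NOTEBOOK_MARKERS if marker in corpus]
    if BASE64_RUN.search(corpus):
        findings.append("base64-like run")
    return findings


# Stand-ins for the fetched corpus; a real check would read the downloaded file.
clean = "# Ray Serve\n\nDeploy a model with `serve.run`.\n"
polluted = clean + '{"cell_type": "code", "outputs": [{"output_type": "stream"}]}'

clean_findings = find_notebook_residue(clean)
polluted_findings = find_notebook_residue(polluted)

print(clean_findings)     # []
print(polluted_findings)  # ['"cell_type": "code"', '"output_type": "stream"']
```

An empty result on the rebuilt corpus would be consistent with the exclusion working, though it does not replace the manual spot-check of the HTML preview.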