[doc][DOC-908] exclude Jupyter notebooks from llms-full.txt by dstrodtman · Pull Request #63228 · ray-project/ray

dstrodtman · 2026-05-08T13:29:31Z

Why

After #63130 shipped, the generated llms-full.txt corpus is polluted by raw Jupyter notebook source. sphinx-llms-txt reads each docname's source from _sources/ verbatim, so for the 127 .ipynb pages under doc/source/ it appends the full notebook JSON (cells, outputs, embedded base64 images, metadata) into the file. That's the largest source of low-signal bytes in the corpus targeted at agents.

What

Append computed notebook docnames to the existing llms_txt_exclude list in doc/source/conf.py.

llms_txt_exclude matches docnames (extension stripped) via fnmatch.fnmatch — see sphinx_llms_txt/collector.py. A pattern such as **/*.ipynb can't match because the docname carries no extension. The change enumerates *.ipynb files under the source directory at conf-load time and converts each path to its docname (relative to the source dir, suffix stripped, posix separators).
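The enumeration described above can be sketched roughly as follows. This is a minimal illustration and not the code from the PR's diff, which is not reproduced on this page. The helper name `notebook_docnames` and the sample file names are hypothetical.

```python
import fnmatch
import tempfile
from pathlib import Path


def notebook_docnames(source_dir):
    """Return sorted docnames for every .ipynb file under source_dir.

    A docname here is the path relative to source_dir, with the suffix
    stripped and posix separators (hypothetical helper, for illustration).
    """
    root = Path(source_dir)
    return sorted(
        path.relative_to(root).with_suffix("").as_posix()
        for path in root.rglob("*.ipynb")
    )


# Build a throwaway source tree with one notebook and one non-notebook page.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp)
    (src / "train" / "examples").mkdir(parents=True)
    (src / "train" / "examples" / "demo.ipynb").write_text("{}")
    (src / "index.rst").write_text("Index\n=====\n")
    docnames = notebook_docnames(src)

print(docnames)

# An extension-based glob cannot match, because the docname has no suffix.
assert not fnmatch.fnmatch(docnames[0], "**/*.ipynb")
# The literal docname, used as a pattern, matches itself.
assert fnmatch.fnmatch(docnames[0], docnames[0])
```

In a `conf.py`, the result of such a helper would presumably be appended to the existing `llms_txt_exclude` list. Because `fnmatch` treats `*`, `?`, and `[` as wildcards, a docname containing those characters would need escaping before being used as a pattern.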

Scope:

- Affects only llms.txt / llms-full.txt. The Sphinx HTML build is governed by the separate exclude_patterns list (line 351 of conf.py), which is untouched. All 127 notebooks remain fully rendered on docs.ray.io.
- Notebooks already in exclude_patterns (e.g. serve/tutorials/video-analysis/*.ipynb) aren't built, so adding their docnames to llms_txt_exclude is a harmless no-op.

Verification

After RtD builds the PR preview:

- Fetch llms-full.txt from the PR build and confirm no "cell_type": "code" / "output_type": "stream" / base64 image data appears in the corpus.
- Confirm llms.txt (the summary index) still resolves and looks sane.
- Spot-check that a couple of notebook pages still render normally in the HTML preview.
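The first check above could be automated with a small script along these lines. It is a sketch under stated assumptions: the function name `find_notebook_residue` is hypothetical, the two string markers are the ones named in the verification steps, and the base64 pattern is only a heuristic for embedded image payloads. It operates on text already fetched from the preview build.

```python
import re

# Markers named in the verification steps; their presence suggests raw notebook JSON.
NOTEBOOK_MARKERS = ('"cell_type": "code"', '"output_type": "stream"')

# Heuristic (an assumption): an image MIME key followed by a long base64 run.
BASE64_IMAGE = re.compile(r'"image/(?:png|jpeg)":\s*"[A-Za-z0-9+/=]{100,}')


def find_notebook_residue(corpus_text):
    """Return the list of notebook-JSON indicators found in corpus_text."""
    hits = [marker for marker in NOTEBOOK_MARKERS if marker in corpus_text]
    if BASE64_IMAGE.search(corpus_text):
        hits.append("base64 image data")
    return hits


# Small in-memory samples standing in for a downloaded llms-full.txt.
clean_sample = "# Ray docs\nSome prose about tasks and actors.\n"
dirty_sample = '{"cell_type": "code", "outputs": [{"output_type": "stream"}]}'
image_sample = '"image/png": "' + "A" * 120 + '"'

print(find_notebook_residue(clean_sample))
print(find_notebook_residue(dirty_sample))
print(find_notebook_residue(image_sample))
```

An empty list for the real corpus would be consistent with the notebooks having been excluded. A non-empty list points at which kind of residue to look for.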

Context

Tracked under [DOC-908]. Follow-up to #63130 (DOC-875). Part of [DOC-844] (Agent Ray docs umbrella).

Tuning this list further (other low-signal page types) is deferred until the rebuilt corpus can be inspected directly.

sphinx-llms-txt reads each docname's source verbatim from `_sources/`, so for `.ipynb` pages it appends raw notebook JSON (cells, outputs, embedded base64 images) into the corpus.

`llms_txt_exclude` matches docnames (extension stripped) via fnmatch, so a `**/*.ipynb` pattern can't work. Enumerate notebook docnames at conf-load time and append them.

Notebooks remain fully rendered in the HTML build; only the agent corpus drops them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>

gemini-code-assist

Code Review

This pull request updates doc/source/conf.py to dynamically exclude all Jupyter notebooks from the llms-full.txt corpus. This is achieved by scanning the documentation directory for .ipynb files and adding their relative docnames to the llms_txt_exclude list, preventing raw JSON content from being included in the LLM agent corpus. I have no feedback to provide.

ronny-anyscale

after while, 🐊


dstrodtman requested a review from a team as a code owner May 8, 2026 13:29

ray-gardener Bot added the docs (An issue or change related to documentation) and core (Issues that should be addressed in Ray Core) labels May 8, 2026

gemini-code-assist Bot reviewed May 8, 2026


ronny-anyscale approved these changes May 8, 2026


dstrodtman added the go (add ONLY when ready to merge, run all tests) label May 8, 2026

elliot-barn merged commit 1b3661b into ray-project:master May 8, 2026
9 checks passed

elliot-barn merged 1 commit into ray-project:master from dstrodtman:doc-908-exclude-notebooks-llms-full
