
[doc][DOC-908] exclude Jupyter notebooks from llms-full.txt #63228

Merged
elliot-barn merged 1 commit into ray-project:master from dstrodtman:doc-908-exclude-notebooks-llms-full on May 8, 2026

Conversation

@dstrodtman

Contributor

Why

After #63130 shipped, the generated llms-full.txt corpus is polluted by raw Jupyter notebook source. sphinx-llms-txt reads each docname's source from _sources/ verbatim, so for the 127 .ipynb pages under doc/source/ it appends the full notebook JSON (cells, outputs, embedded base64 images, metadata) into the file. That's the largest source of low-signal bytes in the corpus targeted at agents.

What

Append computed notebook docnames to the existing llms_txt_exclude list in doc/source/conf.py.

llms_txt_exclude matches docnames (extension stripped) via fnmatch.fnmatch; see sphinx_llms_txt/collector.py (https://github.com/jdillard/sphinx-llms-txt/blob/main/sphinx_llms_txt/collector.py). A pattern such as **/*.ipynb can't match because the docname carries no extension. Instead, the change enumerates *.ipynb files under the source directory at conf-load time and converts each path to its docname (relative to the source dir, suffix stripped, posix separators).
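For readers who want to see the path-to-docname conversion concretely, here is a minimal sketch of the approach described above. It is not the code from the diff: the helper name `notebook_docnames` and the sample docname are hypothetical, and the sketch makes no attempt to skip build output or checkpoint directories that a real source tree might contain.

```python
from fnmatch import fnmatch
from pathlib import Path


def notebook_docnames(source_dir):
    """Return sorted docnames for every .ipynb file under source_dir.

    A docname here is the path relative to source_dir, with the suffix
    stripped and posix separators (hypothetical helper, for illustration).
    """
    root = Path(source_dir)
    return sorted(
        path.relative_to(root).with_suffix("").as_posix()
        for path in root.rglob("*.ipynb")
    )


# A docname carries no extension, so an extension-based glob cannot match it,
# while the literal docname used as a pattern does.
docname = "serve/tutorials/example"  # hypothetical docname
assert not fnmatch(docname, "**/*.ipynb")
assert fnmatch(docname, docname)

# In conf.py the result would be appended to the existing list, for example:
# llms_txt_exclude += notebook_docnames(Path(__file__).parent)
```

Because fnmatch treats `[`, `*`, and `?` as pattern characters, a docname containing those characters would need escaping before being used as a literal pattern; the sketch assumes none do.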

Scope:

  • Affects only llms.txt / llms-full.txt. The Sphinx HTML build is governed by the separate exclude_patterns list (line 351 of conf.py), which is untouched. All 127 notebooks remain fully rendered on docs.ray.io.
  • Notebooks already in exclude_patterns (e.g. serve/tutorials/video-analysis/*.ipynb) aren't built, so adding their docnames to llms_txt_exclude is a harmless no-op.

Verification

After RtD builds the PR preview:

  • Fetch llms-full.txt from the PR build and confirm no "cell_type": "code" / "output_type": "stream" / base64 image data appears in the corpus.
  • Confirm llms.txt (the summary index) still resolves and looks sane.
  • Spot-check that a couple of notebook pages still render normally in the HTML preview.
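The first check in the list above could be scripted roughly as follows. This is a sketch under assumptions: the function name `find_notebook_residue` and the base64 pattern (an `image/png` or `image/jpeg` key followed by a long base64 run, as notebook JSON typically stores inline images) are illustrative choices, not part of this PR.

```python
import re

# Markers named in the checklist, plus an assumed pattern for inline base64 images.
NOTEBOOK_MARKERS = {
    "code cell": re.compile(r'"cell_type":\s*"code"'),
    "stream output": re.compile(r'"output_type":\s*"stream"'),
    "base64 image": re.compile(r'"image/(?:png|jpeg)":\s*"[A-Za-z0-9+/=]{64,}'),
}


def find_notebook_residue(corpus_text):
    """Return {marker name: hit count} for each marker found in corpus_text."""
    hits = {}
    for name, pattern in NOTEBOOK_MARKERS.items():
        count = len(pattern.findall(corpus_text))
        if count:
            hits[name] = count
    return hits
```

To use it, save the preview build's llms-full.txt locally, read it as text, and pass the contents to `find_notebook_residue`; an empty dict means none of the markers were found.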

Context

Tracked under [DOC-908]. Follow-up to #63130 (DOC-875). Part of [DOC-844] (Agent Ray docs umbrella).

Tuning this list further (other low-signal page types) is deferred until the rebuilt corpus can be inspected directly.

sphinx-llms-txt reads each docname's source verbatim from `_sources/`,
so for `.ipynb` pages it appends raw notebook JSON (cells, outputs,
embedded base64 images) into the corpus. `llms_txt_exclude` matches
docnames (extension stripped) via fnmatch, so a `**/*.ipynb` pattern
can't work. Enumerate notebook docnames at conf-load time and append
them. Notebooks remain fully rendered in the HTML build; only the
agent corpus drops them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Douglas Strodtman <douglas@anyscale.com>
@dstrodtman dstrodtman requested a review from a team as a code owner May 8, 2026 13:29
@ray-gardener ray-gardener Bot added the docs (An issue or change related to documentation) and core (Issues that should be addressed in Ray Core) labels May 8, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request updates doc/source/conf.py to dynamically exclude all Jupyter notebooks from the llms-full.txt corpus. This is achieved by scanning the documentation directory for .ipynb files and adding their relative docnames to the llms_txt_exclude list, preventing raw JSON content from being included in the LLM agent corpus. I have no feedback to provide.

@ronny-anyscale ronny-anyscale left a comment

Contributor

after while, 🐊

@dstrodtman dstrodtman added the go (add ONLY when ready to merge, run all tests) label May 8, 2026
@elliot-barn elliot-barn merged commit 1b3661b into ray-project:master May 8, 2026
9 checks passed
Lucas61000 pushed a commit to Lucas61000/ray that referenced this pull request May 15, 2026
…ect#63228)


Labels

core: Issues that should be addressed in Ray Core
docs: An issue or change related to documentation
go: add ONLY when ready to merge, run all tests

3 participants