feat: idempotent path-keyed indexing + incremental update demo by harshrathod0585 · Pull Request #314 · VectifyAI/PageIndex

harshrathod0585 · 2026-06-02T10:28:33Z

Summary

Indexing was non-idempotent: re-ingesting the same file minted a new UUID and wrote a duplicate <doc_id>.json every time, silently bloating the workspace and orphaning prior summaries.

- index() now resolves a document by its absolute path and reuses the existing doc_id, overwriting in place.
- New get_doc_id_by_path() exposes this lookup so callers can cleanly branch: index() when new, update() when known.
- Adds examples/incremental_update_demo.py demonstrating the index-vs-update flow, a PageIndex-themed sample.md, and an examples/README.md.
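
To make the index-vs-update branch concrete, here is a minimal sketch of path-keyed, idempotent indexing. It is a toy in-memory stand-in, not the PageIndex implementation: `ToyWorkspace`, its attributes, and the `ingest` helper are hypothetical, and only the method names `index` and `get_doc_id_by_path` are borrowed from this PR.

```python
import os
import uuid


class ToyWorkspace:
    """In-memory stand-in for a workspace (hypothetical, for illustration only)."""

    def __init__(self):
        self.docs = {}        # doc_id -> stored record
        self.path_to_id = {}  # absolute path -> doc_id

    def get_doc_id_by_path(self, path):
        # Return the doc_id already associated with this path, or None.
        return self.path_to_id.get(os.path.abspath(path))

    def index(self, path, content):
        # Reuse the existing doc_id when the path is known; mint one otherwise.
        key = os.path.abspath(path)
        doc_id = self.path_to_id.get(key) or str(uuid.uuid4())
        self.path_to_id[key] = doc_id
        self.docs[doc_id] = {"path": key, "content": content}  # overwrite in place
        return doc_id


def ingest(ws, path, content):
    # Branch the way the summary describes: index when new, update when known.
    doc_id = ws.get_doc_id_by_path(path)
    if doc_id is None:
        return "indexed", ws.index(path, content)
    # A real caller would run an incremental update for doc_id here.
    return "known", doc_id


ws = ToyWorkspace()
first = ws.index("sample.md", "# Title\nv1")
second = ws.index("sample.md", "# Title\nv2")  # same path, so same doc_id
```

Because the second call reuses the id, the toy workspace still holds a single record, which is the property the PR is after.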

Note for maintainers

This makes PageIndex more powerful and, with that, potentially more dangerous: because re-indexing now overwrites a document in place by path, an unintended re-ingest can silently replace an existing tree. The path-matching semantics deserve a deliberate look before merge.
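
As one illustration of why the path-matching semantics matter, the sketch below normalizes a path into a lookup key. The `path_key` helper and its `base_dir` parameter are hypothetical and are not taken from this PR; the point is only that different spellings of a path can collapse to the same key, so they would hit the same stored document.

```python
import os


def path_key(path, base_dir):
    # Hypothetical helper: join against an explicit base so the key does not
    # depend on the process's current working directory, then normalize.
    # os.path.abspath collapses "." and ".." segments but does NOT resolve
    # symlinks; os.path.realpath would. normcase only changes anything on
    # case-insensitive platforms such as Windows.
    return os.path.normcase(os.path.abspath(os.path.join(base_dir, path)))


same_a = path_key("docs/a.md", "/srv/ws")
same_b = path_key("docs/../docs/./a.md", "/srv/ws")
other = path_key("docs/b.md", "/srv/ws")
```

Whether symlinked or differently cased paths should count as the same document is a policy choice; this sketch treats symlinks as distinct paths.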

I also have another plan for PageIndex I'd love to build on top of this. Thanks for the consideration! 🙏

Add PageIndexClient.update(doc_id) for MD docs. Detects changed sections via a section-hash diff and re-summarizes only the changed sections plus their ancestors, reusing cached summaries for the rest.

- extract_node_text_content now stamps a hierarchical title_path on each node, giving sections a stable identity across edits.
- utils: hash_text, compute_section_hashes, find_ancestors helpers.
- index() stores file_hash + section_hashes for MD docs so update() has a baseline; _ensure_doc_loaded restores them on demand.
- update() gates on file_hash, then per-section hashes; returns the updated/added/deleted section paths.

Markdown only: its heading structure is parsed deterministically, so the new tree shape is free and the LLM runs only on changed sections.
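
The section-hash diff described in this commit message can be sketched roughly as follows. The helper names `hash_text`, `compute_section_hashes`, and `find_ancestors` come from the commit message, but their bodies and signatures here are assumptions, as are `diff_sections`, `plan_update`, and the choice to represent a title path as a tuple of heading titles. The heading parser is deliberately simplistic: it ignores fenced code blocks and does not disambiguate sibling sections with identical titles.

```python
import hashlib
import re


def hash_text(text):
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compute_section_hashes(markdown):
    """Map each section's title path to a hash of that section's own body text."""
    bodies = {}
    stack = []      # titles of the headings currently open, outermost first
    current = None
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)$", line)
        if m:
            level = len(m.group(1))
            stack = stack[: level - 1] + [m.group(2).strip()]
            current = tuple(stack)
            bodies[current] = []
        elif current is not None:
            bodies[current].append(line)
    return {path: hash_text("\n".join(body).strip()) for path, body in bodies.items()}


def find_ancestors(path):
    # ("Guide", "Setup", "Linux") -> [("Guide",), ("Guide", "Setup")]
    return [path[:i] for i in range(1, len(path))]


def diff_sections(old, new):
    updated = sorted(p for p in new if p in old and new[p] != old[p])
    added = sorted(p for p in new if p not in old)
    deleted = sorted(p for p in old if p not in new)
    # Changed or added sections need a fresh summary, and so do the surviving
    # ancestors of anything that changed, appeared, or disappeared.
    dirty = set(updated) | set(added)
    for p in updated + added + deleted:
        dirty.update(a for a in find_ancestors(p) if a in new)
    return {"updated": updated, "added": added, "deleted": deleted,
            "resummarize": sorted(dirty)}


def plan_update(old_file_hash, old_section_hashes, new_markdown):
    # Gate on the whole-file hash first; only diff sections when it differs.
    if hash_text(new_markdown) == old_file_hash:
        return {"updated": [], "added": [], "deleted": [], "resummarize": []}
    return diff_sections(old_section_hashes, compute_section_hashes(new_markdown))


old_md = "# Guide\nintro\n## Setup\nstep one\n## Usage\nrun it\n"
new_md = "# Guide\nintro\n## Setup\nstep one, revised\n## Usage\nrun it\n## FAQ\nq and a\n"
plan = plan_update(hash_text(old_md), compute_section_hashes(old_md), new_md)
```

In this example the "Usage" section is untouched, so it stays out of `plan["resummarize"]`, while "Guide" is included only because it is the ancestor of a changed and an added section.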

Indexing was non-idempotent: re-ingesting the same file minted a new UUID and wrote a duplicate <doc_id>.json every time, silently bloating the workspace and orphaning prior summaries. index() now resolves a document by its absolute path and reuses the existing doc_id, overwriting in place. New get_doc_id_by_path() exposes this lookup so callers can cleanly branch: index() when new, update() when known. Ships examples/incremental_update_demo.py demonstrating the index-vs-update flow, a PageIndex-themed sample.md, and an examples README.

harshrathod0585 added 2 commits June 2, 2026 15:18

hipvlady mentioned this pull request Jun 16, 2026

Feature Request: Incremental Index Updates for Large Documents #316

Open

harshrathod0585 wants to merge 2 commits into VectifyAI:main from harshrathod0585:feat/incremental-md-update