feat: idempotent path-keyed indexing + incremental update demo #314

Open
harshrathod0585 wants to merge 2 commits into VectifyAI:main from harshrathod0585:feat/incremental-md-update

Conversation

@harshrathod0585

Summary

Indexing was non-idempotent: re-ingesting the same file minted a new UUID and wrote a duplicate <doc_id>.json every time, silently bloating the workspace and orphaning prior summaries.

  • index() now resolves a document by its absolute path and reuses the existing doc_id, overwriting in place.
  • New get_doc_id_by_path() exposes this lookup so callers can cleanly branch: index() when new, update() when known.
  • Adds examples/incremental_update_demo.py demonstrating the index-vs-update flow, a PageIndex-themed sample.md, and an examples/README.md.
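The path-keyed lookup described in the bullets above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `PathKeyedStore`, the `path_registry.json` file name, and the record layout are all hypothetical, and only the names `index()` and `get_doc_id_by_path()` come from the description.

```python
import json
import uuid
from pathlib import Path


class PathKeyedStore:
    """Hypothetical stand-in for the workspace; not the actual PageIndex client."""

    def __init__(self, workspace):
        self.workspace = Path(workspace)
        self.workspace.mkdir(parents=True, exist_ok=True)
        # Assumed registry file mapping absolute path -> doc_id.
        self._registry_file = self.workspace / "path_registry.json"
        if self._registry_file.exists():
            self._by_path = json.loads(self._registry_file.read_text(encoding="utf-8"))
        else:
            self._by_path = {}

    def get_doc_id_by_path(self, path):
        """Return the doc_id bound to this path, or None if it was never indexed."""
        return self._by_path.get(str(Path(path).resolve()))

    def index(self, path):
        """Index a file, reusing the doc_id already bound to its absolute path."""
        key = str(Path(path).resolve())
        # Mint a UUID only for a path that has not been seen before.
        doc_id = self._by_path.get(key) or str(uuid.uuid4())
        self._by_path[key] = doc_id
        record = {
            "doc_id": doc_id,
            "path": key,
            "text": Path(path).read_text(encoding="utf-8"),
        }
        # Same doc_id means the same file name, so a re-ingest overwrites in place.
        (self.workspace / f"{doc_id}.json").write_text(json.dumps(record), encoding="utf-8")
        self._registry_file.write_text(json.dumps(self._by_path), encoding="utf-8")
        return doc_id
```

A caller could then branch on `store.get_doc_id_by_path(path)`: call `index()` when it returns `None`, and `update()` otherwise. Because the key is the resolved absolute path, moving or renaming a file would register it as a new document in this sketch.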

Note for maintainers

This makes PageIndex more powerful and, with that, potentially more dangerous: because re-indexing now overwrites a document in place by path, an unintended re-ingest can silently replace an existing tree. The path-matching semantics deserve a deliberate look before merge.

I also have another plan for PageIndex I'd love to build on top of this. Thanks for the consideration! 🙏

Add PageIndexClient.update(doc_id) for MD docs. Detects changed
sections via a section-hash diff and re-summarizes only the changed
sections plus their ancestors, reusing cached summaries for the rest.

- extract_node_text_content now stamps a hierarchical title_path on
  each node, giving sections a stable identity across edits.
- utils: hash_text, compute_section_hashes, find_ancestors helpers.
- index() stores file_hash + section_hashes for MD docs so update()
  has a baseline; _ensure_doc_loaded restores them on demand.
- update() gates on file_hash, then per-section hashes; returns the
  updated/added/deleted section paths.
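As a rough illustration of the section-hash diff described above, the sketch below reuses the helper names from the commit message (`hash_text`, `compute_section_hashes`, `find_ancestors`), but their signatures here are assumptions, as are `diff_sections` and `sections_to_resummarize`. It also assumes each section is keyed by its `title_path` as a tuple of heading titles, and that a deleted section marks its ancestors for re-summarization.

```python
import hashlib


def hash_text(text):
    """Stable content hash for a section body (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def compute_section_hashes(sections):
    """Map each title_path (tuple of heading titles) to the hash of its own text."""
    return {path: hash_text(text) for path, text in sections.items()}


def find_ancestors(title_path):
    """Return every proper prefix of a title_path, outermost first."""
    return [title_path[:i] for i in range(1, len(title_path))]


def diff_sections(old_hashes, new_hashes):
    """Classify section paths as updated, added, or deleted."""
    updated = sorted(p for p in new_hashes if p in old_hashes and new_hashes[p] != old_hashes[p])
    added = sorted(p for p in new_hashes if p not in old_hashes)
    deleted = sorted(p for p in old_hashes if p not in new_hashes)
    return updated, added, deleted


def sections_to_resummarize(old_hashes, new_hashes):
    """Changed sections plus their surviving ancestors; all others keep cached summaries."""
    updated, added, deleted = diff_sections(old_hashes, new_hashes)
    dirty = set(updated) | set(added)
    for path in updated + added + deleted:
        # Assumption: a parent summary depends on its children, so it goes stale too.
        dirty.update(a for a in find_ancestors(path) if a in new_hashes)
    return dirty
```

Comparing a whole-file hash first, as the `file_hash` gate does, lets an unchanged file skip this per-section work entirely.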

Markdown only: its heading structure is parsed deterministically, so
rebuilding the tree shape needs no LLM calls and the LLM runs only on
changed sections.
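To make the "parsed deterministically" point concrete, here is a minimal sketch of splitting Markdown into sections keyed by a hierarchical title path, using only a regular expression and a stack. The function name `split_markdown_sections` is hypothetical and this is not the parser in the PR; it handles ATX (`#`) headings only, skips fenced code blocks, and would let sections with identical title paths overwrite one another.

```python
import re

# ATX headings only: one to six '#' characters, whitespace, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")


def split_markdown_sections(markdown):
    """Return {title_path: own_text}, where title_path is a tuple of heading titles."""
    bodies = {}
    stack = []  # (level, title) for each currently open heading, outermost first
    current = None
    in_fence = False
    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # '#' lines inside code fences are not headings
        match = None if in_fence else HEADING.match(line)
        if match:
            level = len(match.group(1))
            # Close every open heading at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
            current = tuple(title for _, title in stack)
            bodies[current] = []
        elif current is not None:
            bodies[current].append(line)
    return {path: "\n".join(lines).strip() for path, lines in bodies.items()}
```

No model call is involved at any point, which is why the tree shape can be rebuilt on every update at negligible cost in a scheme like this.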

Indexing was non-idempotent: re-ingesting the same file minted a new
UUID and wrote a duplicate <doc_id>.json every time, silently bloating
the workspace and orphaning prior summaries.

index() now resolves a document by its absolute path and reuses the
existing doc_id, overwriting in place. New get_doc_id_by_path() exposes
this lookup so callers can cleanly branch: index() when new, update()
when known.

Ships examples/incremental_update_demo.py demonstrating the
index-vs-update flow, a PageIndex-themed sample.md, and an examples
README.