Wiki-RAG is a Retrieval-Augmented Generation (RAG) system designed to enhance the process of querying and generating responses based on the Italian Wikipedia. By combining advanced retrieval techniques with state-of-the-art natural language generation, Wiki-RAG provides precise, contextually relevant answers for Italian-language queries.
- Retrieval-Augmented Generation: Combines document retrieval and language generation to produce coherent, informative responses.
- Italian Wikipedia Focus: Tailored for Italian-language content to ensure culturally and contextually accurate results.
- Dual Vector Store: Utilizes two vector stores:
- Document Vector Store: For semantic search over chunked document texts.
- Keyword Vector Store: For title-based retrieval.
- Local LLM Integration: Interfaces with a local language model (e.g.,
meta-llama-3.1-8b-instruct) via LM Studio.
requirements.txt
Lists all dependencies required for the project.app.py
Streamlit interface for interacting with the Wiki-RAG system.prompt.py
Contains prompts for querying.ModelQuery.py
Handles local model queries via LM Studio.VectorStore.py
Manages initialization and population of vector stores for documents and keywords.search.py
Provides search, retrieval, and re-ranking functionalities.wikipedia_dump_processor.py
Contains theWikipediaDumpProcessorclass for dump extraction and processing (uses WikiExtractor).
- Query Input:
Users submit an Italian query via the Streamlit interface. - Information Retrieval:
The system uses two vector stores:- Document Vector Store: Retrieves relevant document chunks.
- Keyword Vector Store: Retrieves articles based on titles.
- Response Generation:
Retrieved content is passed to a local LLM (via LM Studio) to generate a comprehensive answer.
-
Clone the Repository:
git clone git@github.com:DonPalius/Wiki-RAG.git cd Wiki-RAG -
Install Dependencies:
pip install -r requirements.txt
-
Configure Environment Variables:
Create a.envfile in the project root with the following content (update paths as needed):OPENAI_API_KEY=your_openai_api_key DUMP_FILE=Data/itwiki-latest-pages-articles1.xml-p1p316052.bz2 OUTPUT_DIR=Data BASE_DIR=Data
Wiki-RAG is set up to optionally download and process a Wikipedia dump using WikiExtractor. Update these configuration variables in your environment or code (current version) as needed:
- DUMP_URL:
https://dumps.wikimedia.org/itwiki/latest/itwiki-latest-pages-articles1.xml-p1p316052.bz2 - DUMP_FILE:
The local path to your downloaded dump file (e.g.,Data/itwiki-latest-pages-articles1.xml-p1p316052.bz2). - OUTPUT_DIR:
The directory where extracted files and CSV outputs will be stored (e.g.,Data). - BASE_DIR:
The base directory for extracted files (usually the same as OUTPUT_DIR).
The WikipediaDumpProcessor class in wikipedia_dump_processor.py handles:
- Downloading: Checks if the dump exists locally; if not, downloads it.
- Extraction: Uses WikiExtractor to convert the dump into JSON files.
- Parsing and Chunking: Converts JSON files to a Pandas DataFrame, splits article text into smaller chunks, and saves the results in
Wikipedia.csv.
Wiki-RAG uses LM Studio to interface with a local language model. For example, to use the meta-llama-3.1-8b-instruct model:
-
Download and Install LM Studio:
- Visit LM Studio and install the application.
-
Download the Model:
- Within LM Studio, locate and download the
meta-llama-3.1-8b-instructmodel.
- Within LM Studio, locate and download the
-
Start the LM Studio Local Server:
- Launch LM Studio and load the model. The default endpoint is typically:
http://127.0.0.1:1234/v1/chat/completions
- Launch LM Studio and load the model. The default endpoint is typically:
-
Configure the Model Query:
- In
app.py, the local model is initialized as follows:if 'llm' not in st.session_state: st.session_state.llm = ModelQuery("http://127.0.0.1:1234/v1/chat/completions", "meta-llama-3.1-8b-instruct")
- In
Run the Streamlit interface with:
streamlit run app.pyIn the UI, click the "Scarica e carica dump Wikipedia" button to:
- Download and process the Wikipedia dump.
- Parse, chunk, and update both the document and keyword vector stores.
- The updated data is stored in
Wikipedia.csvand loaded into ChromaDB-backed collections.
-
Wikipedia Dump Pipeline v2:
Download the full dump of italian wikipedia, split in multiple vector and load it only after we find a match in the keyword vectorstore. -
Enhanced UI:
Implement a chat history feature for improved user interaction. -
Improved Text Processing:
Further refine article chunking for better retrieval accuracy.
-
VectorStore:
TheVectorStoreclass inVectorStore.pymanages two separate collections:- Document Collection (
wikipedia_docs): Stores individual text chunks. - Keyword Collection (
wikipedia_keywords): Stores article titles with associated chunk IDs.
- Document Collection (
-
Search Functionality:
TheSearchclass (insearch.py) provides methods for semantic retrieval (from document chunks) and keyword-based retrieval.
By following these instructions, you'll have a fully functional Wiki-RAG system that integrates both document and keyword vector stores, and communicates with a local LLM via LM Studio.