NetExtract extracts the core content from webpages and converts it into clean, LLM-friendly text. Built with Express.js, TypeScript, and Puppeteer, it exposes a simple HTTP API for content extraction and transformation, which makes it useful for supplying LLM and RAG systems with up-to-date web information and for API-based web scraping.
- Core Content Extraction: Extracts the main content from a given URL.
- Markdown Conversion: Converts webpage content into clean, well-formatted Markdown.
- Social Media Scraping: Scrapes and formats X (Twitter) posts.
- Simple API Integration: Integrates with existing systems through a URL-based HTTP API.
- LLM-Powered Conversion: Uses open-source large language models to improve the quality of extraction and conversion.
To use NetExtract, prepend the API endpoint to your desired URL:
    http://{your_address}/api?url={url}

To run NetExtract yourself, first clone the repository:

    git clone https://github.com/sabber-slt/NetExtract
    cd NetExtract

Then run the application with Docker:

    docker compose up -d

- Inspired by jina.ai
- Built with Node.js, Express.js, TypeScript, and Puppeteer
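
The usage pattern described earlier (prepending the API endpoint to the target URL) can be sketched in TypeScript as follows. This is a minimal sketch, not part of NetExtract itself: `buildExtractUrl` and `extractPage` are hypothetical helper names, percent-encoding the target URL is an assumption (it keeps the target's own query string from being mixed up with the `url` parameter), and the response is assumed to be a plain text body returned for a GET request.

```typescript
// Hypothetical helper: builds a request URL following the pattern
// http://{your_address}/api?url={url} shown in this README.
// Assumption: the target URL is percent-encoded so that its own query
// string is not confused with the `url` parameter.
function buildExtractUrl(address: string, targetUrl: string): string {
  // Strip trailing slashes so the result never contains "//api".
  const base = address.replace(/\/+$/, "");
  return `${base}/api?url=${encodeURIComponent(targetUrl)}`;
}

// Hypothetical wrapper: requests the extracted content and returns it as text.
// Assumption: the endpoint answers a plain GET request with a text body.
async function extractPage(address: string, targetUrl: string): Promise<string> {
  const response = await fetch(buildExtractUrl(address, targetUrl));
  if (!response.ok) {
    throw new Error(`NetExtract request failed with status ${response.status}`);
  }
  return response.text();
}
```

For example, `extractPage("http://localhost:3000", "https://example.com/article")` would request the extracted text of that article; the address and port here are placeholders for wherever your instance is running.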
    .
    ├── cookie
    │   └── twitter.json      # Twitter cookie for X (Twitter) post scraping
    ├── docs                  # Documentation files
    ├── search                # Searxng engine
    ├── src                   # Source code
    │   ├── interfaces        # TypeScript interfaces
    │   ├── lib               # Utility libraries
    │   ├── routes            # Express route handlers
    │   ├── services          # Core service layer for business logic
    │   ├── utils             # Helper functions and utilities
    │   └── app.ts            # Main application entry point
    ├── .env                  # Environment variables
    ├── .gitignore            # Git ignored files
    ├── .prettierignore       # Prettier ignored files
    ├── .prettierrc.js        # Prettier configuration
    ├── app.log               # Log file
    ├── Dockerfile            # Dockerfile
    ├── docker-compose.yaml   # Docker Compose configuration
    ├── package.json          # Node.js project metadata
    ├── README.md             # Project README
    ├── tsconfig.json         # TypeScript configuration
    └── yarn.lock             # Yarn lockfile for dependency management

I welcome and appreciate contributions! If you'd like to contribute, please feel free to submit issues, fork the repository, and send pull requests.
