Introduction
Extractous is a fast, resource-efficient data extraction tool for building AI, ML, and RAG applications.
Features
- High-performance unstructured data extraction optimized for speed and low memory usage.
- Clear and simple API for extracting text and metadata content.
- Automatically identifies document types and extracts content accordingly.
- Supports many file formats (most formats supported by Apache Tika).
- Extracts text from images and scanned documents with OCR through tesseract-ocr.
- Core engine written in Rust with bindings for Python and upcoming support for JavaScript/TypeScript.
- Detailed documentation and examples to help you get started quickly and efficiently.
- Free for Commercial Use: Apache 2.0 License.
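As a rough illustration of the text and metadata API mentioned above, here is a minimal Python sketch. It assumes the Python binding exposes an `Extractor` class with an `extract_file_to_string` method; the return shape may differ between releases (a plain string, or a `(text, metadata)` pair), so the hypothetical helper `extract_text_and_metadata` accepts both. Check the project documentation for the exact signatures in your installed version.

```python
from typing import Any, Dict, Tuple


def extract_text_and_metadata(extractor: Any, path: str) -> Tuple[str, Dict[str, Any]]:
    """Hypothetical helper: return (text, metadata) for one file.

    Assumes `extractor.extract_file_to_string(path)` returns either a string
    or a (text, metadata) pair, depending on the installed release.
    """
    result = extractor.extract_file_to_string(path)
    if isinstance(result, tuple):
        text, metadata = result
        return text, dict(metadata)
    # Only text was returned, so there is no metadata to report.
    return result, {}


def run_example(path: str) -> None:
    # Imported lazily so the helper above works without the package installed.
    from extractous import Extractor  # assumes `pip install extractous`

    text, metadata = extract_text_and_metadata(Extractor(), path)
    print(text[:200])  # first 200 characters of the extracted text
    print(metadata)
```

Calling `run_example("report.pdf")` (the file name is a placeholder) would print the start of the extracted text followed by whatever metadata the extractor reports.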
Performance
Extractous is fast. When extracting content from SEC10 filing PDFs, Extractous is on average ~18x faster than, for example, unstructured-io. You can run the benchmarks yourself.
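To make a figure like "~18x faster" concrete, the sketch below shows one way such a speedup factor could be measured. It is not the project's benchmark suite: `mean_seconds` and `speedup` are hypothetical helpers, and the extraction functions passed in are whatever callables you choose to compare.

```python
import time
from statistics import mean
from typing import Callable, Iterable, List


def mean_seconds(extract: Callable[[str], object],
                 paths: Iterable[str],
                 repeats: int = 3) -> float:
    """Average wall-clock seconds per call of `extract` over all paths."""
    timings: List[float] = []
    for path in paths:
        for _ in range(repeats):
            start = time.perf_counter()
            extract(path)
            timings.append(time.perf_counter() - start)
    # Raises statistics.StatisticsError if `paths` is empty.
    return mean(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

With a baseline that averages 18 seconds per file and a candidate that averages 1 second, `speedup(18.0, 1.0)` gives 18.0. Results depend heavily on the documents, hardware, and library versions used, so the same files should be run through both tools.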