Introduction
Extractous is a fast, resource-efficient data extraction tool for building great AI, RAG, and ML applications.
Features
- High-performance unstructured data extraction optimized for speed and low memory usage.
- Clear and simple API for extracting text and metadata content.
- Automatically identifies document types and extracts content accordingly.
- Supports many file formats (most formats supported by Apache Tika).
- Extracts text from images and scanned documents with OCR through tesseract-ocr.
- Core engine written in Rust with bindings for Python and upcoming support for JavaScript/TypeScript.
- Detailed documentation and examples to help you get started quickly and efficiently.
- Free for Commercial Use: Apache 2.0 License.
Performance
Extractous is fast. But please don't take our word for it: you can run the benchmarks yourself. When extracting content from sec10 filings (PDF forms), Extractous is on average ~18x faster than, for example, unstructured-io.
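For readers who want a rough idea of how such a comparison can be measured, here is a minimal timing sketch that uses only the Python standard library. It is not the project's benchmark suite. The helper names `mean_seconds` and `speedup` are hypothetical, and the `extract` argument is a placeholder for whatever callable wraps the tool under test.

```python
import statistics
import time
from typing import Callable, Iterable


def mean_seconds(
    extract: Callable[[str], object],
    paths: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the mean wall-clock time per extraction call, in seconds.

    `extract` is a hypothetical placeholder: any callable that takes a
    file path and performs the extraction being measured.
    """
    timings = []
    for path in paths:
        for _ in range(repeats):
            start = time.perf_counter()
            extract(path)
            timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

To compare two tools, call `mean_seconds` once per tool on the same list of files and pass the two results to `speedup`. A result near 18 would correspond to the "~18x faster" figure quoted above, although real numbers depend on the hardware, the files, and the versions used.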