Introduction

Extractous is a fast and resource-efficient data extraction tool for building great AI, ML, and RAG applications.

Features

  • High-performance unstructured data extraction optimized for speed and low memory usage.
  • Clear and simple API for extracting text and metadata content.
  • Automatically identifies document types and extracts content accordingly.
  • Supports many file formats (most formats supported by Apache Tika).
  • Extracts text from images and scanned documents with OCR through tesseract-ocr.
  • Core engine written in Rust with bindings for Python and upcoming support for JavaScript/TypeScript.
  • Detailed documentation and examples to help you get started quickly and efficiently.
  • Free for Commercial Use: Apache 2.0 License.

Performance

Extractous is fast. When extracting content from SEC10 filing PDFs, Extractous is on average ~18x faster than, for example, unstructured-io. You can run the benchmarks yourself.
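To make a figure such as "~18x faster" concrete, here is a minimal sketch of how an average speedup could be measured. It is not the project's benchmark suite. The helper names `average_seconds` and `speedup` are hypothetical, and the two stand-in extractors in the demo are placeholders for whatever extraction functions you want to compare.

```python
import time
from statistics import mean
from typing import Callable, Sequence


def average_seconds(
    extract: Callable[[str], object],
    paths: Sequence[str],
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Return the mean wall-clock time of extract(path) over all paths.

    `extract` is any callable that takes a file path (hypothetical
    placeholder for a real extraction function). `paths` must be non-empty.
    """
    durations = []
    for path in paths:
        start = clock()
        extract(path)
        durations.append(clock() - start)
    return mean(durations)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


if __name__ == "__main__":
    # Stand-in extractors (assumptions for illustration only): replace them
    # with calls into the tools you actually want to compare.
    def slow_extractor(path: str) -> str:
        time.sleep(0.02)
        return path

    def fast_extractor(path: str) -> str:
        time.sleep(0.001)
        return path

    files = ["a.pdf", "b.pdf", "c.pdf"]  # placeholder file names
    baseline = average_seconds(slow_extractor, files)
    candidate = average_seconds(fast_extractor, files)
    print(f"speedup: ~{speedup(baseline, candidate):.1f}x")
```

Averaging per-file times, rather than timing one file, reduces the influence of a single unusually large or small document. Results still depend on the machine and on the documents used.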

License

Apache 2.0