Introduction

Extractous is a fast and resource-efficient data extraction tool for building AI, RAG, and ML applications.

Features

  • High-performance unstructured data extraction optimized for speed and low memory usage.

  • Clear and simple API for extracting text and metadata content.

  • Automatically identifies document types and extracts content accordingly.

  • Supports many file formats (most formats supported by Apache Tika).

  • Extracts text from images and scanned documents with OCR through tesseract-ocr.

  • Core engine written in Rust with bindings for Python and upcoming support for JavaScript/TypeScript.

  • Detailed documentation and examples to help you get started quickly and efficiently.

  • Free for Commercial Use: Apache 2.0 License.

Performance

Extractous is fast, but please don't take our word for it: you can run the benchmarks yourself. For example, when extracting content from sec10 filings (PDF forms), extractous is on average ~18x faster than unstructured-io.
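
To make the "on average ~18x faster" figure concrete, here is a minimal sketch of how such a comparison could be timed. It is not the project's benchmark suite. The helper names `mean_seconds_per_file` and `speedup` are hypothetical, and the sketch assumes each tool's extraction call is wrapped in a plain function that takes a file path.

```python
import statistics
import time
from typing import Callable, Iterable


def mean_seconds_per_file(
    extract: Callable[[str], object],
    paths: Iterable[str],
    repeats: int = 3,
) -> float:
    """Return the mean wall-clock seconds one extraction call takes per file.

    `extract` is a hypothetical wrapper around whichever tool is being
    measured; it receives a file path and its return value is ignored.
    """
    paths = list(paths)
    if not paths or repeats < 1:
        raise ValueError("need at least one path and one repeat")

    timings = []
    for _ in range(repeats):
        for path in paths:
            start = time.perf_counter()
            extract(path)
            timings.append(time.perf_counter() - start)
    return statistics.mean(timings)


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds
```

To compare two tools, time each one over the same list of files and pass the two means to `speedup`, with the slower tool as the baseline. A result near 18 would correspond to the figure quoted above; actual numbers will depend on the hardware and on the documents used.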