Extract using OCR

How to extract text from scanned or image-based files using OCR with extractous and Tesseract.

Prerequisites

Install Tesseract and the language packs you need (the commands below are for Debian/Ubuntu):

# Install Tesseract
sudo apt install tesseract-ocr

# Install language packs (example: German)
sudo apt install tesseract-ocr-deu
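
Before running any extraction, it can help to confirm that Tesseract is on the PATH and that the expected language packs are installed. The sketch below is one possible way to do this from Python by calling `tesseract --list-langs`; the helper names `parse_language_list` and `installed_languages` are illustrative and not part of extractous.

```python
import shutil
import subprocess


def parse_language_list(output: str) -> list[str]:
    """Turn the text printed by `tesseract --list-langs` into language codes."""
    lines = [line.strip() for line in output.splitlines()]
    # The first line is a header such as 'List of available languages ... (3):'
    return [line for line in lines if line and not line.startswith("List of")]


def installed_languages() -> list[str]:
    """Return the language codes reported by the local Tesseract install."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some Tesseract versions print the list on stderr instead of stdout,
    # so both streams are parsed.
    return parse_language_list(result.stdout + "\n" + result.stderr)


# Example (requires Tesseract to be installed):
#   if "deu" not in installed_languages():
#       print("German language pack is missing")
```

If `"deu"` is absent from the returned list, installing `tesseract-ocr-deu` as shown above should add it.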

Basic OCR Usage

from extractous import Extractor, TesseractOcrConfig

def extract_with_ocr():
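    # Configure the extractor with OCR settings. The setter is chained so the
    # configured Extractor it returns is the one that gets used.
    extractor = Extractor().set_ocr_config(
        TesseractOcrConfig().set_language("deu")
    )

    # Extract content. Depending on the extractous version, this call may
    # return a (content, metadata) tuple rather than a plain string.
    content = extractor.extract_file_to_string("path/to/document.pdf")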
    return content

# Usage with error handling
try:
    content = extract_with_ocr()
    print(content)
except Exception as e:
    print(f"Error extracting content: {e}")

Multi-language OCR

from extractous import Extractor, TesseractOcrConfig

def extract_multi_language():
    # Join multiple languages with '+'; each language pack must be installed.
    # The setter is chained so the configured Extractor it returns is used.
    extractor = Extractor().set_ocr_config(
        TesseractOcrConfig().set_language("eng+deu")
    )

    return extractor.extract_file_to_string("path/to/document.pdf")

# Usage with error handling
try:
    content = extract_multi_language()
    print(content)
except Exception as e:
    print(f"Error extracting content: {e}")
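
When the set of languages comes from configuration or user input, the `+`-joined string can be built programmatically rather than typed by hand. The following is a minimal sketch under that assumption; `build_language_spec` and `missing_languages` are hypothetical helpers, not extractous functions, and the list of available languages is assumed to come from your own check of the Tesseract installation.

```python
def build_language_spec(languages):
    """Join language codes into the 'eng+deu' form, keeping order and dropping duplicates."""
    unique = []
    for code in languages:
        code = code.strip()
        if not code:
            raise ValueError("language codes must be non-empty")
        if code not in unique:
            unique.append(code)
    if not unique:
        raise ValueError("at least one language code is required")
    return "+".join(unique)


def missing_languages(spec, available):
    """Return the codes in a '+'-joined spec that are not in the available list."""
    return [code for code in spec.split("+") if code not in available]


spec = build_language_spec(["eng", "deu", "eng"])
print(spec)  # eng+deu
```

The resulting string can be passed to `set_language` exactly as `"eng+deu"` is in the example above, and `missing_languages` gives an early, readable error before OCR is attempted.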