Configuration

Complete configuration options for document extraction.

This page lists all available configuration options for the extraction API. For a quick start guide with minimal configuration, see the Quickstart.

Core Options

strategy

  • Type: string
  • Required: true
  • Default: "FAST_WITH_OCR"
  • Options: "FAST", "FAST_WITH_OCR", "ACCURATE", "ACCURATE_WITH_OCR"

Controls the extraction strategy and accuracy level.

  • FAST: Quick extraction without OCR
  • FAST_WITH_OCR: Quick extraction with OCR for images and scanned documents
  • ACCURATE: Detailed extraction without OCR
  • ACCURATE_WITH_OCR: Detailed extraction with OCR (highest accuracy, slower)
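
As a minimal sketch, a client could check the strategy value before sending a request. The helper names `validate_strategy` and `uses_ocr` are hypothetical and not part of any official SDK; the allowed values and the default are taken from the list above.

```python
# Allowed values and default, copied from the option list above.
VALID_STRATEGIES = ("FAST", "FAST_WITH_OCR", "ACCURATE", "ACCURATE_WITH_OCR")
DEFAULT_STRATEGY = "FAST_WITH_OCR"


def validate_strategy(strategy: str = DEFAULT_STRATEGY) -> str:
    """Return the strategy unchanged if it is a documented value (hypothetical helper)."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(
            f"Unknown strategy {strategy!r}; expected one of {', '.join(VALID_STRATEGIES)}"
        )
    return strategy


def uses_ocr(strategy: str) -> bool:
    """True for the two strategies documented as running OCR."""
    return validate_strategy(strategy).endswith("_WITH_OCR")
```

Validating on the client side only catches typos early; the API remains the authority on which values it accepts.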

PDF Configuration Options

pdf_config

Configuration options specific to PDF document processing.

ocr_strategy

  • Type: string
  • Default: "AUTO"
  • Options: "AUTO", "FORCE", "DISABLE"

Controls whether OCR is applied when processing PDF documents.

extract_annotation_text

  • Type: boolean
  • Default: true

Extract text from PDF annotations.

extract_inline_images

  • Type: boolean
  • Default: false

Extract inline images from PDF documents.

extract_marked_content

  • Type: boolean
  • Default: false

Extract marked content from PDF documents.

extract_unique_inline_images_only

  • Type: boolean
  • Default: false

Only extract unique inline images, avoiding duplicates.
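
The PDF options above can be sketched as a small data class whose field defaults mirror the documented ones. `PdfConfig` is a hypothetical name used for illustration, not a type provided by the API. The example also assumes that the unique-only flag is meant to be combined with inline image extraction, which this page does not state explicitly.

```python
from dataclasses import asdict, dataclass


@dataclass
class PdfConfig:
    """Hypothetical container mirroring the documented pdf_config defaults."""

    ocr_strategy: str = "AUTO"  # one of "AUTO", "FORCE", "DISABLE"
    extract_annotation_text: bool = True
    extract_inline_images: bool = False
    extract_marked_content: bool = False
    extract_unique_inline_images_only: bool = False

    def __post_init__(self) -> None:
        # Reject values outside the documented option list.
        if self.ocr_strategy not in ("AUTO", "FORCE", "DISABLE"):
            raise ValueError(f"Unsupported ocr_strategy: {self.ocr_strategy!r}")


# Example: request inline images, keeping only unique ones
# (assumption: the two flags are intended to be used together).
pdf_config = asdict(
    PdfConfig(extract_inline_images=True, extract_unique_inline_images_only=True)
)
```

Here `pdf_config` is a plain dictionary that can be nested under the `pdf_config` key of a request configuration.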

OCR Configuration Options

ocr_config

Configuration options for OCR (Optical Character Recognition) processing.

apply_rotation

  • Type: boolean
  • Default: false

Automatically rotate images for optimal OCR processing.

density

  • Type: number
  • Default: 300

Image resolution in DPI for OCR processing.

depth

  • Type: number
  • Default: 8

Color depth for image processing.

enable_image_preprocessing

  • Type: boolean
  • Default: false

Enable image preprocessing for improved OCR results.

language

  • Type: string
  • Default: "eng"

Language used for OCR processing.

timeout_seconds

  • Type: number
  • Default: 120

Maximum time in seconds for OCR processing.
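
One way to assemble an `ocr_config` is to start from the documented defaults and apply overrides. The sketch below uses a hypothetical helper, `build_ocr_config`. The check that numeric options are positive is a sanity check assumed here, not a rule stated by the API.

```python
# Defaults copied from the OCR option list above.
OCR_DEFAULTS = {
    "apply_rotation": False,
    "density": 300,
    "depth": 8,
    "enable_image_preprocessing": False,
    "language": "eng",
    "timeout_seconds": 120,
}


def build_ocr_config(**overrides):
    """Merge overrides into the documented defaults (hypothetical helper)."""
    unknown = set(overrides) - set(OCR_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown ocr_config option(s): {sorted(unknown)}")
    config = {**OCR_DEFAULTS, **overrides}
    for name in ("density", "depth", "timeout_seconds"):
        value = config[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)) or value <= 0:
            raise ValueError(f"{name} must be a positive number, got {value!r}")
    return config


# Example: allow more time for rotated, low-quality scans.
ocr_config = build_ocr_config(
    apply_rotation=True,
    enable_image_preprocessing=True,
    timeout_seconds=300,
)
```

Options that are not overridden keep the defaults listed above, so the resulting dictionary always contains all six keys.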

Office Configuration Options

office_config

Configuration options for processing Microsoft Office documents.

concatenate_phonetic_runs

  • Type: boolean
  • Default: true

Combine phonetic runs in text extraction.

extract_all_alternatives_from_msg

  • Type: boolean
  • Default: false

Extract all alternative content from MSG files.

extract_macros

  • Type: boolean
  • Default: false

Extract macros from Office documents.

include_deleted_content

  • Type: boolean
  • Default: false

Include deleted/revised content in extraction.

include_headers_and_footers

  • Type: boolean
  • Default: true

Include headers and footers in extracted content.

include_missing_rows

  • Type: boolean
  • Default: false

Include empty/missing rows in table extraction.

include_move_from_content

  • Type: boolean
  • Default: false

Include moved content at its original (move-from) location in extraction results.

include_shape_based_content

  • Type: boolean
  • Default: true

Include content from shapes and text boxes.

include_slide_master_content

  • Type: boolean
  • Default: true

Include slide master content in PowerPoint extractions.

include_slide_notes

  • Type: boolean
  • Default: true

Include slide notes in PowerPoint extractions.
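
Because every Office option has a default, a request only needs to carry the values that differ from it. The sketch below assumes the API falls back to the documented defaults for omitted options; `non_default_options` is a hypothetical helper name.

```python
# Defaults copied from the Office option list above.
OFFICE_DEFAULTS = {
    "concatenate_phonetic_runs": True,
    "extract_all_alternatives_from_msg": False,
    "extract_macros": False,
    "include_deleted_content": False,
    "include_headers_and_footers": True,
    "include_missing_rows": False,
    "include_move_from_content": False,
    "include_shape_based_content": True,
    "include_slide_master_content": True,
    "include_slide_notes": True,
}


def non_default_options(overrides):
    """Keep only the options whose value differs from the documented default."""
    unknown = set(overrides) - set(OFFICE_DEFAULTS)
    if unknown:
        raise KeyError(f"Unknown office_config option(s): {sorted(unknown)}")
    return {
        name: value
        for name, value in overrides.items()
        if OFFICE_DEFAULTS[name] != value
    }


# Example: keep revision history and drop headers and footers.
# include_slide_notes=True matches the default, so it is left out.
office_config = non_default_options(
    {
        "include_deleted_content": True,
        "include_headers_and_footers": False,
        "include_slide_notes": True,
    }
)
```

Sending only the differences keeps requests short, at the cost of depending on the server-side defaults staying as documented.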

Example Request

The example request below sets every available configuration option to its default value. You can omit any options you don't need.

curl --request POST \
  --url https://api.extractous.com/v1/extract \
  --header 'Content-Type: multipart/form-data' \
  --header 'X-Api-Key: YOUR_API_KEY' \
  --form 'file=@PATH_TO_YOUR_FILE' \
  --form 'config[strategy]=FAST_WITH_OCR' \
  --form 'config[pdf_config][ocr_strategy]=AUTO' \
  --form 'config[pdf_config][extract_annotation_text]=true' \
  --form 'config[pdf_config][extract_inline_images]=false' \
  --form 'config[pdf_config][extract_marked_content]=false' \
  --form 'config[pdf_config][extract_unique_inline_images_only]=false' \
  --form 'config[ocr_config][apply_rotation]=false' \
  --form 'config[ocr_config][density]=300' \
  --form 'config[ocr_config][depth]=8' \
  --form 'config[ocr_config][enable_image_preprocessing]=false' \
  --form 'config[ocr_config][language]=eng' \
  --form 'config[ocr_config][timeout_seconds]=120' \
  --form 'config[office_config][concatenate_phonetic_runs]=true' \
  --form 'config[office_config][extract_all_alternatives_from_msg]=false' \
  --form 'config[office_config][extract_macros]=false' \
  --form 'config[office_config][include_deleted_content]=false' \
  --form 'config[office_config][include_headers_and_footers]=true' \
  --form 'config[office_config][include_missing_rows]=false' \
  --form 'config[office_config][include_move_from_content]=false' \
  --form 'config[office_config][include_shape_based_content]=true' \
  --form 'config[office_config][include_slide_master_content]=true' \
  --form 'config[office_config][include_slide_notes]=true'
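
The form fields in the request above follow a bracketed naming pattern, such as `config[pdf_config][ocr_strategy]`. As a sketch, a nested configuration can be flattened into the same field names programmatically. `flatten_config` is a hypothetical helper, and rendering booleans as lowercase `true`/`false` is inferred from the curl example rather than from a formal specification.

```python
def flatten_config(config, prefix="config"):
    """Flatten a nested dict into (field_name, value) pairs using bracket notation."""
    fields = []
    for key, value in config.items():
        name = f"{prefix}[{key}]"
        if isinstance(value, dict):
            # Recurse into pdf_config, ocr_config and office_config.
            fields.extend(flatten_config(value, name))
        elif isinstance(value, bool):
            # The curl example sends booleans as lowercase strings.
            fields.append((name, "true" if value else "false"))
        else:
            fields.append((name, str(value)))
    return fields


fields = flatten_config(
    {
        "strategy": "ACCURATE",
        "ocr_config": {"density": 300, "apply_rotation": True},
    }
)
# fields == [
#     ("config[strategy]", "ACCURATE"),
#     ("config[ocr_config][density]", "300"),
#     ("config[ocr_config][apply_rotation]", "true"),
# ]
```

The resulting pairs can be sent as ordinary multipart form fields alongside the `file` part by whichever HTTP client you use.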