Extract to Stream

How to extract the content of a file (or of a URL or bytes) into a StreamReader.

The extract_file method returns a stream that implements std::io::Read, so large documents can be processed incrementally instead of being loaded into memory all at once.

Basic Usage

use std::io::{BufReader, Read};
use extractous::Extractor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = Extractor::new();
    let stream = extractor.extract_file("path/to/document.pdf")?;
    
    // Create a buffered reader
    let mut reader = BufReader::new(stream);
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer)?;
    
    // Convert to a String if needed (this fails if the bytes are not valid UTF-8)
    let content = String::from_utf8(buffer)?;
    println!("{}", content);
    Ok(())
}
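
Because the stream implements std::io::Read, the read-then-convert steps above can also be collapsed into a single call to read_to_string from the standard library. The following is a minimal sketch rather than part of the extractous API: the helper name read_all_text is hypothetical, and it is written against any reader so that it does not depend on how the stream was obtained.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of extractous): drains any reader into a String.
/// `read_to_string` returns an error if the bytes are not valid UTF-8.
fn read_all_text<R: Read>(mut reader: R) -> io::Result<String> {
    let mut content = String::new();
    reader.read_to_string(&mut content)?;
    Ok(content)
}
```

With the reader from the example above, this would be called as read_all_text(reader)?. Like the example, it holds the whole document in memory, so it suits small and medium-sized files.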

Chunk Processing

To keep memory usage bounded, read the content in fixed-size chunks:

use std::io::{BufReader, Read};
use extractous::Extractor;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = Extractor::new();
    let stream = extractor.extract_file("path/to/document.pdf")?;
    let mut reader = BufReader::new(stream);

    let mut buffer = [0u8; 1024]; // 1 KiB chunks
    loop {
        match reader.read(&mut buffer)? {
            0 => break, // EOF
            n => {
                // Only the first n bytes of the buffer hold data from this read
                let chunk = &buffer[..n];
                // Your processing logic here
            }
        }
    }
    Ok(())
}
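
One thing to watch when processing chunks as text: a fixed-size byte chunk can end in the middle of a multi-byte UTF-8 character, so converting each chunk to a string on its own may fail. The sketch below shows one possible way to handle this, by carrying the incomplete trailing bytes over to the next read. It assumes the stream yields UTF-8 text. The helper name for_each_text_chunk and its parameters are hypothetical and not part of extractous; it accepts any std::io::Read.

```rust
use std::io::{self, Read};

/// Hypothetical helper (not part of extractous): reads `reader` in chunks of up
/// to `chunk_size` bytes and passes each piece to `on_text` as valid UTF-8.
/// Bytes of a character split across a chunk boundary are held back until the
/// rest of that character arrives.
fn for_each_text_chunk<R: Read, F: FnMut(&str)>(
    mut reader: R,
    chunk_size: usize,
    mut on_text: F,
) -> io::Result<()> {
    assert!(chunk_size > 0, "chunk_size must be greater than zero");
    let mut buf = vec![0u8; chunk_size];
    let mut pending: Vec<u8> = Vec::new(); // bytes not yet handed to `on_text`

    loop {
        let n = reader.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        pending.extend_from_slice(&buf[..n]);

        // Find how many leading bytes form complete, valid UTF-8.
        let valid = match std::str::from_utf8(&pending) {
            Ok(_) => pending.len(),
            // `error_len() == None` means the input ends mid-character.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            // Anything else is genuinely invalid UTF-8.
            Err(e) => return Err(io::Error::new(io::ErrorKind::InvalidData, e)),
        };

        if valid > 0 {
            // The prefix was validated just above, so this cannot fail.
            let text = std::str::from_utf8(&pending[..valid]).expect("validated prefix");
            on_text(text);
        }
        pending.drain(..valid);
    }

    if !pending.is_empty() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidData,
            "stream ended inside a UTF-8 character",
        ));
    }
    Ok(())
}
```

With the reader from the example above, a call could look like for_each_text_chunk(reader, 1024, |text| print!("{}", text))?. At most a few bytes are carried between reads, so memory use stays close to the chosen chunk size.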

Configuration

Stream extraction supports the same configuration options as extract_file_to_string:

use extractous::{Extractor, PdfParserConfig};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let extractor = Extractor::new()
        .set_pdf_config(
            PdfParserConfig::new()
                .set_extract_annotation_text(false)
        );
        
    let stream = extractor.extract_file("path/to/document.pdf")?;
    // Process stream...
    Ok(())
}
