Unstructured.io

What it is

An open-source library and platform for pre-processing messy, unstructured data (PDFs, HTML, Word docs) into clean, AI-ready formats.

What problem it solves

It automates the ingestion of diverse document types, handling complex layouts and extracting clean text and metadata for RAG pipelines.

Where it fits in the stack

Category: Intake & Storage / Data Processing

Typical use cases

  • RAG Pipelines: Extracting text and metadata from varied document sets for vector database ingestion.
  • Data Lake Hydration: Normalizing disparate document formats (PDF, Word, Email) into a standard JSON/Markdown format.
  • Knowledge Graph Construction: Extracting structured elements and relationships from messy documents.

Strengths

  • Broad Format Support: Handles 20+ file types including PDF, HTML, Word, and PowerPoint.
  • Open-Source & Local: Can be run fully offline without data leaving your infrastructure.
  • Layout Awareness: Not just OCR; it understands headers, lists, and tables.

Limitations

  • Resource Intensive: Complex partitioning (especially with vision models) requires significant CPU/GPU.
  • Dependency Heavy: The "all-docs" installation is large and can have version conflicts.
  • Performance Variability: Extraction quality can vary significantly based on the partitioning strategy chosen (fast vs. hi-res).

When to use it

  • When you have a high volume of diverse, messy document types.
  • When data privacy requires local processing of sensitive documents.
  • When you need more than just raw text (e.g., you need to preserve document structure).

When not to use it

  • For very simple text files or clean Markdown where standard readers suffice.
  • If you need real-time, low-latency parsing (it is optimized for batch ETL).

Licensing and cost

  • Open Source: Yes (Apache 2.0)
  • Cost: Free (Self-hosted) / Paid (Unstructured API / Platform)
  • Self-hostable: Yes

Partitioning Strategies

The Unstructured library offers several partitioning strategies for preprocessing documents, selected via the strategy parameter:

  • auto (hybrid): Most documents. The default; balances speed and accuracy automatically.
  • fast (rule-based): Plain text and clean PDFs. Roughly 100x faster than model-based strategies, but fails on tables and images.
  • hi_res (model-based): Complex layouts and tables. Highest accuracy for structural elements; slower.
  • ocr_only (model-based): Scanned documents and images. Pure OCR approach that ignores any extractable text layer.
  • vlm (vision-model): Challenging or handwritten documents. Uses Vision Language Models for maximum semantic recovery.
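To make the trade-offs concrete, here is a minimal sketch of choosing a strategy up front. The heuristic and its thresholds are hypothetical illustrations, not Unstructured's internal auto-detection logic.

```python
from pathlib import Path

# Hypothetical heuristic for picking a partitioning strategy ahead of time.
# These rules are illustrative only; Unstructured's own "auto" strategy
# applies its own (more sophisticated) detection internally.
def pick_strategy(path: str, scanned: bool = False, has_tables: bool = False) -> str:
    suffix = Path(path).suffix.lower()
    if scanned:
        return "ocr_only"   # no extractable text layer: pure OCR
    if has_tables:
        return "hi_res"     # layout model needed for table structure
    if suffix in {".txt", ".md", ".html"}:
        return "fast"       # rule-based parsing is sufficient
    return "auto"           # let the library decide

print(pick_strategy("notes.md"))                # fast
print(pick_strategy("scan.pdf", scanned=True))  # ocr_only
```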

Getting started

Installation

pip install "unstructured[all-docs]"

Basic usage

from unstructured.partition.auto import partition

elements = partition(filename="example.pdf")

for element in elements:
    print(element)
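Each element returned by partition() carries a category and metadata in addition to its text, which makes downstream filtering straightforward. The sketch below uses plain dictionaries that approximate the shape of element.to_dict() output (exact keys may vary by version), keeping only content-bearing element types.

```python
# Approximate shape of serialized elements; the content is made up.
elements = [
    {"type": "Title", "text": "Quarterly Report", "metadata": {"page_number": 1}},
    {"type": "NarrativeText", "text": "Revenue grew 12% year over year.", "metadata": {"page_number": 1}},
    {"type": "Footer", "text": "Confidential", "metadata": {"page_number": 1}},
]

# Keep only content-bearing types; drop headers, footers, and other chrome.
keep = {"Title", "NarrativeText", "Table"}
clean = [e for e in elements if e["type"] in keep]

for e in clean:
    print(f'{e["type"]}: {e["text"]}')
```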

CLI examples

# Process a local directory and output JSON
unstructured-ingest local \
  --input-path example-docs \
  --output-dir unstructured-output \
  --num-processes 2 \
  --recursive \
  --verbose

# Process from S3 (requires [s3] extra)
unstructured-ingest s3 \
  --remote-url s3://my-bucket/documents/ \
  --output-dir s3-output \
  --anonymous \
  --recursive

Python S3 Ingestion Example

import os

# Note: newer releases moved ingest functionality into the separate
# `unstructured-ingest` package; the import paths below follow the older
# layout bundled with `unstructured`.
from unstructured.ingest.connector.s3 import S3AccessConfig, SimpleS3Config
from unstructured.ingest.interfaces import ProcessorConfig, ReadConfig
from unstructured.ingest.runner import S3Runner

# Set credentials via env vars or S3AccessConfig
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_SECRET"

runner = S3Runner(
    processor_config=ProcessorConfig(
        verbose=True,
        output_dir="s3-output",
        num_processes=2,
        reprocess=False # Skip files already processed
    ),
    read_config=ReadConfig(),
    connector_config=SimpleS3Config(
        access_config=S3AccessConfig(),
        remote_url="s3://my-bucket/documents/",
        recursive=True
    ),
)

runner.run()

Advanced Pipeline: Chunking for RAG

from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename="research_paper.pdf",
    strategy="hi_res",
    extract_images_in_pdf=False,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=1000,
    combine_text_under_n_chars=200
)

# Access clean, structured chunks
for chunk in elements:
    print(f"Type: {chunk.category}")
    print(f"Content: {chunk.text[:50]}...")
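Conceptually, by_title chunking starts a new chunk at each Title element and caps chunk size. A simplified pure-Python sketch of that idea follows; it is not the library's actual algorithm, which also handles merging small sections, overlap, and more.

```python
# Simplified sketch of title-based chunking. Elements are (category, text)
# pairs; a new chunk starts at each Title, and chunks are capped at
# max_characters.
def chunk_by_title_sketch(elements, max_characters=1000):
    chunks, current, size = [], [], 0
    for category, text in elements:
        # A Title starts a new section, as does exceeding the size cap.
        if current and (category == "Title" or size + len(text) > max_characters):
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        chunks.append(" ".join(current))
    return chunks

elements = [
    ("Title", "Introduction"),
    ("NarrativeText", "Large language models need clean input."),
    ("Title", "Methods"),
    ("NarrativeText", "We partition documents into elements."),
]
print(chunk_by_title_sketch(elements))  # two chunks, one per section
```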

API examples

import requests

url = "https://api.unstructured.io/general/v0/general"
headers = {"Accept": "application/json", "unstructured-api-key": "YOUR_API_KEY"}

# Strategy and coordinates are passed as multipart form fields
data = {
    "strategy": "hi_res",
    "coordinates": "true",
}

# Open the file in a context manager so the handle is closed after the upload
with open("example.pdf", "rb") as f:
    response = requests.post(url, headers=headers, files={"files": f}, data=data)

response.raise_for_status()
print(response.json())
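The API responds with a flat JSON list of elements. A common next step is flattening that response into records for embedding and vector storage; the payload below is an illustrative stand-in for response.json(), and the record shape is a hypothetical example rather than a required schema.

```python
# Illustrative stand-in for response.json(): a flat list of elements.
payload = [
    {"element_id": "a1", "type": "Title", "text": "Example Report"},
    {"element_id": "b2", "type": "NarrativeText", "text": "First paragraph."},
    {"element_id": "c3", "type": "PageBreak", "text": ""},
]

# Build embedding-ready records, skipping empty elements.
records = [
    {"id": el["element_id"], "text": el["text"], "source": "example.pdf"}
    for el in payload
    if el["text"].strip()
]
print(len(records))  # 2
```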

Contribution Metadata

  • Last reviewed: 2026-05-11
  • Confidence: high