PageIndex

What it is

PageIndex is a vectorless, reasoning-based RAG framework that builds hierarchical tree indices from long documents. It enables human-like retrieval by allowing LLMs to reason over document structure instead of relying on traditional vector similarity.

What problem it solves

It addresses the inaccuracies of vector similarity search in professional documents, where semantic similarity does not always equal relevance. By simulating how human experts navigate complex PDFs, PageIndex aims for higher retrieval accuracy (a reported 98.7% on FinanceBench) and better explainability for domain-specific retrieval.

Where it fits in the stack

Tool / Agent: It acts as a specialized retrieval tool and an agentic framework for document navigation.

Technical Capabilities

  • Hierarchical Tree Indexing: Converts documents into semantic TOC-like trees for structured navigation.
  • Vision-Aware Retrieval: Supports multimodal analysis for documents with complex charts and tables.
  • Reasoning-Based Search: Uses LLM logic to decide which sections are most relevant to a query.
  • Vector-Free RAG: Operates without the need for embedding generation or vector database management.
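
As a rough illustration of how tree indexing and reasoning-based search fit together, the sketch below builds a small TOC-like tree and walks it from the root, choosing one child per level. It is a minimal sketch under stated assumptions, not PageIndex's implementation: the `Node` class, `tree_search`, and `keyword_score` are hypothetical names, and `keyword_score` is a simple keyword-overlap stand-in for the LLM relevance judgment that a real system would obtain by prompting a model.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One section of a document tree (hypothetical shape, for illustration)."""
    title: str
    start_page: int
    end_page: int
    summary: str = ""
    children: List["Node"] = field(default_factory=list)


def keyword_score(query: str, node: Node) -> int:
    """Stand-in for an LLM judgment: count query words found in title/summary."""
    text = f"{node.title} {node.summary}".lower()
    return sum(1 for word in set(query.lower().split()) if word in text)


def tree_search(
    root: Node,
    query: str,
    score: Callable[[str, Node], int] = keyword_score,
) -> List[Node]:
    """Descend from the root, following the best-scoring child at each level."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child))
        path.append(node)
    return path


report = Node("Annual Report", 1, 120, children=[
    Node("Business Overview", 1, 30, "products, markets, strategy"),
    Node("Financial Statements", 31, 100, "balance sheet, income, cash flow", children=[
        Node("Balance Sheet", 31, 50, "assets, liabilities, equity"),
        Node("Cash Flow Statement", 51, 70, "operating, investing, financing cash flow"),
    ]),
    Node("Risk Factors", 101, 120, "regulatory and market risk"),
])

path = tree_search(report, "operating cash flow")
leaf = path[-1]
print(" > ".join(n.title for n in path), f"(pages {leaf.start_page}-{leaf.end_page})")
# Annual Report > Financial Statements > Cash Flow Statement (pages 51-70)
```

The returned path doubles as a retrieval trace: each step names the section that was chosen, which is where the explainability described in this entry comes from. Swapping `keyword_score` for a function that asks a model to rank the children would keep the same control flow.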

Typical use cases

  • Professional Analysis: Analyzing SEC filings, insurance policies, or dense textbooks where precise section retrieval is required.
  • Tree-based Navigation: Managing very long documents that exceed standard context limits by navigating a semantic "Table of Contents" tree.
  • Vision-based Retrieval: Performing RAG directly on page images for documents where OCR is unreliable or layouts are highly visual.

Strengths

  • No Vector DB Required: Removes the need to generate embeddings and to build, host, and maintain a vector index.
  • No Artificial Chunking: Preserves natural document hierarchy and context.
  • High Explainability: Retrieval steps are based on explicit reasoning and provide clear references.
  • Superior Accuracy: Reported state-of-the-art accuracy on FinanceBench (98.7%).

Limitations

  • LLM Cost/Latency: Heavy reliance on multiple LLM reasoning calls can increase operational costs and latency.
  • Model Optimization: Currently optimized mainly for high-end models such as GPT-4o.
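
To make the cost and latency trade-off concrete, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate under loudly stated assumptions: it assumes one LLM navigation call per tree level plus one answer call, and the token counts and price are made-up placeholders, not measured PageIndex figures or real model pricing.

```python
def estimate_query_cost(tree_depth, tokens_per_call, price_per_1k_tokens, answer_calls=1):
    """Rough per-query cost, assuming one navigation call per tree level.

    All inputs are hypothetical; substitute your own measurements.
    """
    calls = tree_depth + answer_calls
    total_tokens = calls * tokens_per_call
    return calls, total_tokens * price_per_1k_tokens / 1000


# Placeholder numbers: a 3-level tree, 2,000 tokens per call, $0.005 per 1k tokens.
calls, cost = estimate_query_cost(tree_depth=3, tokens_per_call=2000, price_per_1k_tokens=0.005)
print(f"{calls} LLM calls, about ${cost:.2f} per query")
```

Because the calls in this model run one after another, latency grows with tree depth as well, which is why high query volumes with tight latency budgets are a poor fit.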

When to use it

  • When working with high-value professional documents where retrieval precision is paramount.
  • When you need traceable, interpretable evidence for model answers.
  • When document structure (headings, sections) is a strong signal for relevance.

When not to use it

  • For simple semantic search where approximate, similarity-based ("vibe-based") retrieval is sufficient.
  • When extremely low latency is required for a massive number of concurrent queries.

Getting started

Installation

# Clone the repository
git clone https://github.com/VectifyAI/PageIndex.git
cd PageIndex

# Install dependencies
pip install -r requirements.txt

Basic usage

# Create a .env file containing your OpenAI API key (OPENAI_API_KEY here; confirm the variable name in the repository README)
# Run the PageIndex structure generation
python run_pageindex.py --pdf_path example.pdf

CLI examples

# Generate tree structure for a local PDF
python run_pageindex.py --pdf_path document.pdf

# Generate structure for a Markdown file
python run_pageindex.py --md_path document.md

# Customize extraction (max pages per node)
python run_pageindex.py --pdf_path doc.pdf --max-pages-per-node 5
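
Once a structure has been generated, it can be inspected with a few lines of Python. The sketch below prints an indented outline from a nested tree. The field names (`title`, `start_index`, `end_index`, `nodes`) are assumptions about the generated JSON, and the inline `sample` stands in for a real output file, so check both against what your run actually produces.

```python
import json
from typing import Iterator


def outline(nodes: list, depth: int = 0) -> Iterator[str]:
    """Yield one indented line per node, recursing into child nodes."""
    for node in nodes:
        pages = f"{node.get('start_index', '?')}-{node.get('end_index', '?')}"
        yield f"{'  ' * depth}{node.get('title', '(untitled)')} [pages {pages}]"
        yield from outline(node.get("nodes", []), depth + 1)


# Inline sample standing in for a generated structure file (assumed field names).
# With a real run, load the file instead: json.load(open(path_to_structure_json)).
sample = json.loads("""
[
  {"title": "Financial Stability", "start_index": 21, "end_index": 28, "nodes": [
    {"title": "Monitoring Financial Vulnerabilities", "start_index": 22, "end_index": 28}
  ]},
  {"title": "Appendix", "start_index": 29, "end_index": 30}
]
""")

for line in outline(sample):
    print(line)
```

Missing fields fall back to placeholders rather than raising, which keeps the helper usable if the real output differs slightly from the shape assumed here.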

API & MCP Configuration

PageIndex provides an MCP server so that MCP-capable clients and agents, such as Claude Desktop, can call it as a tool.

MCP Configuration Example (claude_desktop_config.json)

{
  "mcpServers": {
    "pageindex": {
      "command": "npx",
      "args": [
        "-y",
        "@vectify/pageindex-mcp"
      ],
      "env": {
        "PAGEINDEX_API_KEY": "your_api_key_here",
        "OPENAI_API_KEY": "your_openai_key_here"
      }
    }
  }
}
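
If you would rather not paste keys into the file by hand, a configuration like the one above can be generated from environment variables. This is a minimal sketch, not an official tool: `build_mcp_config` and `missing_keys` are hypothetical helpers, and the package name and variable names are simply copied from the example above.

```python
import json
import os


def build_mcp_config(env=os.environ):
    """Build the MCP server entry, filling keys from the given environment mapping."""
    return {
        "mcpServers": {
            "pageindex": {
                "command": "npx",
                "args": ["-y", "@vectify/pageindex-mcp"],
                "env": {
                    "PAGEINDEX_API_KEY": env.get("PAGEINDEX_API_KEY", ""),
                    "OPENAI_API_KEY": env.get("OPENAI_API_KEY", ""),
                },
            }
        }
    }


def missing_keys(config):
    """Return the names of any keys that were left empty."""
    server_env = config["mcpServers"]["pageindex"]["env"]
    return sorted(name for name, value in server_env.items() if not value)


config = build_mcp_config()
if missing_keys(config):
    print("Missing environment variables:", ", ".join(missing_keys(config)))
print(json.dumps(config, indent=2))
```

The printed JSON can be merged into claude_desktop_config.json. Note that the resulting file still stores the keys in plain text, so treat it like any other secret-bearing file.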

Contribution Metadata

  • Confidence: high
  • Last reviewed: 2026-05-16