Apache Tika¶
The Apache Tika toolkit detects and extracts metadata and text from over a thousand different file types.
Description¶
It is useful for content analysis, search indexing, and automated document processing. It provides a single interface for parsing all these file types, making it a cornerstone for many search engines and content analysis tools.
When to use it¶
- When you need to extract text and metadata from a wide variety of file formats (PPT, XLS, PDF, etc.).
- For search engine indexing where you need consistent output across different document types.
- For content analysis, translation, and language detection tasks.
- When building automated document processing pipelines.
When not to use it¶
- If you only need to process one specific format, such as plain text, where a lightweight specialized library (or no library at all) suffices.
- In extremely memory-constrained environments, as the Java-based Tika server can be resource-intensive.
- If you require the highest possible OCR accuracy and speed, specialized OCR engines may be more appropriate (though Tika integrates with Tesseract).
Alternatives¶
- Textract (AWS)
- Unstructured.io
- Pandoc (mainly for document conversion)
Getting started¶
Docker installation¶
The easiest way to run Tika Server is via Docker:
docker run -d -p 9998:9998 --name tika apache/tika
Tika will be available at http://localhost:9998.
Java Application (CLI)¶
You can also use the Tika application jar for local processing without a server:
# Download the latest tika-app jar
curl -O https://archive.apache.org/dist/tika/2.9.2/tika-app-2.9.2.jar
# Run it against a file
java -jar tika-app-2.9.2.jar --text document.pdf
Hello World¶
- Ensure the Tika Docker container is running:
docker ps | grep tika
- Create a test file:
echo "Hello Tika World" > test.txt
- Send it to Tika's REST API:
curl -T test.txt http://localhost:9998/tika
- Tika should return the extracted text:
Hello Tika World
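The same round trip can be sketched in Python against the local server started above. This is a minimal sketch: the `extract_text` helper name and the injectable `put` parameter are illustrative conveniences, not part of Tika; only the PUT to `/tika` with an `Accept: text/plain` header comes from the steps above.

```python
TIKA_URL = "http://localhost:9998/tika"  # server from the Docker step above

def extract_text(data: bytes, url: str = TIKA_URL, put=None) -> str:
    """PUT raw bytes to Tika's /tika endpoint and return the extracted plain text.

    `put` is injectable (defaults to requests.put) so the helper can be
    exercised without a live server.
    """
    if put is None:
        import requests  # lazy import: only needed when talking to a real server
        put = requests.put
    response = put(url, data=data, headers={"Accept": "text/plain"})
    response.raise_for_status()
    return response.text
```

With the container running, `extract_text(b"Hello Tika World")` should echo the text back, matching the curl step above.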
CLI examples¶
You can call the Tika Server's REST API with curl, or use the tika-app JAR for local command-line processing without a server.
# Extract text from a local PDF file using Tika Server
curl -T document.pdf http://localhost:9998/tika
# Extract metadata from a local document in JSON format
curl -H "Accept: application/json" -T document.docx http://localhost:9998/meta
# Detect the MIME type of a file
curl -T image.png http://localhost:9998/detect/stream
# List all available parsers (using tika-app)
java -jar tika-app-2.9.2.jar --list-parsers
API examples¶
You can interact with the Tika Server using any HTTP client.
Python (using requests)¶
import requests
url = "http://localhost:9998/tika"
headers = {"Accept": "text/plain"}
with open("document.pdf", "rb") as f:
    response = requests.put(url, data=f, headers=headers)
print(response.text)
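The same pattern covers the `/meta` endpoint shown in the CLI examples, which returns the document's metadata as JSON. A minimal sketch; the `fetch_metadata` helper name and the injectable `put` parameter are illustrative, not part of Tika:

```python
import json

def fetch_metadata(path, url="http://localhost:9998/meta", put=None):
    """PUT a file to Tika's /meta endpoint and return its metadata as a dict."""
    if put is None:
        import requests  # lazy import: only needed when talking to a real server
        put = requests.put
    with open(path, "rb") as f:
        response = put(url, data=f, headers={"Accept": "application/json"})
    response.raise_for_status()
    return json.loads(response.text)
```

For a PDF, the returned dict typically includes keys such as `Content-Type`, mirroring the curl `/meta` example above.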
Recursive Metadata (curl)¶
The /rmeta endpoint is powerful for getting metadata and text from both the container and all embedded documents (e.g., images inside a Word doc).
curl -T document.docx http://localhost:9998/rmeta/text --header "Accept: application/json"
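The `/rmeta` response is a JSON array with one object per document: the container first, then each embedded part, with extracted text under the `X-TIKA:content` key. A minimal sketch of pulling text out of such a response; the payload below is illustrative sample data, not real Tika output:

```python
import json

# Illustrative /rmeta-shaped payload: a Word container plus one embedded image.
sample = json.loads("""
[
  {"Content-Type": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
   "X-TIKA:content": "Report body text"},
  {"Content-Type": "image/png",
   "X-TIKA:content": "Text OCRed from the embedded image"}
]
""")

def texts_by_type(records):
    """Map each record's Content-Type to its extracted text, skipping empty parts."""
    return {r["Content-Type"]: r.get("X-TIKA:content", "").strip()
            for r in records if r.get("X-TIKA:content")}

for ctype, text in texts_by_type(sample).items():
    print(f"{ctype}: {text}")
```

This is what makes `/rmeta` useful for the Word-doc-with-images case mentioned above: embedded parts arrive as separate records rather than being concatenated into one text blob.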
Backlog¶
- Integrate with n8n for automated PDF-to-Markdown conversion.
Contribution Metadata¶
- Confidence: high
- Last reviewed: 2026-03-02
Sources / References¶
- https://tika.apache.org/
- https://aws.amazon.com/textract/
- https://unstructured.io/
- https://tika.apache.org/2.9.2/gettingstarted.html
- https://cwiki.apache.org/confluence/display/TIKA/TikaServer