iscc-tika offers a fast and efficient solution for extracting content and metadata from various document types such as PDF, Word, HTML, and many other formats. A comprehensive solution in Rust with Python bindings.
Note: This project is a fork of yobix-ai/extractous, revived and maintained by the ISCC Foundation.
- Built with Rust: The core is developed in Rust, leveraging its high performance, memory safety, multi-threading capabilities, and zero-cost abstractions.
- Extended format support with Apache Tika: For file formats not natively supported by the Rust core, we compile the well-known Apache Tika into native shared libraries using GraalVM ahead-of-time compilation technology. These shared libraries are then linked to and called from our Rust core. No local servers, no virtual machines, or any garbage collection, just pure native execution.
- Python bindings: Python binding is a wrapper around the Rust core with the potential to circumvent the Python GIL limitation and make efficient use of multi-cores.
- High-performance unstructured data extraction optimized for speed and low memory usage.
- Clear and simple API for extracting text and metadata content.
- Automatically identifies document types and extracts content accordingly.
- Supports many file formats (most formats supported by Apache Tika).
- Extracts text from images and scanned documents with OCR through tesseract-ocr.
- Core engine written in Rust with bindings for Python.
- Free for Commercial Use: Apache 2.0 License.
- Extract a file content to a string:
from iscc_tika import Extractor
# Create a new extractor
extractor = Extractor()
extractor = extractor.set_extract_string_max_length(1000)
# if you need an xml
# extractor = extractor.set_xml_output(True)
# Extract text from a file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)- Extract only document metadata (faster for text-heavy PDFs, skips full text extraction):
from iscc_tika import Extractor
extractor = Extractor()
# Returns just the metadata dict โ no text extraction overhead.
metadata = extractor.extract_file_metadata("README.md")
print(metadata.get("Content-Type"))
# Also available for bytes and URLs:
# metadata = extractor.extract_bytes_metadata(buffer)
# metadata = extractor.extract_url_metadata("https://www.google.com/")Metadata is best-effort: keys that Tika only populates during content iteration (e.g.
pdf:charsPerPage) may be absent. Core identification metadata (Content-Type, Dublin Core fields,
Office core properties) is present.
- Extracting a file(URL / bytearray) to a buffered stream:
from iscc_tika import Extractor
extractor = Extractor()
# if you need an xml
# extractor = extractor.set_xml_output(True)
# for file
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# for url
# reader, metadata = extractor.extract_url("https://www.google.com")
# for bytearray
# with open("tests/quarkus.pdf", "rb") as file:
# buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)
result = ""
buffer = reader.read(4096)
while len(buffer) > 0:
result += buffer.decode("utf-8")
buffer = reader.read(4096)
print(result)
print(metadata)- Extracting a file with OCR:
You need to have Tesseract installed with the language pack. For example on debian
sudo apt install tesseract-ocr tesseract-ocr-deu
from iscc_tika import Extractor, TesseractOcrConfig
extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("deu"))
result, metadata = extractor.extract_file_to_string(
"../../test_files/documents/eng-ocr.pdf"
)
print(result)
print(metadata)- Extract a file content to a string:
use iscc_tika::Extractor;
fn main() {
// Create a new extractor. Note it uses a consuming builder pattern
let mut extractor = Extractor::new().set_extract_string_max_length(1000);
// if you need an xml
// extractor = extractor.set_xml_output(true);
// Extract text from a file
let (text, metadata) = extractor.extract_file_to_string("README.md").unwrap();
println!("{}", text);
println!("{:?}", metadata);
}- Extract a content of a file(URL/ bytes) to a
StreamReaderand perform buffered reading
use std::io::{BufReader, Read};
// use std::fs::File; use for bytes
use iscc_tika::Extractor;
fn main() {
// Get the command-line arguments
let args: Vec<String> = std::env::args().collect();
let file_path = &args[1];
// Extract the provided file content to a string
let extractor = Extractor::new();
// if you need an xml
// extractor = extractor.set_xml_output(true);
let (stream, metadata) = extractor.extract_file(file_path).unwrap();
// Extract url
// let (stream, metadata) = extractor.extract_url("https://www.google.com/").unwrap();
// Extract bytes
// let mut file = File::open(file_path)?;
// let mut buffer = Vec::new();
// file.read_to_end(&mut buffer)?;
// let (stream, metadata) = extractor.extract_bytes(&file_bytes);
// Because stream implements std::io::Read trait we can perform buffered reading
// For example we can use it to create a BufReader
let mut reader = BufReader::new(stream);
let mut buffer = Vec::new();
reader.read_to_end(&mut buffer).unwrap();
println!("{}", String::from_utf8(buffer).unwrap());
println!("{:?}", metadata);
}- Extract content of PDF with OCR.
You need to have Tesseract installed with the language pack. For example on debian
sudo apt install tesseract-ocr tesseract-ocr-deu
use iscc_tika::Extractor;
fn main() {
let file_path = "../test_files/documents/deu-ocr.pdf";
let extractor = Extractor::new()
.set_ocr_config(TesseractOcrConfig::new().set_language("deu"))
.set_pdf_config(PdfParserConfig::new().set_ocr_strategy(PdfOcrStrategy::OCR_ONLY));
// extract file with extractor
let (content, metadata) = extractor.extract_file_to_string(file_path).unwrap();
println!("{}", content);
println!("{:?}", metadata);
}| Category | Supported Formats | Notes |
|---|---|---|
| Microsoft Office | DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF | Includes legacy and modern Office file formats |
| OpenOffice | ODT, ODS, ODP | OpenDocument formats |
| Can extracts embedded content and supports OCR | ||
| Spreadsheets | CSV, TSV | Plain text spreadsheet formats |
| Web Documents | HTML, XML | Parses and extracts content from web documents |
| E-Books | EPUB | EPUB format for electronic books |
| Text Files | TXT, Markdown | Plain text formats |
| Images | PNG, JPEG, TIFF, BMP, GIF, ICO, PSD, SVG | Extracts embedded text with OCR |
| EML, MSG, MBOX, PST | Extracts content, headers, and attachments |
- Rust (stable toolchain)
- uv (Python package manager)
- mise (task runner)
- GraalVM / JDK 17+ (for building tika-native)
Install dev tools and git hooks:
uv tool install prek ruff taplo yamlfix
uv tool install mdformat --with 'mdformat-mkdocs[recommended]'
prek install && prek install --hook-type pre-pushRun tasks with mise run <task>:
| Task | Description |
|---|---|
check |
Run all pre-commit hooks on all files |
lint |
Run format checks, clippy, and ruff |
format |
Run pre-commit auto-fix hooks |
test |
Run all tests (cargo test + pytest) |
version:check |
Check that core and Python package versions match |
pr:main |
Create a PR from develop to main |
Quality gates are enforced automatically via git hooks:
- Pre-commit (fast, auto-fix): file hygiene, markdown formatting, cargo-fmt, ruff, taplo, yamlfix
- Pre-push (thorough): cargo-clippy, cargo-test, security scan, complexity check, pytest
Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or new features to propose.
This project is licensed under the Apache License 2.0. See the LICENSE file for details.