iscc-tika

iscc-tika offers a fast and efficient solution for extracting content and metadata from various document types such as PDF, Word, HTML, and many other formats. A comprehensive solution in Rust with Python bindings.

Note: This project is a fork of yobix-ai/extractous, revived and maintained by the ISCC Foundation.

Why iscc-tika?

Built with Rust: The core is developed in Rust, leveraging its high performance, memory safety, multi-threading capabilities, and zero-cost abstractions.
Extended format support with Apache Tika: For file formats not natively supported by the Rust core, we compile the well-known Apache Tika into native shared libraries using GraalVM ahead-of-time compilation technology. These shared libraries are then linked to and called from our Rust core. No local servers, no virtual machines, or any garbage collection, just pure native execution.
Python bindings: Python binding is a wrapper around the Rust core with the potential to circumvent the Python GIL limitation and make efficient use of multi-cores.

🌳 Key Features

High-performance unstructured data extraction optimized for speed and low memory usage.
Clear and simple API for extracting text and metadata content.
Automatically identifies document types and extracts content accordingly.
Supports many file formats (most formats supported by Apache Tika).
Extracts text from images and scanned documents with OCR through tesseract-ocr.
Core engine written in Rust with bindings for Python.
Free for Commercial Use: Apache 2.0 License.

🚀 Quickstart

Python

Extract a file content to a string:

from iscc_tika import Extractor

# Create a new extractor
extractor = Extractor()
extractor = extractor.set_extract_string_max_length(1000)
# if you need an xml
# extractor = extractor.set_xml_output(True)

# Extract text from a file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)

Extract only document metadata (faster for text-heavy PDFs, skips full text extraction):

from iscc_tika import Extractor

extractor = Extractor()

# Returns just the metadata dict — no text extraction overhead.
metadata = extractor.extract_file_metadata("README.md")
print(metadata.get("Content-Type"))

# Also available for bytes and URLs:
# metadata = extractor.extract_bytes_metadata(buffer)
# metadata = extractor.extract_url_metadata("https://www.google.com/")

Metadata is best-effort: keys that Tika only populates during content iteration (e.g. pdf:charsPerPage) may be absent. Core identification metadata (Content-Type, Dublin Core fields, Office core properties) is present.

Extracting a file(URL / bytearray) to a buffered stream:

from iscc_tika import Extractor

extractor = Extractor()
# if you need an xml
# extractor = extractor.set_xml_output(True)

# for file
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# for url
# reader, metadata = extractor.extract_url("https://www.google.com")
# for bytearray
# with open("tests/quarkus.pdf", "rb") as file:
#     buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)

result = ""
buffer = reader.read(4096)
while len(buffer) > 0:
    result += buffer.decode("utf-8")
    buffer = reader.read(4096)

print(result)
print(metadata)

Extracting a file with OCR:

You need to have Tesseract installed with the language pack. For example on debian sudo apt install tesseract-ocr tesseract-ocr-deu

from iscc_tika import Extractor, TesseractOcrConfig

extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("deu"))
result, metadata = extractor.extract_file_to_string(
    "../../test_files/documents/eng-ocr.pdf"
)

print(result)
print(metadata)

Rust

Extract a file content to a string:

use iscc_tika::Extractor;

fn main() {
    // Create a new extractor. Note it uses a consuming builder pattern
    let mut extractor = Extractor::new().set_extract_string_max_length(1000);
    // if you need an xml
    // extractor = extractor.set_xml_output(true);

    // Extract text from a file
    let (text, metadata) = extractor.extract_file_to_string("README.md").unwrap();
    println!("{}", text);
    println!("{:?}", metadata);
}

Extract a content of a file(URL/ bytes) to a StreamReader and perform buffered reading

use std::io::{BufReader, Read};
// use std::fs::File; use for bytes
use iscc_tika::Extractor;

fn main() {
    // Get the command-line arguments
    let args: Vec<String> = std::env::args().collect();
    let file_path = &args[1];

    // Extract the provided file content to a string
    let extractor = Extractor::new();
    // if you need an xml
    // extractor = extractor.set_xml_output(true);

    let (stream, metadata) = extractor.extract_file(file_path).unwrap();
    // Extract url
    // let (stream, metadata) = extractor.extract_url("https://www.google.com/").unwrap();
    // Extract bytes
    // let mut file = File::open(file_path)?;
    // let mut buffer = Vec::new();
    // file.read_to_end(&mut buffer)?;
    // let (stream, metadata) = extractor.extract_bytes(&file_bytes);

    // Because stream implements std::io::Read trait we can perform buffered reading
    // For example we can use it to create a BufReader
    let mut reader = BufReader::new(stream);
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer).unwrap();

    println!("{}", String::from_utf8(buffer).unwrap());
    println!("{:?}", metadata);
}

Extract content of PDF with OCR.

You need to have Tesseract installed with the language pack. For example on debian sudo apt install tesseract-ocr tesseract-ocr-deu

use iscc_tika::Extractor;

fn main() {
  let file_path = "../test_files/documents/deu-ocr.pdf";

    let extractor = Extractor::new()
          .set_ocr_config(TesseractOcrConfig::new().set_language("deu"))
          .set_pdf_config(PdfParserConfig::new().set_ocr_strategy(PdfOcrStrategy::OCR_ONLY));
    // extract file with extractor
  let (content, metadata) = extractor.extract_file_to_string(file_path).unwrap();
  println!("{}", content);
  println!("{:?}", metadata);
}

📄 Supported file formats

Category	Supported Formats	Notes
Microsoft Office	DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF	Includes legacy and modern Office file formats
OpenOffice	ODT, ODS, ODP	OpenDocument formats
PDF	PDF	Can extracts embedded content and supports OCR
Spreadsheets	CSV, TSV	Plain text spreadsheet formats
Web Documents	HTML, XML	Parses and extracts content from web documents
E-Books	EPUB	EPUB format for electronic books
Text Files	TXT, Markdown	Plain text formats
Images	PNG, JPEG, TIFF, BMP, GIF, ICO, PSD, SVG	Extracts embedded text with OCR
E-Mail	EML, MSG, MBOX, PST	Extracts content, headers, and attachments

Development

Prerequisites

Rust (stable toolchain)
uv (Python package manager)
mise (task runner)
GraalVM / JDK 17+ (for building tika-native)

Setup

Install dev tools and git hooks:

uv tool install prek ruff taplo yamlfix
uv tool install mdformat --with 'mdformat-mkdocs[recommended]'
prek install && prek install --hook-type pre-push

Tasks

Run tasks with mise run <task>:

Task	Description
`check`	Run all pre-commit hooks on all files
`lint`	Run format checks, clippy, and ruff
`format`	Run pre-commit auto-fix hooks
`test`	Run all tests (cargo test + pytest)
`version:check`	Check that core and Python package versions match
`pr:main`	Create a PR from develop to main

Quality Gates

Quality gates are enforced automatically via git hooks:

Pre-commit (fast, auto-fix): file hygiene, markdown formatting, cargo-fmt, ruff, taplo, yamlfix
Pre-push (thorough): cargo-clippy, cargo-test, security scan, complexity check, pytest

🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or new features to propose.

🕮 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.

Name		Name	Last commit message	Last commit date
Latest commit History 357 Commits
.cargo		.cargo
.claude/skills/release		.claude/skills/release
.devcontainer		.devcontainer
.github/workflows		.github/workflows
bindings/iscc-tika-python		bindings/iscc-tika-python
docs		docs
iscc-tika-core		iscc-tika-core
test_files		test_files
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
CHANGELOG.md		CHANGELOG.md
CLAUDE.md		CLAUDE.md
LICENSE		LICENSE
README.md		README.md
mise.toml		mise.toml
ruff.toml		ruff.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

iscc-tika

Why iscc-tika?

🌳 Key Features

🚀 Quickstart

Python

Rust

📄 Supported file formats

Development

Prerequisites

Setup

Tasks

Quality Gates

🤝 Contributing

🕮 License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

iscc-tika

Why iscc-tika?

🌳 Key Features

🚀 Quickstart

Python

Rust

📄 Supported file formats

Development

Prerequisites

Setup

Tasks

Quality Gates

🤝 Contributing

🕮 License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages