iscc-tika

iscc-tika extracts content and metadata quickly and efficiently from many document types, including PDF, Word, and HTML. The core is written in Rust and ships with Python bindings.


Note: This project is a fork of yobix-ai/extractous, revived and maintained by the ISCC Foundation.

Why iscc-tika?

  • Built with Rust: The core is developed in Rust, leveraging its high performance, memory safety, multi-threading capabilities, and zero-cost abstractions.
  • Extended format support with Apache Tika: For file formats not natively supported by the Rust core, we compile Apache Tika into native shared libraries using GraalVM ahead-of-time compilation. These shared libraries are then linked to and called from the Rust core. No local servers, no virtual machines, and no garbage collection, just pure native execution.
  • Python bindings: The Python bindings are a thin wrapper around the Rust core, with the potential to circumvent the Python GIL limitation and make efficient use of multiple cores.
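
The multi-core point in the last bullet can be sketched as follows. This is a minimal sketch, not part of iscc-tika: `extract_many` is a hypothetical helper, and it only scales across cores if the extraction call releases the GIL, which the text above describes as a potential rather than a guarantee. The extraction function is passed in as a parameter so the pattern can be tried with any callable.

```python
from concurrent.futures import ThreadPoolExecutor


def extract_many(paths, extract_one, max_workers=4):
    """Run extract_one(path) for every path on a thread pool.

    Hypothetical helper: returns a dict mapping each path to either the
    result of extract_one or the exception it raised.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {path: pool.submit(extract_one, path) for path in paths}
        for path, future in futures.items():
            try:
                results[path] = future.result()
            except Exception as exc:  # keep going if one file fails
                results[path] = exc
    return results


# Stand-in for a real call such as Extractor().extract_file_to_string(path);
# swap it in once iscc_tika is installed.
def fake_extract(path):
    return f"text of {path}", {"Content-Type": "text/plain"}


print(extract_many(["a.txt", "b.txt"], fake_extract))
```

Collecting exceptions per path, rather than letting the first failure abort the batch, is a design choice that suits large mixed-format collections where a few unreadable files are expected.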

🌳 Key Features

  • High-performance unstructured data extraction optimized for speed and low memory usage.
  • Clear and simple API for extracting text and metadata content.
  • Automatically identifies document types and extracts content accordingly.
  • Supports many file formats (most formats supported by Apache Tika).
  • Extracts text from images and scanned documents with OCR through tesseract-ocr.
  • Core engine written in Rust with bindings for Python.
  • Free for Commercial Use: Apache 2.0 License.
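
To make the "automatically identifies document types" feature more concrete, here is a minimal sketch of content-based detection using well-known leading byte signatures. It only illustrates the general idea: it is not how iscc-tika or Apache Tika implement detection, and `SIGNATURES` and `sniff_type` are hypothetical names introduced for this example.

```python
# Leading byte signatures ("magic numbers") for a few common formats.
SIGNATURES = [
    (b"%PDF-", "application/pdf"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    # ZIP is also the container for DOCX, XLSX, PPTX, EPUB, and ODT,
    # so a real detector has to look inside the archive as well.
    (b"PK\x03\x04", "application/zip"),
]


def sniff_type(data, default="application/octet-stream"):
    """Guess a MIME type from the first bytes of a file."""
    for magic, mime in SIGNATURES:
        if data.startswith(magic):
            return mime
    return default


print(sniff_type(b"%PDF-1.7\n..."))
print(sniff_type(b"plain text"))
```

In practice the detected type is available in the returned metadata under Content-Type, so a sketch like this is only useful for understanding the mechanism.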

🚀 Quickstart

Python

  • Extract a file's content to a string:
from iscc_tika import Extractor

# Create a new extractor
extractor = Extractor()
extractor = extractor.set_extract_string_max_length(1000)
# Uncomment for XML output instead of plain text
# extractor = extractor.set_xml_output(True)

# Extract text from a file
result, metadata = extractor.extract_file_to_string("README.md")
print(result)
print(metadata)
  • Extract only document metadata (faster for text-heavy PDFs, skips full text extraction):
from iscc_tika import Extractor

extractor = Extractor()

# Returns just the metadata dict; no text extraction overhead.
metadata = extractor.extract_file_metadata("README.md")
print(metadata.get("Content-Type"))

# Also available for bytes and URLs:
# metadata = extractor.extract_bytes_metadata(buffer)
# metadata = extractor.extract_url_metadata("https://www.google.com/")

Metadata is best-effort: keys that Tika only populates during content iteration (e.g. pdf:charsPerPage) may be absent. Core identification metadata (Content-Type, Dublin Core fields, Office core properties) is present.
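
Because some keys may be absent, lookups are safer with explicit fallbacks. The helper below is a minimal sketch: `first_value` is a hypothetical name, the sample dict is hand-written rather than real extractor output, and whether values arrive as plain strings or as lists of strings is an assumption, so the helper accepts both.

```python
def first_value(metadata, keys, default=None):
    """Return the first non-empty value found under any of the given keys.

    Accepts values stored either as a plain string or as a list of strings
    (which of the two the bindings return is an assumption here).
    """
    for key in keys:
        value = metadata.get(key)
        if isinstance(value, (list, tuple)):
            value = value[0] if value else None
        if value not in (None, ""):
            return value
    return default


# Hand-written dict standing in for extractor output.
sample = {"Content-Type": ["application/pdf"], "dc:title": ""}

print(first_value(sample, ["Content-Type"]))
print(first_value(sample, ["pdf:charsPerPage"], default="n/a"))
print(first_value(sample, ["dc:title", "title"], default="untitled"))
```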

  • Extract a file (or a URL or bytearray) to a buffered stream:
from iscc_tika import Extractor

extractor = Extractor()
# Uncomment for XML output instead of plain text
# extractor = extractor.set_xml_output(True)

# for file
reader, metadata = extractor.extract_file("tests/quarkus.pdf")
# for url
# reader, metadata = extractor.extract_url("https://www.google.com")
# for bytearray
# with open("tests/quarkus.pdf", "rb") as file:
#     buffer = bytearray(file.read())
# reader, metadata = extractor.extract_bytes(buffer)

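# Collect the raw bytes first: decoding each 4096-byte chunk on its own can
# fail when a multi-byte UTF-8 character spans a chunk boundary.
data = bytearray()
buffer = reader.read(4096)
while len(buffer) > 0:
    data += buffer
    buffer = reader.read(4096)
result = data.decode("utf-8")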

print(result)
print(metadata)
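
An alternative to the read loop above is to decode incrementally, which stays correct even when a multi-byte UTF-8 character is split across two reads. This is a minimal sketch using only the standard library: `read_text` is a hypothetical helper, and `io.BytesIO` stands in for the reader returned by `extract_file`, on the assumption that the reader exposes `read(n)` returning bytes or a bytearray, as in the loop above.

```python
import codecs
import io


def read_text(reader, chunk_size=4096, encoding="utf-8"):
    """Read everything from reader.read(n) and decode it incrementally."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    parts = []
    while True:
        chunk = reader.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers incomplete byte sequences until the next chunk.
        parts.append(decoder.decode(bytes(chunk)))
    # Flush whatever is still buffered at the end of the stream.
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


# A tiny chunk size forces multi-byte characters to straddle chunk boundaries.
demo = io.BytesIO("Grüße aus Köln".encode("utf-8"))
print(read_text(demo, chunk_size=3))
```
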
  • Extracting a file with OCR:

You need to have Tesseract installed along with the language pack you want to use. For example, on Debian: sudo apt install tesseract-ocr tesseract-ocr-deu

from iscc_tika import Extractor, TesseractOcrConfig

extractor = Extractor().set_ocr_config(TesseractOcrConfig().set_language("deu"))
result, metadata = extractor.extract_file_to_string(
    "../../test_files/documents/deu-ocr.pdf"
)

print(result)
print(metadata)
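
Before enabling OCR it can help to confirm that Tesseract and the required language pack are actually installed. The sketch below shells out to `tesseract --list-langs`; the helper names are hypothetical, and the parsing assumes the usual output layout of one header line followed by one language code per line.

```python
import shutil
import subprocess


def parse_list_langs(output):
    """Parse the text printed by `tesseract --list-langs` into a set of codes."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first line is a header such as
    # 'List of available languages in "..." (3):'
    return set(lines[1:])


def installed_ocr_languages():
    """Return installed Tesseract language codes, or an empty set if unavailable."""
    if shutil.which("tesseract") is None:
        return set()
    proc = subprocess.run(
        ["tesseract", "--list-langs"], capture_output=True, text=True, check=False
    )
    # Some Tesseract versions print the list to stderr instead of stdout.
    return parse_list_langs(proc.stdout or proc.stderr)


if "deu" not in installed_ocr_languages():
    print("German language pack not found; install tesseract-ocr-deu first.")
```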

Rust

  • Extract a file's content to a string:
use iscc_tika::Extractor;

fn main() {
    // Create a new extractor. Note it uses a consuming builder pattern
    let extractor = Extractor::new().set_extract_string_max_length(1000);
    // Uncomment for XML output instead of plain text
    // let extractor = extractor.set_xml_output(true);

    // Extract text from a file
    let (text, metadata) = extractor.extract_file_to_string("README.md").unwrap();
    println!("{}", text);
    println!("{:?}", metadata);
}
  • Extract the content of a file (or a URL or bytes) to a StreamReader and perform buffered reading:
use std::io::{BufReader, Read};
// use std::fs::File; // needed only for the bytes variant below
use iscc_tika::Extractor;

fn main() {
    // Get the command-line arguments
    let args: Vec<String> = std::env::args().collect();
    let file_path = &args[1];

    // Extract the provided file's content as a stream
    let extractor = Extractor::new();
    // Uncomment for XML output instead of plain text
    // let extractor = extractor.set_xml_output(true);

    let (stream, metadata) = extractor.extract_file(file_path).unwrap();
    // Extract from a URL
    // let (stream, metadata) = extractor.extract_url("https://www.google.com/").unwrap();
    // Extract from bytes
    // let mut file = File::open(file_path).unwrap();
    // let mut file_bytes = Vec::new();
    // file.read_to_end(&mut file_bytes).unwrap();
    // let (stream, metadata) = extractor.extract_bytes(&file_bytes).unwrap();

    // Because stream implements std::io::Read trait we can perform buffered reading
    // For example we can use it to create a BufReader
    let mut reader = BufReader::new(stream);
    let mut buffer = Vec::new();
    reader.read_to_end(&mut buffer).unwrap();

    println!("{}", String::from_utf8(buffer).unwrap());
    println!("{:?}", metadata);
}
  • Extract the content of a PDF with OCR:

You need to have Tesseract installed along with the language pack you want to use. For example, on Debian: sudo apt install tesseract-ocr tesseract-ocr-deu

// Assumes the config types are exported from the crate root alongside Extractor
use iscc_tika::{Extractor, PdfOcrStrategy, PdfParserConfig, TesseractOcrConfig};

fn main() {
    let file_path = "../test_files/documents/deu-ocr.pdf";

    // Configure German OCR and force OCR-only processing for PDFs
    let extractor = Extractor::new()
        .set_ocr_config(TesseractOcrConfig::new().set_language("deu"))
        .set_pdf_config(PdfParserConfig::new().set_ocr_strategy(PdfOcrStrategy::OCR_ONLY));

    // Extract the file with the configured extractor
    let (content, metadata) = extractor.extract_file_to_string(file_path).unwrap();
    println!("{}", content);
    println!("{:?}", metadata);
}

📄 Supported file formats

| Category | Supported Formats | Notes |
|---|---|---|
| Microsoft Office | DOC, DOCX, PPT, PPTX, XLS, XLSX, RTF | Includes legacy and modern Office file formats |
| OpenOffice | ODT, ODS, ODP | OpenDocument formats |
| PDF | PDF | Extracts embedded content and supports OCR |
| Spreadsheets | CSV, TSV | Plain text spreadsheet formats |
| Web Documents | HTML, XML | Parses and extracts content from web documents |
| E-Books | EPUB | EPUB format for electronic books |
| Text Files | TXT, Markdown | Plain text formats |
| Images | PNG, JPEG, TIFF, BMP, GIF, ICO, PSD, SVG | Extracts embedded text with OCR |
| E-Mail | EML, MSG, MBOX, PST | Extracts content, headers, and attachments |
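
When processing a directory, a simple extension filter based on the table above can preselect candidate files. This is a minimal sketch: `LISTED_EXTENSIONS`, `has_listed_extension`, and `candidate_files` are hypothetical names, the set covers only the formats listed here (the set handled through Apache Tika is larger), and file extensions are only a hint, since the library identifies document types from content.

```python
from pathlib import Path

# Extensions for the formats listed in the table; not an exhaustive list.
LISTED_EXTENSIONS = {
    ".doc", ".docx", ".ppt", ".pptx", ".xls", ".xlsx", ".rtf",
    ".odt", ".ods", ".odp", ".pdf", ".csv", ".tsv", ".html", ".xml",
    ".epub", ".txt", ".md", ".png", ".jpg", ".jpeg", ".tif", ".tiff",
    ".bmp", ".gif", ".ico", ".psd", ".svg", ".eml", ".msg", ".mbox", ".pst",
}


def has_listed_extension(path):
    """Return True if the file name ends in one of the listed extensions."""
    return Path(path).suffix.lower() in LISTED_EXTENSIONS


def candidate_files(root):
    """Return all files under root whose extension appears in the table."""
    return sorted(
        p for p in Path(root).rglob("*") if p.is_file() and has_listed_extension(p)
    )


print(has_listed_extension("Report.PDF"))
print(has_listed_extension("archive.tar.gz"))
```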

Development

Prerequisites

  • Rust (stable toolchain)
  • uv (Python package manager)
  • mise (task runner)
  • GraalVM / JDK 17+ (for building tika-native)

Setup

Install dev tools and git hooks:

uv tool install prek ruff taplo yamlfix
uv tool install mdformat --with 'mdformat-mkdocs[recommended]'
prek install && prek install --hook-type pre-push

Tasks

Run tasks with mise run <task>:

| Task | Description |
|---|---|
| check | Run all pre-commit hooks on all files |
| lint | Run format checks, clippy, and ruff |
| format | Run pre-commit auto-fix hooks |
| test | Run all tests (cargo test + pytest) |
| version:check | Check that core and Python package versions match |
| pr:main | Create a PR from develop to main |

Quality Gates

Quality gates are enforced automatically via git hooks:

  • Pre-commit (fast, auto-fix): file hygiene, markdown formatting, cargo-fmt, ruff, taplo, yamlfix
  • Pre-push (thorough): cargo-clippy, cargo-test, security scan, complexity check, pytest

🤝 Contributing

Contributions are welcome! Please open an issue or submit a pull request if you have any improvements or new features to propose.

🕮 License

This project is licensed under the Apache License 2.0. See the LICENSE file for details.
