folioforge

A library that wraps several PDF extraction tools behind common workflows and tooling.

It currently supports Docling and three DocLayout-YOLO models (DocLayNet, D4LA and DocStructBench), the latter combined with PaddleOCR for text recognition.

Installation

For CLI use:

$ uv tool install git+https://github.com/SwissDataScienceCenter/folioforge

As a library:

$ uv add "folioforge @ git+https://github.com/SwissDataScienceCenter/folioforge"

Usage

CLI

$ folioforge convert myfile.pdf

 Usage: folioforge convert [OPTIONS] PATHS...

 Convert PDF documents to text.

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    paths      PATHS...  path of PDFs to convert [required]                                                                            │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --preprocessor                  [pdf]                                               the preprocessor to use, can specify multiple       │
│                                                                                     [default: (dynamic)]                                │
│ --pipeline                      [simple|dask]                                       the pipeline executor [default: simple]             │
│ --extractor                     [docling|doclayout_yolo_doclaynet|doclayout_yolo_d  the model to use [default: docling]                 │
│                                 4la|doclayout_yolo_docstructbench]                                                                      │
│ --format                        [passthrough|markdown|json|html]                    what format to create results in                    │
│                                                                                     [default: markdown]                                 │
│ --debug           --no-debug                                                        turn on debug mode, stores annotated images in      │
│                                                                                     output folder                                       │
│                                                                                     [default: no-debug]                                 │
│ --confidence                    FLOAT                                               the minimum confidence threshold for layout         │
│                                                                                     detection                                           │
│                                                                                     [default: 0.2]                                      │
│ --help                                                                              Show this message and exit.                         │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
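When the CLI is driven from a script, it can help to assemble the `convert` invocation programmatically. The following is a minimal sketch that only uses flags listed in the help text above; the helper name `build_convert_command` and its parameter names are hypothetical and not part of folioforge.

```python
import shlex


def build_convert_command(paths, extractor="docling", output_format="markdown",
                          confidence=0.2, debug=False):
    """Assemble an argv list for `folioforge convert` (hypothetical helper).

    Only flags shown in the help text are used; the defaults mirror the
    documented defaults.
    """
    cmd = [
        "folioforge", "convert",
        "--extractor", extractor,
        "--format", output_format,
        "--confidence", str(confidence),
    ]
    if debug:
        # --debug stores annotated images in the output folder
        cmd.append("--debug")
    cmd.extend(str(p) for p in paths)
    return cmd


command = build_convert_command(
    ["myfile.pdf"],
    extractor="doclayout_yolo_doclaynet",
    output_format="json",
    confidence=0.5,
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)`, which requires the `folioforge` executable to be on the `PATH`.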

Usage: folioforge evaluate [OPTIONS] PATHS...

 Run a selection of models over a set of PDFs and create a report comparing the results.

╭─ Arguments ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    paths      PATHS...  [required]                                                                                                    │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --preprocessor        [pdf]                                                    [default: (dynamic)]                                     │
│ --extractors          [docling|doclayout_yolo_doclaynet|doclayout_yolo_d4la|d  [default: (dynamic)]                                     │
│                       oclayout_yolo_docstructbench]                                                                                     │
│ --pipeline            [simple|dask]                                            [default: simple]                                        │
│ --output              PATH                                                     [default: output.html]                                   │
│ --help                                                                         Show this message and exit.                              │
╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
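An `evaluate` run over several models can be prepared in a similar way. This sketch checks the requested extractor names against the choices listed in the help text before building the command. It assumes that `--extractors` is repeated once per value, which the help text does not state explicitly; `build_evaluate_command` and `KNOWN_EXTRACTORS` are hypothetical names.

```python
# Extractor names as listed in the help text above
KNOWN_EXTRACTORS = {
    "docling",
    "doclayout_yolo_doclaynet",
    "doclayout_yolo_d4la",
    "doclayout_yolo_docstructbench",
}


def build_evaluate_command(paths, extractors, output="output.html", pipeline="simple"):
    """Assemble an argv list for `folioforge evaluate` (hypothetical helper)."""
    unknown = sorted(set(extractors) - KNOWN_EXTRACTORS)
    if unknown:
        raise ValueError(f"unknown extractor(s): {', '.join(unknown)}")
    cmd = ["folioforge", "evaluate", "--pipeline", pipeline, "--output", str(output)]
    for name in extractors:
        # Assumption: the option is passed once per extractor
        cmd += ["--extractors", name]
    cmd.extend(str(p) for p in paths)
    return cmd


evaluate_command = build_evaluate_command(
    ["a.pdf", "b.pdf"],
    extractors=["docling", "doclayout_yolo_d4la"],
    output="report.html",
)
```

Validating the names up front means a typo is reported before any model is loaded.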

Library

from folioforge.pipeline.simple import SimplePipelineExecutor
from folioforge.preprocessor.pdf import PDFPreprocessor
from folioforge.extraction.docling import DoclingExtractor
from folioforge.output.markdown import MarkdownGenerator

paths = ["myfile.pdf"]

pipeline = SimplePipelineExecutor.setup(
    preprocessors=[PDFPreprocessor()],
    extractor=DoclingExtractor(),
    format=MarkdownGenerator(),
)
result = pipeline.execute(paths)
for r in result:
    print(r)
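The example above passes a hard-coded list of paths. For a whole folder of documents, the list could be gathered with the standard library first. This is a small sketch that does not use any folioforge API; `collect_pdfs` is a hypothetical helper name.

```python
from pathlib import Path


def collect_pdfs(directory, recursive=False):
    """Return a sorted list of PDF file paths (as strings) found in `directory`.

    Matching follows the file system's case rules, so `*.PDF` files may be
    skipped on case-sensitive systems.
    """
    root = Path(directory)
    pattern = "**/*.pdf" if recursive else "*.pdf"
    return sorted(str(p) for p in root.glob(pattern) if p.is_file())
```

The returned list can be used as `paths` in the call to `pipeline.execute(paths)`.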
