Challenge 1A required us to extract a structured document outline from unstructured PDF files. This involved:
- Extracting the document title from the largest font block.
- Identifying section headings using a combination of font features and layout rules.
- Categorizing headings into hierarchical levels (H1, H2, H3) using font size or numbered patterns.
- Saving the extracted data in a clean, structured JSON format.
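As a rough illustration of the first and last steps above, the sketch below picks the title from the spans with the largest font size and serializes the result to JSON. It is not the code in main.py: the span dictionaries only mimic the `text` and `size` fields that PyMuPDF reports for text spans, and the function names and the `title`/`outline`/`level`/`text`/`page` JSON keys are assumptions made for the example.

```python
import json

def extract_title(spans):
    """Join the text of every span that uses the largest font size, in reading order."""
    if not spans:
        return ""
    largest = max(s["size"] for s in spans)
    # A small tolerance keeps multi-line titles together despite rounding noise.
    parts = [s["text"].strip() for s in spans if abs(s["size"] - largest) < 0.1]
    return " ".join(p for p in parts if p)

def to_json(title, outline):
    """Serialize the title and outline (hypothetical schema) as pretty-printed JSON."""
    return json.dumps({"title": title, "outline": outline}, indent=2, ensure_ascii=False)

# Hand-written sample spans standing in for what a PDF parser would return.
sample_spans = [
    {"text": "Understanding", "size": 24.0},
    {"text": "PDF Outlines", "size": 24.0},
    {"text": "1. Introduction", "size": 16.0},
    {"text": "Body text here.", "size": 11.0},
]

title = extract_title(sample_spans)
print(to_json(title, [{"level": "H1", "text": "1. Introduction", "page": 1}]))
```

Because the title is built from all spans at the largest size, a title that wraps across two lines is joined into a single string.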
├── app/
│   ├── main.py            # Main script to extract title + outline from all PDFs
│   ├── requirements.txt   # Dependencies (PyMuPDF, etc.)
│   ├── Dockerfile         # Docker setup for Round 1A
│   ├── input/             # Place PDF files here
│   └── output/            # JSON output files generated
- Extracts clean document titles (even if spread across multiple lines).
- Detects headings based on:
  - Font size
  - Bold/Italic style
  - Numbered pattern (e.g., 1.2, 2.1.3)
- Ensures no duplicate or footer content is included.
- Tags headings with a hierarchical level: H1, H2, H3.
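The rules listed above could be combined roughly as follows. This is a minimal sketch under stated assumptions, not the logic in main.py: each line is a dictionary with `text`, `size`, `flags`, and `page` fields (modeled on PyMuPDF span data, where bit 4 of `flags`, value 16, marks bold text), the body font size is already known, and the title lines have already been removed. The names `heading_level` and `build_outline` are hypothetical.

```python
import re

BOLD_FLAG = 16  # bit 4 of a PyMuPDF span's flags marks bold text
# Matches numbered headings such as "1. Intro", "1.2 Scope", "2.1.3 Details".
NUMBERED = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+\S")

def heading_level(text, size, flags, body_size, size_ranks):
    """Return "H1", "H2", or "H3" for a heading-like line, or None for body text."""
    match = NUMBERED.match(text.strip())
    if match and (size > body_size or flags & BOLD_FLAG):
        # Numbering depth decides the level: "1" -> H1, "1.2" -> H2, deeper -> H3.
        depth = match.group(1).count(".") + 1
        return f"H{min(depth, 3)}"
    if size > body_size and size in size_ranks:
        # Otherwise rank by font size: the largest heading size is H1, and so on.
        return f"H{size_ranks.index(size) + 1}"
    return None

def build_outline(lines, body_size):
    """Classify lines and drop repeated text such as running headers or footers."""
    size_ranks = sorted({l["size"] for l in lines if l["size"] > body_size}, reverse=True)[:3]
    seen, outline = set(), []
    for line in lines:
        text = line["text"].strip()
        level = heading_level(text, line["size"], line["flags"], body_size, size_ranks)
        if level is None or text.lower() in seen:
            continue
        seen.add(text.lower())
        outline.append({"level": level, "text": text, "page": line["page"]})
    return outline

sample_lines = [
    {"text": "1. Introduction", "size": 16.0, "flags": 16, "page": 1},
    {"text": "Body text.", "size": 11.0, "flags": 0, "page": 1},
    {"text": "1.1 Background", "size": 13.0, "flags": 16, "page": 1},
    {"text": "Appendix", "size": 16.0, "flags": 0, "page": 3},
    {"text": "1. Introduction", "size": 16.0, "flags": 16, "page": 4},  # repeated header
]
outline = build_outline(sample_lines, body_size=11.0)
print([(item["level"], item["text"]) for item in outline])
```

In this sample the numbered lines are levelled by their numbering depth, "Appendix" falls back to its font-size rank, and the repeated "1. Introduction" on page 4 is skipped as a duplicate.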
Install required libraries:
    pip install -r requirements.txt
Libraries used:
- PyMuPDF (fitz): for reading PDFs and fonts
- re, os, json: for processing and formatting
Put your PDF files inside the input/ folder.
    python main.py
This will create a .json output for each PDF inside the output/ folder.
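The batch behavior described here (one .json file per PDF) might look something like the sketch below. The `process_folder` helper and its `extract` callback are hypothetical stand-ins for whatever main.py actually does; the callback is assumed to take a PDF path and return a JSON-serializable dictionary.

```python
import json
import os

def process_folder(input_dir, output_dir, extract):
    """Run extract() on every PDF in input_dir and write one JSON file per PDF."""
    os.makedirs(output_dir, exist_ok=True)
    written = []
    for name in sorted(os.listdir(input_dir)):
        if not name.lower().endswith(".pdf"):
            continue  # ignore anything that is not a PDF
        result = extract(os.path.join(input_dir, name))
        out_path = os.path.join(output_dir, os.path.splitext(name)[0] + ".json")
        with open(out_path, "w", encoding="utf-8") as fh:
            json.dump(result, fh, indent=2, ensure_ascii=False)
        written.append(out_path)
    return written
```

Passing the extractor in as a callback keeps the folder handling separate from the PDF parsing, so the loop can be exercised with a stub function and no real PDFs.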
    docker build -t round1a-extractor .
    docker run --rm -v ${PWD}/input:/app/input -v ${PWD}/output:/app/output round1a-extractor
Your output will be saved in the output/ folder.
You can download this submission as a zip:
Strugglers
Feel free to reach out with any questions or suggested improvements!
- Round 1A is offline-compatible.
- No AI models are used for classification; rules are applied based on font metrics.
- Docker makes the setup portable across systems.