Skip to content

Not all pages contain text. Empty sheets in books/volumes do not require transcriptions. However, sometimes an (upload) error may have occurred and one could want to check whether there might be empty pages within the PageXML that are unexpected. This script reads the pagexml-files and provides a list of pages (per document) that have no text.

License

Notifications You must be signed in to change notification settings

CARomein/PageXML_Empty_Page_Finder

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

Empty Page Detection Tool for Transkribus PageXML

A Python tool for identifying pages without transcribed text content in Transkribus PageXML collections. The tool scans through collection directories, examines each PageXML file, and generates an Excel report listing all pages that contain no transcribed text.

Purpose

During the transcription process of historical sources in Transkribus, researchers may encounter pages that should not be transcribed (blank pages, illustrations without text, severely damaged folios, or pages that were inadvertently included in the export). Once collections grow beyond a few dozen pages, manually identifying these empty pages becomes impractical. This tool addresses this challenge by automatically scanning PageXML files and identifying pages where no text has been transcribed, enabling researchers to verify whether these pages were intentionally left blank or require attention.

The tool examines the <Unicode> elements within <TextLine> elements in each PageXML file. A page is considered empty if it either contains no TextLine elements at all, or if all TextLine elements contain empty or whitespace-only Unicode content. This approach ensures that pages with any transcribed text, regardless of length, are not flagged as empty.

Requirements

The tool requires Python 3.7 or higher and relies primarily on the Python standard library. For Excel output, the openpyxl library is required. If openpyxl is not available, the tool automatically falls back to CSV output format.

To install the Excel dependency, execute the following command:

pip install openpyxl --break-system-packages

The --break-system-packages flag is necessary for newer Python installations that enforce PEP 668 restrictions on system-wide package installations.

Installation

No formal installation is required beyond ensuring Python 3.7 or higher is available on the system. Download the detect_empty_pages.py script and place it in a convenient location. The script can be executed directly from any directory, and the output file will be created in the directory where the script is invoked (or in a specified output location).

PageXML Collection Structure

The tool expects PageXML collections organised according to the following structure:

collections_base/
├── Collection_A/
│   └── page/
│       ├── 0001_archive_collection_00001.xml
│       ├── 0002_archive_collection_00002.xml
│       └── 0003_archive_collection_00003.xml
├── Collection_B/
│   └── page/
│       ├── 0001_archive_collection_00001.xml
│       └── 0002_archive_collection_00002.xml
└── Collection_C/
    └── page/
        ├── 0001_archive_collection_00001.xml
        ├── 0002_archive_collection_00002.xml
        └── 0003_archive_collection_00003.xml

Each collection directory must contain a subdirectory named page that houses the individual PageXML files. The script automatically discovers all collections by scanning for directories containing such page subdirectories with XML files.

Usage

Basic Invocation

To scan all collections in a base directory and generate a report:

python detect_empty_pages.py /path/to/collections

The tool processes all PageXML files across all collections and generates an Excel file named empty_pages.xlsx in the same directory where the detect_empty_pages.py script is located.

Specifying Output Location

To save the report in a specific location or with a custom filename:

python detect_empty_pages.py /path/to/collections --output /path/to/my_report.xlsx

Interactive Path Entry

If the base path is not provided as a command-line argument, the tool prompts for interactive entry:

python detect_empty_pages.py

The tool will display: Enter path to collections directory:

Suppressing Progress Output

For use in automated workflows or when cleaner output is desired:

python detect_empty_pages.py /path/to/collections --quiet

Command-Line Options

The tool accepts the following arguments and options:

Positional Arguments:

  • base_path: Path to the base directory containing PageXML collections (optional if interactive entry is preferred)

Optional Arguments:

  • -h, --help: Display help message and exit
  • -o OUTPUT, --output OUTPUT: Specify output Excel file path (default: empty_pages.xlsx in script directory)
  • -q, --quiet: Suppress progress output during processing

Output Format

The generated Excel file contains three columns:

Collection: The name of the collection directory (e.g., "Collection_A", "0018")

Image Filename: The original image filename as recorded in the PageXML imageFilename attribute of the <Page> element. This typically reflects the scan filename (e.g., "0051_NL-ZlHCO_0003.1_0001_00051.jpg")

XML Filename: The name of the PageXML file itself (e.g., "0051_NL-ZlHCO_0003_1_0001_00051.xml")

The Excel file includes formatted headers with styling for improved readability. Column widths are automatically adjusted to accommodate typical filename lengths. If the openpyxl library is not available, the tool creates a CSV file instead, which can be opened in Excel, LibreOffice Calc, or any spreadsheet application.

Example Session

A typical usage session might proceed as follows:

$ python detect_empty_pages.py ~/Documents/Resoluties/PageXML

=== Empty Page Detection Tool ===

Found 3 collection(s):
  - 0018
  - 0019
  - 0020

  Processing collection: 0018
  Found 156 XML files
    Processed 156/156 files    
  Found 3 empty page(s) in 0018

  Processing collection: 0019
  Found 142 XML files
    Processed 142/142 files    
  Found 0 empty page(s) in 0019

  Processing collection: 0020
  Found 178 XML files
    Processed 178/178 files    
  Found 2 empty page(s) in 0020

=== Summary ===
Total empty pages found: 5
Collections processed: 3

Generating output file...
✓ Report generated: /home/user/empty_pages.xlsx
  Open the file in Excel or LibreOffice to view the results

How It Works

Collection Discovery

The tool begins by scanning the provided base directory for valid collection structures. It iterates through all immediate subdirectories, checking each for the presence of a page subdirectory containing XML files. This approach accommodates both single-collection and multi-collection directory structures without requiring additional configuration.

Collections are processed in alphabetical order by directory name, ensuring consistent output across multiple executions.

Empty Page Detection

For each PageXML file, the tool employs Python's ElementTree parser with namespace awareness. The PageXML format uses the namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15, which the tool registers for consistent element lookup.

The detection algorithm examines all <TextLine> elements within each page. If no TextLine elements exist, the page is immediately classified as empty. If TextLine elements are present, the tool searches for <Unicode> elements within each TextLine. A page is considered to contain text only if at least one Unicode element contains non-whitespace characters. This approach ensures that pages with even minimal transcribed content (such as page numbers or marginalia) are not incorrectly flagged as empty.

Metadata Extraction

For each identified empty page, the tool extracts the image filename from the imageFilename attribute of the <Page> element. This filename typically corresponds to the original scan filename in Transkribus and provides a clear reference for locating the page in question. If the imageFilename attribute is not present (which may occur in older PageXML exports), the tool falls back to using the XML filename stem.

Output Generation

The tool attempts to generate output in Excel format using the openpyxl library. If this library is not available, it automatically switches to CSV format and notifies the user. The Excel output includes styled headers and adjusted column widths for improved readability, particularly useful when dealing with lengthy archive reference numbers in filenames.

Use Cases

Quality Control

After completing a transcription project or receiving PageXML exports from Transkribus, researchers can use this tool to verify that all expected pages have been transcribed. Empty pages may indicate scans that were inadvertently included in the export, pages that failed during HTR processing, or genuine blank folios that require documentation.

Project Planning

Before beginning detailed work with a collection, researchers can use the tool to assess the completeness of existing transcriptions. The report provides a clear overview of which pages require attention, enabling more accurate project time estimates and resource allocation.

Export Validation

When exporting PageXML from Transkribus for further processing or analysis, this tool helps ensure that the export contains the expected content. Unexpected empty pages may indicate export errors or incomplete downloads that should be addressed before proceeding with downstream analysis.

Documentation

The generated report serves as documentation of the collection's state, noting which pages contain no transcribed text. This documentation may be relevant for grant reports, publication methods sections, or archival deposit requirements.

Troubleshooting

No Collections Found

If the tool reports that no collections were found, verify that the directory structure matches the expected format. Each collection directory must contain a subdirectory named precisely page (lowercase) containing XML files. The tool does not recursively search beyond the immediate subdirectories of the base path.

XML Parsing Warnings

The tool includes error handling for malformed XML files. If warnings appear about unparseable files, the affected files may be corrupted or incomplete. Re-exporting these files from Transkribus typically resolves the issue. The tool continues processing remaining files even when individual files fail to parse.

Missing openpyxl Library

If the tool falls back to CSV output with a message about openpyxl not being available, install the library using the command provided earlier. CSV output contains the same information as Excel output and can be opened in any spreadsheet application, but lacks the formatting enhancements of the Excel version.

Unexpected Empty Pages

If the tool identifies pages as empty that you believe contain text, examine the PageXML files directly to verify their content. Some export configurations or processing workflows may create TextLine elements without Unicode content, which the tool correctly identifies as empty. Additionally, pages containing only structural elements (such as reading order definitions) without actual transcribed text are correctly identified as empty.

Technical Details

Performance Characteristics

Processing performance scales approximately linearly with the number of XML files. On modern hardware, the tool typically processes between 50 and 100 files per second, depending on file size and system I/O performance. A collection of 1000 files can be processed in under one minute.

Memory usage remains modest even for large collections, as files are processed sequentially rather than loading the entire collection into memory simultaneously. Peak memory usage is typically under 100 MB regardless of collection size.

Character Encoding

All file operations use UTF-8 encoding explicitly, ensuring correct handling of filenames containing diacritical marks, special characters, or non-ASCII Unicode symbols. The Excel output also uses UTF-8 encoding to preserve filename integrity across different operating systems and applications.

Error Handling

The tool implements comprehensive error handling to ensure robust operation across varied collection structures and PageXML formats. Malformed XML files generate warnings but do not halt processing of remaining files. Missing attributes or unexpected XML structures are handled gracefully with fallback values. This approach ensures that the tool completes its analysis even when encountering occasional problematic files.

Limitations

The tool identifies pages based solely on the presence or absence of transcribed text in Unicode elements. It does not distinguish between genuinely blank pages and pages that failed to process correctly in Transkribus. Manual review of identified empty pages is recommended to determine whether they represent expected blank folios or require transcription attention.

The tool processes each file independently and does not maintain contextual information about document structure or pagination. Multi-page documents where certain pages are intentionally blank (such as versos in single-sided documents) are treated identically to pages that lack transcription due to processing failures.

The output report lists pages in the order they are processed (alphabetical by collection and filename), not in document order. If document-sequential ordering is required, the Excel file should be sorted by the Image Filename or XML Filename columns after generation.

Integration with Other Tools

The output format is designed to facilitate integration with other analytical workflows. The Excel file can be filtered and sorted to identify patterns (such as all empty pages occurring in a particular collection), and the structured format enables programmatic post-processing if needed.

The CSV fallback format ensures compatibility with systems where Excel format handling is problematic or where further processing will be performed using command-line tools or statistical software expecting tabular data.

Licence

This tool is provided under the MIT Licence, permitting free use, modification, and distribution with attribution.

Version History

Version 1.0 (2025): Initial release with collection discovery, PageXML parsing, empty page detection, and Excel/CSV output generation.

About

Not all pages contain text. Empty sheets in books/volumes do not require transcriptions. However, sometimes an (upload) error may have occurred and one could want to check whether there might be empty pages within the PageXML that are unexpected. This script reads the pagexml-files and provides a list of pages (per document) that have no text.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages