A Python tool for identifying pages without transcribed text content in Transkribus PageXML collections. The tool scans through collection directories, examines each PageXML file, and generates an Excel report listing all pages that contain no transcribed text.
During the transcription process of historical sources in Transkribus, researchers may encounter pages that should not be transcribed (blank pages, illustrations without text, severely damaged folios, or pages that were inadvertently included in the export). Once collections grow beyond a few dozen pages, manually identifying these empty pages becomes impractical. This tool addresses this challenge by automatically scanning PageXML files and identifying pages where no text has been transcribed, enabling researchers to verify whether these pages were intentionally left blank or require attention.
The tool examines the <Unicode> elements within <TextLine> elements in each PageXML file. A page is considered empty if it either contains no TextLine elements at all, or if all TextLine elements contain empty or whitespace-only Unicode content. This approach ensures that pages with any transcribed text, regardless of length, are not flagged as empty.
The tool requires Python 3.7 or higher and relies primarily on the Python standard library. For Excel output, the openpyxl library is required. If openpyxl is not available, the tool automatically falls back to CSV output format.
To install the Excel dependency, execute the following command:
pip install openpyxl --break-system-packages
The --break-system-packages flag is necessary for newer Python installations that enforce PEP 668 restrictions on system-wide package installations.
No formal installation is required beyond ensuring Python 3.7 or higher is available on the system. Download the detect_empty_pages.py script and place it in a convenient location. The script can be executed directly from any directory, and the output file will be created in the directory where the script is invoked (or in a specified output location).
The tool expects PageXML collections organised according to the following structure:
collections_base/
├── Collection_A/
│ └── page/
│ ├── 0001_archive_collection_00001.xml
│ ├── 0002_archive_collection_00002.xml
│ └── 0003_archive_collection_00003.xml
├── Collection_B/
│ └── page/
│ ├── 0001_archive_collection_00001.xml
│ └── 0002_archive_collection_00002.xml
└── Collection_C/
└── page/
├── 0001_archive_collection_00001.xml
├── 0002_archive_collection_00002.xml
└── 0003_archive_collection_00003.xml
Each collection directory must contain a subdirectory named page that houses the individual PageXML files. The script automatically discovers all collections by scanning for directories containing such page subdirectories with XML files.
To scan all collections in a base directory and generate a report:
python detect_empty_pages.py /path/to/collectionsThe tool processes all PageXML files across all collections and generates an Excel file named empty_pages.xlsx in the same directory where the detect_empty_pages.py script is located.
To save the report in a specific location or with a custom filename:
python detect_empty_pages.py /path/to/collections --output /path/to/my_report.xlsxIf the base path is not provided as a command-line argument, the tool prompts for interactive entry:
python detect_empty_pages.pyThe tool will display: Enter path to collections directory:
For use in automated workflows or when cleaner output is desired:
python detect_empty_pages.py /path/to/collections --quietThe tool accepts the following arguments and options:
Positional Arguments:
base_path: Path to the base directory containing PageXML collections (optional if interactive entry is preferred)
Optional Arguments:
-h, --help: Display help message and exit-o OUTPUT, --output OUTPUT: Specify output Excel file path (default:empty_pages.xlsxin script directory)-q, --quiet: Suppress progress output during processing
The generated Excel file contains three columns:
Collection: The name of the collection directory (e.g., "Collection_A", "0018")
Image Filename: The original image filename as recorded in the PageXML imageFilename attribute of the <Page> element. This typically reflects the scan filename (e.g., "0051_NL-ZlHCO_0003.1_0001_00051.jpg")
XML Filename: The name of the PageXML file itself (e.g., "0051_NL-ZlHCO_0003_1_0001_00051.xml")
The Excel file includes formatted headers with styling for improved readability. Column widths are automatically adjusted to accommodate typical filename lengths. If the openpyxl library is not available, the tool creates a CSV file instead, which can be opened in Excel, LibreOffice Calc, or any spreadsheet application.
A typical usage session might proceed as follows:
$ python detect_empty_pages.py ~/Documents/Resoluties/PageXML
=== Empty Page Detection Tool ===
Found 3 collection(s):
- 0018
- 0019
- 0020
Processing collection: 0018
Found 156 XML files
Processed 156/156 files
Found 3 empty page(s) in 0018
Processing collection: 0019
Found 142 XML files
Processed 142/142 files
Found 0 empty page(s) in 0019
Processing collection: 0020
Found 178 XML files
Processed 178/178 files
Found 2 empty page(s) in 0020
=== Summary ===
Total empty pages found: 5
Collections processed: 3
Generating output file...
✓ Report generated: /home/user/empty_pages.xlsx
Open the file in Excel or LibreOffice to view the resultsThe tool begins by scanning the provided base directory for valid collection structures. It iterates through all immediate subdirectories, checking each for the presence of a page subdirectory containing XML files. This approach accommodates both single-collection and multi-collection directory structures without requiring additional configuration.
Collections are processed in alphabetical order by directory name, ensuring consistent output across multiple executions.
For each PageXML file, the tool employs Python's ElementTree parser with namespace awareness. The PageXML format uses the namespace http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15, which the tool registers for consistent element lookup.
The detection algorithm examines all <TextLine> elements within each page. If no TextLine elements exist, the page is immediately classified as empty. If TextLine elements are present, the tool searches for <Unicode> elements within each TextLine. A page is considered to contain text only if at least one Unicode element contains non-whitespace characters. This approach ensures that pages with even minimal transcribed content (such as page numbers or marginalia) are not incorrectly flagged as empty.
For each identified empty page, the tool extracts the image filename from the imageFilename attribute of the <Page> element. This filename typically corresponds to the original scan filename in Transkribus and provides a clear reference for locating the page in question. If the imageFilename attribute is not present (which may occur in older PageXML exports), the tool falls back to using the XML filename stem.
The tool attempts to generate output in Excel format using the openpyxl library. If this library is not available, it automatically switches to CSV format and notifies the user. The Excel output includes styled headers and adjusted column widths for improved readability, particularly useful when dealing with lengthy archive reference numbers in filenames.
After completing a transcription project or receiving PageXML exports from Transkribus, researchers can use this tool to verify that all expected pages have been transcribed. Empty pages may indicate scans that were inadvertently included in the export, pages that failed during HTR processing, or genuine blank folios that require documentation.
Before beginning detailed work with a collection, researchers can use the tool to assess the completeness of existing transcriptions. The report provides a clear overview of which pages require attention, enabling more accurate project time estimates and resource allocation.
When exporting PageXML from Transkribus for further processing or analysis, this tool helps ensure that the export contains the expected content. Unexpected empty pages may indicate export errors or incomplete downloads that should be addressed before proceeding with downstream analysis.
The generated report serves as documentation of the collection's state, noting which pages contain no transcribed text. This documentation may be relevant for grant reports, publication methods sections, or archival deposit requirements.
If the tool reports that no collections were found, verify that the directory structure matches the expected format. Each collection directory must contain a subdirectory named precisely page (lowercase) containing XML files. The tool does not recursively search beyond the immediate subdirectories of the base path.
The tool includes error handling for malformed XML files. If warnings appear about unparseable files, the affected files may be corrupted or incomplete. Re-exporting these files from Transkribus typically resolves the issue. The tool continues processing remaining files even when individual files fail to parse.
If the tool falls back to CSV output with a message about openpyxl not being available, install the library using the command provided earlier. CSV output contains the same information as Excel output and can be opened in any spreadsheet application, but lacks the formatting enhancements of the Excel version.
If the tool identifies pages as empty that you believe contain text, examine the PageXML files directly to verify their content. Some export configurations or processing workflows may create TextLine elements without Unicode content, which the tool correctly identifies as empty. Additionally, pages containing only structural elements (such as reading order definitions) without actual transcribed text are correctly identified as empty.
Processing performance scales approximately linearly with the number of XML files. On modern hardware, the tool typically processes between 50 and 100 files per second, depending on file size and system I/O performance. A collection of 1000 files can be processed in under one minute.
Memory usage remains modest even for large collections, as files are processed sequentially rather than loading the entire collection into memory simultaneously. Peak memory usage is typically under 100 MB regardless of collection size.
All file operations use UTF-8 encoding explicitly, ensuring correct handling of filenames containing diacritical marks, special characters, or non-ASCII Unicode symbols. The Excel output also uses UTF-8 encoding to preserve filename integrity across different operating systems and applications.
The tool implements comprehensive error handling to ensure robust operation across varied collection structures and PageXML formats. Malformed XML files generate warnings but do not halt processing of remaining files. Missing attributes or unexpected XML structures are handled gracefully with fallback values. This approach ensures that the tool completes its analysis even when encountering occasional problematic files.
The tool identifies pages based solely on the presence or absence of transcribed text in Unicode elements. It does not distinguish between genuinely blank pages and pages that failed to process correctly in Transkribus. Manual review of identified empty pages is recommended to determine whether they represent expected blank folios or require transcription attention.
The tool processes each file independently and does not maintain contextual information about document structure or pagination. Multi-page documents where certain pages are intentionally blank (such as versos in single-sided documents) are treated identically to pages that lack transcription due to processing failures.
The output report lists pages in the order they are processed (alphabetical by collection and filename), not in document order. If document-sequential ordering is required, the Excel file should be sorted by the Image Filename or XML Filename columns after generation.
The output format is designed to facilitate integration with other analytical workflows. The Excel file can be filtered and sorted to identify patterns (such as all empty pages occurring in a particular collection), and the structured format enables programmatic post-processing if needed.
The CSV fallback format ensures compatibility with systems where Excel format handling is problematic or where further processing will be performed using command-line tools or statistical software expecting tabular data.
This tool is provided under the MIT Licence, permitting free use, modification, and distribution with attribution.
Version 1.0 (2025): Initial release with collection discovery, PageXML parsing, empty page detection, and Excel/CSV output generation.