
Reduce memory usage #18

@ragibson

Description


(Related to #8)

This script currently uses several times the input file size in memory because it:

  • Holds the entire XML tree in memory
  • Maintains multiple bookkeeping dictionaries that hold significant portions of the XML elements for deduplication

It is possible to refactor the script to use very little memory (and to scale far less with input size) by:

  • Using iterparse(..., huge_tree=True) in an initial pass to identify duplicates without loading the full tree
  • Possibly hashing XML element data to detect likely duplicates (and verifying exact matches in a later pass); a counting Bloom filter is another good option
  • Writing both the final output file (via xmlfile) and the log file incrementally
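The steps above could be sketched as a two-pass pipeline. The sketch below uses the standard library's xml.etree.ElementTree so it is self-contained; in the actual script, lxml's iterparse (which additionally accepts huge_tree=True) and etree.xmlfile would fill the same roles. The record tag name rec is a hypothetical placeholder, not from this repository:

```python
import hashlib
import xml.etree.ElementTree as ET

def element_fingerprint(elem):
    # Hash the serialized element; hash collisions are possible in
    # principle, so exact matches should be re-verified if that matters.
    return hashlib.sha256(ET.tostring(elem)).digest()

def find_duplicate_fingerprints(source, tag):
    """First pass: stream the file and count fingerprints of <tag>
    elements without ever holding the full tree in memory."""
    counts = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            fp = element_fingerprint(elem)
            counts[fp] = counts.get(fp, 0) + 1
            elem.clear()  # drop the element's children to free memory
    return {fp for fp, n in counts.items() if n > 1}

def write_deduplicated(source, dest, tag, duplicate_fps, root_tag="root"):
    """Second pass: stream records to dest incrementally, keeping only
    the first copy of each duplicated element."""
    seen = set()
    dest.write(b"<%s>" % root_tag.encode())
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            fp = element_fingerprint(elem)
            if fp in duplicate_fps:
                if fp in seen:  # a later copy of a duplicate: skip it
                    elem.clear()
                    continue
                seen.add(fp)
            dest.write(ET.tostring(elem))
            elem.clear()
    dest.write(b"</%s>" % root_tag.encode())
```

Peak memory here is one record plus one fingerprint per distinct record, rather than the whole tree plus dictionaries of element contents.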
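On the counting Bloom filter option: unlike a dictionary of per-element hashes, it caps memory at a fixed size regardless of input size, at the cost of occasional false positives. A generic sketch (class name and parameters are illustrative, not from this repository):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: each item increments num_hashes
    counters in a fixed-size array.  estimate() returns the minimum of
    those counters, which can over-count due to collisions but never
    under-counts, so estimate() >= 2 flags only a *possible* duplicate;
    exact matches are then verified in a later pass."""

    def __init__(self, size=1 << 20, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = bytearray(size)  # saturating 8-bit counters

    def _indexes(self, data):
        # Derive num_hashes indexes from slices of one SHA-256 digest.
        digest = hashlib.sha256(data).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, data):
        for idx in self._indexes(data):
            if self.counters[idx] < 255:  # saturate instead of overflowing
                self.counters[idx] += 1

    def estimate(self, data):
        # Upper bound on how many times `data` has been added.
        return min(self.counters[idx] for idx in self._indexes(data))
```

Serialized XML elements would be fed to add() during the first pass, and any element with estimate() >= 2 would be kept around (or re-checked) as a duplicate candidate.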

This becomes increasingly useful for larger backup files. Note that

  • This increases runtime, but initial tests suggest roughly 10 seconds of overhead per GB, which seems reasonable
  • It won't fix the failures that all known XML libraries hit on extremely large individual elements (>1 GB), but it may avoid triggering similar memory-management bugs in those libraries


Labels: enhancement (New feature or request)
