Open
Labels: enhancement (New feature or request)
Description
(Related to #8)
This script currently uses several times the input file size in memory because it:
- Holds the entire XML tree in memory
- Maintains multiple bookkeeping dictionaries containing significant portions of XML elements for deduplication purposes
It is possible to refactor the script to use very little memory (and not scale as much with input size) by:
- Using `iterparse(..., huge_tree=True)` in an initial pass to identify duplicates without loading the full tree
  - Possibly hashing XML element data to detect likely duplicates (and verifying exact matches in a later pass). A counting Bloom filter is another good option
- Writing both the final output file (via `xmlfile`) and the log file incrementally
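
To make the steps above concrete, here is a minimal two-pass sketch. It is not the script's actual code, and it rests on several assumptions: it uses the standard library's `xml.etree.ElementTree.iterparse` as a stand-in for lxml's `iterparse(..., huge_tree=True)`, writes the output with plain file writes as a stand-in for lxml's `xmlfile`, treats every direct child of the root as one record, and assumes the root tag has no namespace. The helper names (`canonical`, `fingerprint`, `iter_records`, `read_root`, `dedupe`) are hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import Counter
from xml.sax.saxutils import quoteattr


def canonical(elem):
    """Hashable view of an element that ignores attribute order and tail text."""
    return (
        elem.tag,
        tuple(sorted(elem.attrib.items())),
        (elem.text or "").strip(),
        tuple(canonical(child) for child in elem),
    )


def fingerprint(elem):
    """16-byte digest of the canonical form; equal digests mean a likely duplicate."""
    return hashlib.blake2b(repr(canonical(elem)).encode(), digest_size=16).digest()


def iter_records(path):
    """Yield each direct child of the root, then detach it so memory stays flat."""
    depth, root = 0, None
    with open(path, "rb") as source:
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                depth += 1
                if root is None:
                    root = elem
            else:
                depth -= 1
                if depth == 1:  # elem's parent is the root
                    yield elem
                    root.remove(elem)


def read_root(path):
    """Return the root tag and attributes from the first 'start' event."""
    with open(path, "rb") as source:
        for _, elem in ET.iterparse(source, events=("start",)):
            return elem.tag, dict(elem.attrib)


def dedupe(src, dst, log):
    # Pass 1: keep one small digest per distinct record, never the records.
    counts = Counter(fingerprint(rec) for rec in iter_records(src))
    suspects = {fp for fp, n in counts.items() if n > 1}
    del counts

    # Assumption: the root tag is not namespaced, so it can be written verbatim.
    root_tag, root_attrib = read_root(src)
    attrs = "".join(f" {k}={quoteattr(v)}" for k, v in root_attrib.items())

    # Pass 2: stream records out; compare full content only for suspects.
    written = {}  # digest -> canonical forms already written for that digest
    removed = 0
    with open(dst, "w", encoding="utf-8") as out, \
            open(log, "w", encoding="utf-8") as log_file:
        out.write(f'<?xml version="1.0" encoding="utf-8"?>\n<{root_tag}{attrs}>\n')
        for rec in iter_records(src):
            fp = fingerprint(rec)
            if fp in suspects:
                form = canonical(rec)
                forms = written.setdefault(fp, [])
                if form in forms:  # verified exact duplicate, not a hash collision
                    removed += 1
                    log_file.write(f"removed duplicate <{rec.tag}> {dict(rec.attrib)}\n")
                    continue
                forms.append(form)
            rec.tail = "\n"
            out.write(ET.tostring(rec, encoding="unicode"))
        out.write(f"</{root_tag}>\n")
    return removed
```

In this sketch the first pass still stores one digest per distinct record, so memory grows with the number of records, just far more slowly than when holding the elements themselves. Full canonical forms are retained only for records that actually occur more than once.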
This becomes increasingly useful for larger backup files. Note that:
- This increases runtime, but initial tests suggest ~10 seconds per GB, which feels reasonable
- It won't fix the issue with all known XML libraries failing on extremely large elements (>1 GB), but it may resolve similar memory-management bugs in those libraries
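
For the counting Bloom filter option mentioned in the list of steps, a minimal standard-library sketch could look like the following. The class name, the default sizes, and the choice of salted BLAKE2b as the hash family are all assumptions for illustration. The filter uses a fixed amount of memory regardless of input size, and `count()` returns an upper bound on how often an item was added (it can overestimate because of collisions, and it saturates at 255), so any record with a count of 2 or more is only a likely duplicate that still needs exact verification in a later pass.

```python
import hashlib
from array import array


class CountingBloomFilter:
    """Fixed-size approximate counter for byte strings (hypothetical helper)."""

    def __init__(self, num_counters=1 << 20, num_hashes=4):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        # One 8-bit counter per slot, all starting at zero.
        self.counters = array("B", bytes(num_counters))

    def _slots(self, item):
        # Derive num_hashes slot indexes by salting BLAKE2b differently each time.
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item, digest_size=8, salt=i.to_bytes(2, "little")
            ).digest()
            yield int.from_bytes(digest, "little") % self.num_counters

    def add(self, item):
        for slot in self._slots(item):
            if self.counters[slot] < 255:  # saturate instead of overflowing
                self.counters[slot] += 1

    def count(self, item):
        """Upper bound on how many times item was added (below saturation)."""
        return min(self.counters[slot] for slot in self._slots(item))
```

A first pass would call `add()` with each record's serialized bytes or digest, and a second pass would treat records whose `count()` is at least 2 as candidates for exact comparison.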