
Reduce memory usage #18

@ragibson

Description


(Related to #8)

This script currently uses several times the input file size in memory because it:

  • Holds the entire XML tree in memory
  • Maintains multiple bookkeeping dictionaries that hold significant portions of the XML elements for deduplication

It is possible to refactor the script to use very little memory (and to scale far less with input size) by:

  • Using iterparse(..., huge_tree=True) in an initial pass to identify duplicates without loading the full tree
  • Possibly hashing XML element data to detect likely duplicates (and verifying exact matches in a later pass); a counting Bloom filter is another good option
  • Writing both the final output file (via xmlfile) and the log file incrementally
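The steps above could be sketched as a two-pass pipeline. The sketch below uses the standard library's xml.etree.ElementTree so it is self-contained; in the actual script, lxml's iterparse (which additionally accepts huge_tree=True) and etree.xmlfile would fill the same roles. The record tag name rec is a hypothetical placeholder, not from this repository:

```python
import hashlib
import xml.etree.ElementTree as ET

def element_fingerprint(elem):
    # Hash the serialized element; hash collisions are possible in
    # principle, so exact matches should be re-verified if that matters.
    return hashlib.sha256(ET.tostring(elem)).digest()

def find_duplicate_fingerprints(source, tag):
    """First pass: stream the file and count fingerprints of <tag>
    elements without ever holding the full tree in memory."""
    counts = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            fp = element_fingerprint(elem)
            counts[fp] = counts.get(fp, 0) + 1
            elem.clear()  # drop the element's children to free memory
    return {fp for fp, n in counts.items() if n > 1}

def write_deduplicated(source, dest, tag, duplicate_fps, root_tag="root"):
    """Second pass: stream records to dest incrementally, keeping only
    the first copy of each duplicated element."""
    seen = set()
    dest.write(b"<%s>" % root_tag.encode())
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            fp = element_fingerprint(elem)
            if fp in duplicate_fps:
                if fp in seen:  # a later copy of a duplicate: skip it
                    elem.clear()
                    continue
                seen.add(fp)
            dest.write(ET.tostring(elem))
            elem.clear()
    dest.write(b"</%s>" % root_tag.encode())
```

Peak memory here is one record plus one fingerprint per distinct record, rather than the whole tree plus dictionaries of element contents.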
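On the counting Bloom filter option: unlike a dictionary of per-element hashes, it caps memory at a fixed size regardless of input size, at the cost of occasional false positives. A generic sketch (class name and parameters are illustrative, not from this repository):

```python
import hashlib

class CountingBloomFilter:
    """Counting Bloom filter sketch: each item increments num_hashes
    counters in a fixed-size array.  estimate() returns the minimum of
    those counters, which can over-count due to collisions but never
    under-counts, so estimate() >= 2 flags only a *possible* duplicate;
    exact matches are then verified in a later pass."""

    def __init__(self, size=1 << 20, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = bytearray(size)  # saturating 8-bit counters

    def _indexes(self, data):
        # Derive num_hashes indexes from slices of one SHA-256 digest.
        digest = hashlib.sha256(data).digest()
        for i in range(self.num_hashes):
            yield int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.size

    def add(self, data):
        for idx in self._indexes(data):
            if self.counters[idx] < 255:  # saturate instead of overflowing
                self.counters[idx] += 1

    def estimate(self, data):
        # Upper bound on how many times `data` has been added.
        return min(self.counters[idx] for idx in self._indexes(data))
```

Serialized XML elements would be fed to add() during the first pass, and any element with estimate() >= 2 would be kept around (or re-checked) as a duplicate candidate.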

This becomes increasingly useful for larger backup files. Note that

  • This increases runtime, but initial tests suggest roughly 10 seconds of overhead per GB, which seems reasonable
  • It won't fix the failures that all known XML libraries hit on extremely large individual elements (>1 GB), but it may avoid triggering similar memory-management bugs in those libraries


Labels: enhancement (New feature or request)
