Commit 332d6d5

Merge pull request #59 from mdeegen/master
Create doc folder and add immutability options
2 parents 4c072fe + 45a0537 commit 332d6d5

7 files changed: +4392 -1 lines changed

doc/ImumutabilityOptions.md

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
## Immutability Options

When creating a dataset, you can choose between the immutability options `copy` and `pickle`. If you work with multiple processes in parallel, each forked child process initially shares its entire memory space with the main process.

Usually the data of a dataset consists of many Python objects, e.g. `dict`, `list`, and so on. When a process reads or merely touches a value, the reference counter of the object is incremented. This is a write to the memory page that holds the object, so the page is copied. In effect, the Linux "copy-on-write" behaviour becomes "copy-on-read" for Python objects. Therefore, when the processes access the dataset, it is loaded into RAM multiple times.
While the `copy` option gives each process a complete copy of the dataset, the `pickle` option stores the data as compact byte streams, so the memory usage is lower than with `copy`.

This was originally observed in https://ppwwyyxx.com/blog/2022/Demystify-RAM-Usage-in-Multiprocess-DataLoader/, and the blog post contains more details about it.
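The "copy-on-read" effect described above comes from the fact that reading a Python object changes its reference counter. The following minimal sketch illustrates this with a toy dataset. The data and variable names are made up for illustration, and the snippet does not show how the dataset options are implemented.

```python
import pickle
import sys

# Toy dataset made of ordinary Python objects (illustrative data only).
dataset = [{"id": i, "text": f"example {i}"} for i in range(3)]

example = dataset[0]
before = sys.getrefcount(example)

# "Reading" the example by binding another name to it increments the
# reference counter, which is stored inside the object itself. This is a
# write to the memory page that holds the object.
alias = dataset[0]
after = sys.getrefcount(example)
print(after - before)

# A pickled example is a single bytes object. Unpickling creates fresh
# objects, so the stored byte stream itself is never modified by readers.
blob = pickle.dumps(example)
restored = pickle.loads(blob)
print(restored == example, restored is example)
```

In a forked child process, the same counter update lands on a page that was shared with the parent, which forces the operating system to copy that page.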

For datasets created from a list, you can also choose a third option, `wu`, which stores the dataset in memory as NumPy arrays instead of many Python objects. This way the "copy-on-read" effect is not triggered, and the child processes can access the dataset from the main process without copying the data.
With the `wu` option, the processes therefore share the dataset of the main process instead of each holding its own copy in RAM. The dataset does not need to be loaded into RAM multiple times, and the memory usage is reduced, as shown in the diagrams below.
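The idea of keeping a list in NumPy arrays can be sketched as follows, in the spirit of the approach described in the linked blog post. `SerializedList` is a hypothetical helper written for this illustration. The actual implementation behind the `wu` option may differ.

```python
import pickle

import numpy as np


class SerializedList:
    """Hypothetical helper: keeps picklable items in two NumPy arrays."""

    def __init__(self, items):
        # Pickle every item and view the resulting bytes as uint8 arrays.
        blobs = [
            np.frombuffer(pickle.dumps(item, protocol=-1), dtype=np.uint8)
            for item in items
        ]
        # End offset of every item inside the flat byte buffer.
        self._ends = np.cumsum([len(b) for b in blobs], dtype=np.int64)
        # One contiguous buffer instead of many small Python objects.
        self._data = np.concatenate(blobs) if blobs else np.zeros(0, dtype=np.uint8)

    def __len__(self):
        return len(self._ends)

    def __getitem__(self, index):
        if not -len(self) <= index < len(self):
            raise IndexError(index)
        index %= len(self)
        start = 0 if index == 0 else int(self._ends[index - 1])
        end = int(self._ends[index])
        # Unpickling builds a fresh object for the caller; the shared
        # buffer is only read, and its reference counter lives in the
        # array object rather than next to every example.
        return pickle.loads(self._data[start:end].tobytes())


items = [{"id": 0, "text": "a"}, {"id": 1, "tokens": [1, 2, 3]}, "plain string"]
serialized = SerializedList(items)
print(len(serialized), serialized[1])
```

Because the examples live in two large arrays, reading an example touches only a handful of Python object headers, so almost all pages holding the data stay shared between the processes.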

![Memory Usage with copy](copy.svg)
<div align="center">
Immutable_warranty: copy
</div>

![Memory Usage with pickle](pickle.svg)
<div align="center">
Immutable_warranty: pickle
</div>

![Memory Usage with wu](wu.svg)
<div align="center">
Immutable_warranty: wu
</div>
<br>

Here, the displayed USS (Unique Set Size) is the sum of the USS values of all workers. The USS of one worker is the amount of RAM that is unique to that process and not shared with other processes.

"Shared" in the diagrams represents the mean amount of RAM per process that is shared with other processes.

The displayed PSS (Proportional Set Size) is the sum of the PSS values of all processes. The PSS of one process is all the memory that the process holds in RAM, i.e. the sum of its USS and its shared memory, where the shared memory is divided by the number of processes sharing it.
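The relationship between USS, shared memory, and PSS can be made concrete with a small calculation. The function name and all numbers below are hypothetical and are not taken from the diagrams.

```python
def proportional_set_size(uss, shared_regions):
    """PSS of one process.

    uss: memory unique to the process.
    shared_regions: iterable of (size, number_of_sharing_processes) pairs.
    """
    return uss + sum(size / n_sharers for size, n_sharers in shared_regions)


# Hypothetical numbers in MB: 4 workers, each with 100 MB of unique
# memory, all sharing one 800 MB dataset.
n_workers = 4
uss_per_worker = 100
pss_per_worker = proportional_set_size(uss_per_worker, [(800, n_workers)])

total_uss = n_workers * uss_per_worker
total_pss = n_workers * pss_per_worker
print(total_uss, total_pss)
```

In this example, the summed PSS equals the summed USS plus the shared dataset counted once, which is why the summed PSS is a reasonable estimate of the RAM that all workers occupy together.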

The `pickle` option compresses the data well and reduces the memory used by a single worker.
For multiple processes, the `wu` option can be more stable, since the data does not need to be loaded into RAM multiple times and the memory usage stays roughly constant over time.
