
About cache limited by memory #947

@mgautierfr


Today, the internal cache system is limited by the number of items in the cache.
And there is one cache per zim file, so opening several zim files increases the total number of items that may be cached.
On top of that, limiting by number of items is mostly useless for the user, as we don't know the memory used by each item.

After a lot of discussion, it seems that we all agree it would be better to limit the cache by memory usage.

This issue tries to list all the changes this implies in libzim.

We have to compute the size of each item.

For dirents, it is the sum of:

  • a fixed size (the dirent structure, depending on the architecture)
  • a size known at initialization (depending on the length of the url and the title)
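As a minimal sketch (the names are illustrative, not the actual libzim API), the dirent footprint could be computed once at initialization from its fixed and string-dependent parts:

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch: a dirent's memory footprint is a fixed
// structure size (arch-dependent) plus the bytes owned by the
// dynamically sized url and title strings.
struct DirentSizeEstimate {
    std::string url;
    std::string title;

    size_t memorySize() const {
        return sizeof(DirentSizeEstimate)  // fixed part
             + url.capacity()              // known at initialization
             + title.capacity();
    }
};
```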

For clusters, it is the sum of:

  • a fixed size (the cluster structure, depending on the architecture)
  • a size known at initialization (the number of offsets, depending on the number of blobs in the cluster)
  • a variable size ((pre)allocated buffers storing uncompressed data)
  • a variable size (internal memory allocated by the decompressor)

The variable size of clusters means that the memory used by items stored in the cache may increase after they have been added, without any modification of the cache itself.
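A sketch of the cluster case (again with illustrative names): part of the size is known when the cluster is opened, while the buffer and decompressor parts can grow later, so the reported size must be recomputed:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a cluster's footprint has a fixed part, a part
// known at initialization (the blob offset table), and variable parts
// (decompression buffers) that can grow after the cluster is cached.
struct ClusterSizeEstimate {
    std::vector<size_t> blobOffsets;   // known at initialization
    size_t bufferBytes = 0;            // grows while reading
    size_t decompressorBytes = 0;      // decompressor's internal memory

    size_t memorySize() const {
        return sizeof(ClusterSizeEstimate)
             + blobOffsets.capacity() * sizeof(size_t)
             + bufferBytes
             + decompressorBytes;
    }
};
```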

We have to create a global cache.

As we want to limit the cache memory size of the whole process, we want only one cache for the whole process
(or maybe only one limit shared by all caches).
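The "one limit shared by all caches" alternative could look like the following sketch (illustrative only): each per-zim cache holds a reference to a shared budget and charges or releases bytes on it.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch: a process-wide memory budget shared by all
// per-zim caches. A cache charges bytes before inserting an item and
// releases them on eviction; on failure it must evict and retry.
class SharedMemoryBudget {
    std::atomic<size_t> used_{0};
    size_t limit_;

public:
    explicit SharedMemoryBudget(size_t limitBytes) : limit_(limitBytes) {}

    // Returns true if the bytes fit in the budget.
    bool tryCharge(size_t bytes) {
        size_t cur = used_.load();
        while (cur + bytes <= limit_) {
            if (used_.compare_exchange_weak(cur, cur + bytes))
                return true;  // cur is reloaded on CAS failure
        }
        return false;
    }

    void release(size_t bytes) { used_.fetch_sub(bytes); }
    size_t used() const { return used_.load(); }
};
```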

There are two unifications to do:

  • from one cache per zim file to one global cache
  • from one cache per data type (dirent vs cluster) to one cache

So we have to extend the key from a u32 (entry_index_type or cluster_index_type) to a tuple (u8[16], u8, u32), i.e. (zim_uuid, type, index).
The key size thus goes from 4 bytes to 21 bytes (likely 24 bytes with padding).
Since the purpose of the cache is to avoid parsing a (16 bytes + title + url) sized buffer, and the key will have to be hashed and compared on every cache access, we may want to investigate the overhead a bit here.
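A possible shape for the extended key (a sketch, with a naive hash for illustration, not a proposal for the final hash function):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>

// Hypothetical sketch of the extended cache key: (zim_uuid, type, index).
// Payload is 16 + 1 + 4 = 21 bytes; the compiler will typically pad
// the struct to 24 bytes so that uint32_t stays 4-byte aligned.
struct CacheKey {
    std::array<uint8_t, 16> zimUuid;
    uint8_t type;       // e.g. 0 = dirent, 1 = cluster
    uint32_t index;     // entry_index_type or cluster_index_type

    bool operator==(const CacheKey& o) const {
        return zimUuid == o.zimUuid && type == o.type && index == o.index;
    }
};

// Naive combining hash, usable with std::unordered_map.
struct CacheKeyHash {
    size_t operator()(const CacheKey& k) const {
        size_t h = std::hash<uint32_t>{}(k.index) ^ (size_t(k.type) << 1);
        for (uint8_t b : k.zimUuid)
            h = h * 31 + b;
        return h;
    }
};
```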

If, in the future, we implement a url->dirent cache (as suggested in #438), the key itself will have to change again.

Cache eviction

There are three moments when we want to remove entries from the cache:

  • when we destroy a zim reader, we want to remove all its entries from the cache
  • when we want to add an entry and the cache is full
  • when the cache memory increases because the user is reading from a cluster

The first point is pretty easy. The other points need us to define an algorithm to select which items to remove.
A basic LRU cache eviction, removing items until we have freed enough memory, may be enough. Or not.
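The basic LRU-with-memory-limit idea can be sketched as follows (illustrative only, not the libzim implementation): each value reports its size in bytes, and inserting evicts least-recently-used entries until the total fits again.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>

// Hypothetical sketch of a memory-limited LRU cache. A real version
// would also handle entries growing in place (the variable cluster
// size) and the removal of all entries of a destroyed zim reader.
template <typename Key, typename Value>
class MemoryLruCache {
    struct Entry { Key key; Value value; size_t size; };
    std::list<Entry> lru_;  // front = most recently used
    std::unordered_map<Key, typename std::list<Entry>::iterator> index_;
    size_t limit_;
    size_t used_ = 0;

public:
    explicit MemoryLruCache(size_t limitBytes) : limit_(limitBytes) {}

    void put(const Key& key, Value value, size_t sizeBytes) {
        auto it = index_.find(key);
        if (it != index_.end()) {        // replace an existing entry
            used_ -= it->second->size;
            lru_.erase(it->second);
            index_.erase(it);
        }
        lru_.push_front({key, std::move(value), sizeBytes});
        index_[key] = lru_.begin();
        used_ += sizeBytes;
        // Evict from the tail until we are back under the limit
        // (but never evict the entry we just inserted).
        while (used_ > limit_ && lru_.size() > 1) {
            auto& victim = lru_.back();
            used_ -= victim.size;
            index_.erase(victim.key);
            lru_.pop_back();
        }
    }

    const Value* get(const Key& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        lru_.splice(lru_.begin(), lru_, it->second);  // mark as MRU
        return &it->second->value;
    }

    size_t usedBytes() const { return used_; }
};
```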


Out of scope:

This issue is only about limiting the memory usage.
Our cache system also contains other caches which (mainly) cause issues other than memory.
FastDirentLookup and the yet-to-be-implemented xapian preloading (#617) consume time at zim creation; #946 may still be necessary to configure them.

This will of course not limit the memory usage of the whole process. Opening too many zims and reading from them means that we have to keep at least several clusters in memory (not necessarily in the cache). On low-memory devices this can be a problem. This is not new and pretty rare; it is just that limiting the cache will not magically fix it.
