
Implement chunk eviction#47

Merged
johnramsden merged 28 commits into main from sam/chunk-eviction-fix on Sep 11, 2025

Conversation

2a46m4 (Collaborator) commented Aug 20, 2025

This PR (re)-implements chunk eviction.

The original approach special-cased the write-back of valid chunks during chunk eviction. The write-back occurred on the same zone, and the zone list had a special function that would accept a zone with a non-full number of chunks to be returned. This approach was somewhat error-prone and also caused a lot of issues with too many active zones, as opening the zone to be re-written to wasn't accounted for in the zone_list's bookkeeping.

With the approach in this PR, the reader part is mostly the same, but the writer part now queues the valid chunks onto the writer pool. This allows the zone_list to reclaim the zone while still keeping track of the number of active zones properly. Since the zone goes from full to empty when being reset, the number of active zones is not increased. Instead, the active-zone count is only increased when the zone_list actually hands out a chunk to be written to.

There is some change in behaviour. Valid chunks from a reset zone are not immediately written back and might take a while to get through the write_pool, even if they are active. This can be a problem if they are very active. Since the valid chunks go through the write_pool path, they will update their slot in the eviction policy, and the cache measurements will also account for them.

One thing to note is that eviction no longer blocks when calling clean_zone_and_update_map. This avoids what would presumably be a deadlock: if clean_zone_and_update_map blocked during eviction, and the cache ran out of space again while writing back the chunks, the cache would have to evict again to make space for those chunks. However, only one eviction can occur at once, so a deadlock would occur. We spawn green threads to handle the write-back so that the eviction thread can return immediately and further eviction can proceed.
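A minimal sketch of this non-blocking pattern, using a plain OS thread and channel as stand-ins for the real tokio green threads and writer pool (the `ValidChunk` type and signature here are hypothetical, not the PR's actual API):

```rust
use std::sync::mpsc;
use std::thread;

// Hypothetical stand-in for the PR's valid-chunk data.
struct ValidChunk { zone: u32 }

fn clean_zone_and_update_map(write_pool: mpsc::Sender<ValidChunk>, valid: Vec<ValidChunk>) {
    // Spawn instead of blocking: if the writer path triggers another
    // eviction because the cache fills up again, that eviction can run
    // because this function has already returned.
    thread::spawn(move || {
        for chunk in valid {
            write_pool.send(chunk).ok();
        }
    });
}

fn main() {
    let (tx, rx) = mpsc::channel();
    clean_zone_and_update_map(tx, vec![ValidChunk { zone: 7 }]);
    // The call returned immediately; the chunk arrives asynchronously.
    assert_eq!(rx.recv().unwrap().zone, 7);
}
```

The key property is that the evicting thread never waits on the write-back completing, so a second eviction triggered by the write-back cannot deadlock against the first.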

I don't have measurements, but this might be slightly slower than the original implementation, because it has to go through the write_pool path.

@johnramsden (Owner)

I think we need to find a solution for this:

> There is some change in behaviour. Valid chunks from a reset zone are not immediately written back and might take a while to get through the write_pool, even if they are active. This can be a problem if they are very active. Since the valid chunks go through the write_pool path, they will update their slot in the eviction policy, and the cache measurements will also account for them.

Perhaps by passing in some additional information that says not to update the LRU.
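One way this suggestion could look, as a sketch (the `WriteRequest`/`Policy` names and fields are assumptions for illustration, not the repo's types): tag each write-back request with a flag so the eviction policy skips the LRU update for chunks re-queued by eviction itself.

```rust
#[derive(Debug, PartialEq)]
struct ChunkLocation { zone: u32, index: u32 }

struct WriteRequest {
    loc: ChunkLocation,
    // false for eviction write-backs, true for normal client writes
    update_lru: bool,
}

struct Policy { lru: Vec<ChunkLocation> } // stand-in for the real LRU

impl Policy {
    fn on_write(&mut self, req: WriteRequest) {
        if req.update_lru {
            self.lru.push(req.loc);
        }
        // ...the actual write-back would happen either way...
    }
}

fn main() {
    let mut policy = Policy { lru: Vec::new() };
    // Normal client write: refreshes the LRU.
    policy.on_write(WriteRequest { loc: ChunkLocation { zone: 2, index: 1 }, update_lru: true });
    // Eviction write-back: skips the LRU update.
    policy.on_write(WriteRequest { loc: ChunkLocation { zone: 2, index: 2 }, update_lru: false });
    assert_eq!(policy.lru.len(), 1);
}
```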

2a46m4 added 9 commits August 20, 2025 17:56
We need to use an aligned vector because chunk eviction needs to read valid chunks from the device and then write them back. The write-back expects an aligned vector.
The RWLock allows multiple threads to append to the zone at the same
time if the chunk size is larger than the zone append size, and only
one thread to append if it's smaller. In addition, since the policy
doesn't check whether the zone is full before evicting it, we can
actually evict zones that are still being written to. Thus we need to
hold the lock while resetting the zone to make sure that we have
exclusive access to the zone.
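The locking described in this commit can be sketched with `std::sync::RwLock` (the `Zone` struct here is a simplified stand-in for the real zone state): appenders share the read lock, and a reset takes the write lock so it waits for all in-flight appends.

```rust
use std::sync::RwLock;

struct Zone {
    // Stand-in for the zone's append state; the real code manages a write pointer.
    state: RwLock<Vec<u8>>,
}

impl Zone {
    fn append(&self, data: &[u8]) {
        // Shared lock: multiple appenders may run concurrently.
        let _guard = self.state.read().unwrap();
        // ...issue the zone-append I/O here...
        let _ = data;
    }

    fn reset(&self) {
        // Exclusive lock: blocks until all in-flight appends finish, so
        // eviction can't reset a zone out from under an active writer.
        let mut state = self.state.write().unwrap();
        state.clear();
    }
}

fn main() {
    let zone = Zone { state: RwLock::new(vec![1, 2, 3]) };
    zone.append(&[4]);
    zone.reset();
    assert!(zone.state.read().unwrap().is_empty());
}
```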
Continuing from the previous commit, if the zone being reset is an open zone, this isn't actually a logic error, since we can evict zones that are not full. In practice this is rare but still possible. This is fine as long as we remove it and then put it back in the free zone list.
When evicting, we need to remove stale elements from the LRU queue. An
example:

LRU (Zone, Chunk) state:
[(1, 1), (1, 2), (2, 1), (1, 4), (3, 3)]

We choose to evict (1, 1) and (1, 2):
[(2, 1), (1, 4), (3, 3)]

(1, 4) is stale, because we reset the entire zone. This applies
whether we write in-place or not.
[(2, 1), (1, 4) { STALE }, (3, 3)]

So we filter out all entries in zone 1.
[(2, 1), (3, 3)]

When we write it back to the device, we need to update the
LRU. Suppose we write it back to (2, 2). Then the LRU will look like:
[(2, 1), (3, 3), (2, 2)]
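The example above can be run as a small sketch (plain `(zone, chunk)` tuples standing in for the real `ChunkLocation` type):

```rust
// Drop every LRU entry whose zone was reset, since its location is stale.
fn filter_stale(lru: Vec<(u32, u32)>, evicted_zone: u32) -> Vec<(u32, u32)> {
    lru.into_iter().filter(|(z, _)| *z != evicted_zone).collect()
}

fn main() {
    // State after choosing (1,1) and (1,2) for eviction:
    let lru = vec![(2, 1), (1, 4), (3, 3)];
    // (1,4) is stale because zone 1 was reset, so drop every zone-1 entry.
    let mut lru = filter_stale(lru, 1);
    assert_eq!(lru, vec![(2, 1), (3, 3)]);
    // The write-back lands at (2,2) and re-enters the LRU at the MRU end.
    lru.push((2, 2));
    assert_eq!(lru, vec![(2, 1), (3, 3), (2, 2)]);
}
```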
Need to spawn it on the runtime.
2a46m4 (Collaborator, Author) commented Aug 24, 2025

Implemented a few fixes

Comment thread oxcache/src/device.rs
Comment thread oxcache/src/eviction.rs
Comment on lines +249 to +266
```rust
let mut clean_targets = self.pq.remove_if_thresh_met();
clean_targets.sort_unstable();

// Search in the LRU for items which exist in the clean
// targets. These are valid chunks, but must be removed
// because their location is invalidated when the zone is
// reset.
let new_lru_list = self.lru.iter().rev().filter(|lru_item| {
    clean_targets.binary_search(&lru_item.0.zone).is_err()
}).map(|(k, _)| k.clone()).collect::<Vec<ChunkLocation>>();

self.lru.clear();

for k in new_lru_list {
    self.lru.put(k, ());
}

clean_targets
```
johnramsden (Owner):
Isn't this expensive? Have to repopulate the whole lru every time we call get targets?

2a46m4 (Collaborator, Author):

Yes, it's more of a temporary solution to get it working before I try anything fancy. A better solution would be to store the chunks in a hash map.
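A sketch of what that hash-map alternative might look like (the `ZoneIndex` type is hypothetical, not code from the PR): keep a secondary index from zone to chunk entries, so invalidating a reset zone is proportional to that zone's entries instead of rebuilding the whole LRU.

```rust
use std::collections::HashMap;

// Hypothetical secondary index: zone -> chunk indices currently in the LRU.
struct ZoneIndex {
    by_zone: HashMap<u32, Vec<u32>>,
}

impl ZoneIndex {
    fn new() -> Self {
        Self { by_zone: HashMap::new() }
    }

    fn insert(&mut self, zone: u32, chunk: u32) {
        self.by_zone.entry(zone).or_default().push(chunk);
    }

    // Returns the chunks whose LRU slots must be dropped when the zone resets.
    fn invalidate_zone(&mut self, zone: u32) -> Vec<u32> {
        self.by_zone.remove(&zone).unwrap_or_default()
    }
}

fn main() {
    let mut index = ZoneIndex::new();
    index.insert(1, 1);
    index.insert(1, 4);
    index.insert(2, 1);
    // Resetting zone 1 removes exactly its entries, without a full LRU rebuild.
    assert_eq!(index.invalidate_zone(1), vec![1, 4]);
    assert_eq!(index.invalidate_zone(3), Vec::<u32>::new());
}
```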

Comment thread oxcache/src/zone_state/zone_list.rs
Eviction doesn't occur on Cortes when we have enough throughput, but it does work with low throughput.

We were stuck looping on "cache is full".

The issue comes down to eviction not being able to keep up with all the requests when we drain the channel, so it doesn't do enough eviction. Basically, the issue occurs if you have enough throughput that we have to do multiple eviction attempts until there is enough space for all the channels, and only then drain.

Once I fixed that, eviction does occur; however, now we get a validation error: invalidated read response. So I assume some data is being written back incorrectly or something. Potentially reads occurring from old data? Likely an issue with the map.
@johnramsden (Owner)

> Eviction doesn't occur on Cortes when we have enough throughput, but it does work with low throughput.
>
> We were stuck looping on "cache is full".
>
> The issue comes down to eviction not being able to keep up with all the requests when we drain the channel, so it doesn't do enough eviction. Basically, the issue occurs if you have enough throughput that we have to do multiple eviction attempts until there is enough space for all the channels, and only then drain.
>
> Once I fixed that, eviction does occur; however, now we get a validation error: invalidated read response. So I assume some data is being written back incorrectly or something. Potentially reads occurring from old data? Likely an issue with the map.

Pushed a patch for eviction, now we need to resolve the correctness.

Config used:

```toml
log_level = "debug"

[server]
socket = "/tmp/oxcache.sock"
disk = "/dev/nvme0n2"
writer_threads = 14
reader_threads = 14
chunk_size = 536870912
block_zone_capacity = 1129316352
max_write_size = 262144
max_zones = 20

[remote]
remote_type = "emulated" # emulated | S3
bucket = "S3_BUCKET"
remote_artificial_delay_microsec = 40632

[eviction]
eviction_policy = "chunk"
high_water_evict = 16 # Number remaining from end, evicts if reaches here
low_water_evict = 32  # Evict until below mark
eviction_interval = 1 # Evict every 1s
high_water_clean = 16 # Number remaining from end, cleans if reaches here (only used with chunk)
low_water_clean = 8   # Clean until below mark (only used with chunk)

[metrics]
ip_addr = "127.0.0.1"
port = 9000
file_metrics_directory = "./logs"
```
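For reference, the watermark semantics implied by the config comments can be sketched like this (an assumed reading of high/low watermarks, not the repo's actual eviction code): eviction kicks in when the number of free chunks falls to the high mark and runs until the low mark's worth of space is free again.

```rust
// Assumed watermark rule: evict nothing above the high mark; at or below
// it, free up enough chunks to reach the low mark.
fn chunks_to_evict(free: u32, high_water: u32, low_water: u32) -> u32 {
    if free <= high_water {
        low_water.saturating_sub(free)
    } else {
        0
    }
}

fn main() {
    // With high_water_evict = 16 and low_water_evict = 32 from the config:
    assert_eq!(chunks_to_evict(16, 16, 32), 16); // at the high mark: evict
    assert_eq!(chunks_to_evict(20, 16, 32), 0);  // above it: do nothing
}
```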

@johnramsden (Owner)

Validation failure occurs here:

```rust
debug_assert_eq!(u64::from_be_bytes(buffer), key_hash);
```
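A minimal reconstruction of what that check verifies, assuming the chunk's first 8 bytes carry the big-endian key hash (hash value below is made up for illustration):

```rust
// A read is validated by comparing the stored big-endian key hash
// against the expected one; a read served from a reset or rewritten
// zone returns other bytes and fails this check.
fn validate(buffer: [u8; 8], key_hash: u64) -> bool {
    u64::from_be_bytes(buffer) == key_hash
}

fn main() {
    let key_hash: u64 = 0xDEAD_BEEF_CAFE_F00D;
    let buffer = key_hash.to_be_bytes();
    assert!(validate(buffer, key_hash));
    // Stale data fails, which is what the debug_assert_eq! above catches.
    assert!(!validate([0u8; 8], key_hash));
}
```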

Comment thread oxcache/src/eviction.rs Outdated
Comment thread oxcache/src/eviction.rs Outdated
2a46m4 (Collaborator, Author) commented Aug 27, 2025

There is a difference between the cache map and the eviction policy LRU queue which caused eviction to loop endlessly. This was because eviction was failing silently when evicted chunks were not found in the cache map. That, in turn, was caused by read LRU updates occurring after the zone was reset, so the chunk was registered in the LRU even though it wasn't active anymore.

I fixed the eviction looping by not registering any read LRU updates if the chunk is not in the queue, but the original problem, i.e. reads after resets, is still there. This causes chunk validation assertions to fail. I can think of two possible areas where that could be happening: either a race condition caused a read to slip through while eviction was occurring, or there is some buggy behaviour around multiple evictions occurring at the same time (this can happen because the clean-zones path is non-blocking).
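The looping fix described here can be sketched as a touch-only LRU update (the `Lru` type below is a simplified stand-in for the policy's queue): a read refreshes an entry only if it still exists, and never (re)inserts one, so a read racing a zone reset can't resurrect a stale slot.

```rust
use std::collections::HashMap;

// Simplified stand-in for the eviction policy's LRU bookkeeping.
struct Lru {
    slots: HashMap<(u32, u32), u64>, // (zone, chunk) -> last-touch tick
    tick: u64,
}

impl Lru {
    fn touch_if_present(&mut self, loc: (u32, u32)) -> bool {
        self.tick += 1;
        match self.slots.get_mut(&loc) {
            Some(t) => {
                *t = self.tick; // live chunk: refresh its recency
                true
            }
            None => false, // chunk's zone was reset: ignore the read update
        }
    }
}

fn main() {
    let mut lru = Lru { slots: HashMap::from([((2, 1), 0)]), tick: 0 };
    assert!(lru.touch_if_present((2, 1)));  // live chunk: refreshed
    assert!(!lru.touch_if_present((1, 4))); // reset zone: not re-registered
    assert_eq!(lru.slots.len(), 1);
}
```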

sam cheng and others added 2 commits August 28, 2025 19:51
We had a race condition due to the inconsistency of the bucket map and the underlying data.

The bucket map would have had to be held during reads and writes to avoid this race condition.

Instead, implement a "pin counter"; this allows us to check if the data is currently being used, and we only evict when it reaches zero.

The issue is the inconsistency between the entry and the bucket map. By doing it this way, any intermediate locations will remain correct until they are done reading.

If they read while we are in the process of evicting, they will wait on the waiting state.
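The pin counter described in this commit can be sketched with an atomic counter (a minimal stand-in, not the repo's actual entry type): readers pin the entry for the duration of the read, and eviction only proceeds once the count returns to zero.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Minimal stand-in for a cache entry guarded by a pin counter.
struct Entry {
    pins: AtomicUsize,
}

impl Entry {
    fn pin(&self) {
        self.pins.fetch_add(1, Ordering::AcqRel); // reader starts using the chunk
    }
    fn unpin(&self) {
        self.pins.fetch_sub(1, Ordering::AcqRel); // reader is done
    }
    fn can_evict(&self) -> bool {
        self.pins.load(Ordering::Acquire) == 0
    }
}

fn main() {
    let entry = Entry { pins: AtomicUsize::new(0) };
    entry.pin();
    assert!(!entry.can_evict()); // an active reader blocks eviction
    entry.unpin();
    assert!(entry.can_evict()); // now safe to reset the zone
}
```

In the real code, eviction would presumably also have to mark the entry as evicting so new readers wait rather than re-pinning it indefinitely, matching the "waiting state" mentioned above.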
johnramsden merged commit a06c89a into main on Sep 11, 2025
2 checks passed
johnramsden deleted the sam/chunk-eviction-fix branch on September 11, 2025 01:16