
Conversation

@sudeepdino008
Member

@sudeepdino008 sudeepdino008 commented Oct 7, 2025

  • the domain file cache today is rotx-level: each rotx gets its own cache.
  • this change ties the domain cache to domain.visibleFiles; the same cache is shared across all rotx (a minimal sketch of the pattern follows below)
    • a new cache only needs to be created when visibleFiles change
    • even after recalcVisibleFiles runs, ongoing domainRotx keep using the older cache, which is the correct behavior.
  • this should enable greater cache reuse. The cache lives "as long as no new file is built/merged", which happens every 1 step - a significant period.
  • not using a pool here because:
    • existing rotx using the cache should stay safe; reusing caches via a pool could rug-pull an active reader and is unsafe.
    • a new cache is created rarely (every 1 step, which is approx. 16 hours on mainnet)...so we can rely on the GC to collect the old ones.
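A minimal sketch of the pattern, with simplified and assumed names (`visibleFile`, `fileCache`, `newFileCache`, `beginRotx` are illustrative, not the actual Erigon identifiers):

```go
package domaincache

import "sync"

// Illustrative stand-ins for the real types.
type visibleFile struct{ path string }

type fileCache struct{ m sync.Map } // stands in for the real LRU

func newFileCache() *fileCache { return &fileCache{} }

// domain ties one cache to one generation of visibleFiles.
type domain struct {
	mu           sync.RWMutex
	visibleFiles []visibleFile // immutable snapshot, replaced wholesale on recalc
	cache        *fileCache    // shared by every rotx opened against this snapshot
}

// recalcVisibleFiles swaps in a new file list together with a fresh cache.
// Rotx opened earlier keep the old *fileCache they captured, so they never
// observe the new files or a cleared cache - no invalidation is needed.
func (d *domain) recalcVisibleFiles(files []visibleFile) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.visibleFiles = files
	d.cache = newFileCache() // the old cache is left for the GC once old rotx finish
}

// beginRotx captures the current snapshot; no per-rotx cache allocation.
func (d *domain) beginRotx() *domainRotx {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return &domainRotx{files: d.visibleFiles, cache: d.cache}
}

type domainRotx struct {
	files []visibleFile
	cache *fileCache
}
```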

@sudeepdino008 sudeepdino008 changed the title persist domain cache across domain_rotx wip: persist domain cache across domain_rotx Oct 7, 2025
@AskAlexSharov
Collaborator

github.com/elastic/go-freelru is not thread-safe

@AskAlexSharov
Collaborator

Main problem of D_LRU:

  • size of account value is << size of Code value
  • CommitmentDomain shows near-zero hit-rate
  • if we tie 1 LRU to visibleFiles, then the LRU needs a mutex (because the LRU is mutable), and this mutex will be shared across all RPC requests (and other reads) - maybe good, maybe bad; need to measure latency/throughput

@sudeepdino008
Member Author

sudeepdino008 commented Oct 7, 2025

github.com/elastic/go-freelru is not thread-safe

DomainGetFromFileCache uses an RWMutex and so is thread-safe. But we don't even need that - github.com/elastic/go-freelru also has a ShardedLRU, which should be better than a single RWMutex lock because it uses sharding...
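For reference, a minimal sketch of how a ShardedLRU from go-freelru could be wired up (the capacity, key/value types and the xxhash-based hash function are illustrative choices, not what this PR does):

```go
package main

import (
	"fmt"

	"github.com/cespare/xxhash/v2"
	"github.com/elastic/go-freelru"
)

// hashKey maps a key to a uint32, as go-freelru's HashKeyCallback requires.
func hashKey(k string) uint32 {
	return uint32(xxhash.Sum64String(k))
}

func main() {
	// ShardedLRU splits the keyspace across internally locked shards,
	// so concurrent readers/writers contend far less than with one RWMutex.
	cache, err := freelru.NewSharded[string, []byte](8192, hashKey)
	if err != nil {
		panic(err)
	}

	cache.Add("code:0xdeadbeef", []byte("bytecode..."))
	if v, ok := cache.Get("code:0xdeadbeef"); ok {
		fmt.Printf("hit: %d bytes\n", len(v))
	}
}
```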

size of account value is << size of Code value

I agree. We do set a smaller (1/10th) limit for the code cache.
What we could also do: ONLY keep the level information in the code cache (no code value), and then go directly to that level. Though that just skips the kvei, which is not that big.
If we can store level + offset in the kv in the code cache (with some changes to the BpsTree API), we can bypass the bps lookup as well.

Maybe this cache can exist alongside a much smaller "value-storing code cache". Dunno, this needs more benchmarking. Code domain reads are the most time-consuming.

if we tie 1 LRU to visibleFiles, then the LRU needs a mutex (because the LRU is mutable), and this mutex will be shared across all RPC requests (and other reads) - maybe good, maybe bad; need to measure latency/throughput

ShardedLRU should be nice here. Agreed that we would need some RPC latency/throughput measurements with this. I can check the rpctest benches. They compare geth vs erigon, but maybe I can just take geth vs erigon1 and geth vs erigon2,
and then compare the erigon1 vs erigon2 numbers. Do you have anything else in mind? Any other existing bench I can use?

@AskAlexSharov
Collaborator

I agree. We do set a smaller (1/10th) limit for the code cache.

Problem is: 1 Code value can be 50kb.

@AskAlexSharov
Collaborator

ShardedLRU - okay. But we still need to bench? Because right now we don't have mutexes at all.

If we can store level + offset: we already store the level in domainGetFromFileCacheItem. Switching to offset - yes.

@AskAlexSharov
Collaborator

@sudeepdino008 FYI: the QA team also has an RPC throughput monitoring suite: https://monitoring.erigon.io/d/ddqiwbfvrgwlcd/erigonqa?orgId=1&from=now-30d&to=now&timezone=browser
(I don't know much about it yet)

@AskAlexSharov
Collaborator

Also: rpctest can produce a file for vegeta.

@sudeepdino008
Member Author

sudeepdino008 commented Oct 14, 2025

the throughput actually gets better for QPS=10k (remains kind of the same for lower QPS)

used run_perf_tests.py; full report here.

for QPS=10k

in this branch:
p99 ~ 6.3s

in main:
p99 ~ 100s


  • i don't think we need the level+offset optimization for code -- even with the small size, it has good cache hit ratio (for chaintip/perftest workload)
  • disabled cache for commitment
  • use smaller size for code AND rcache

@sudeepdino008 sudeepdino008 changed the title wip: persist domain cache across domain_rotx persist domain cache across domain_rotx Oct 14, 2025
Collaborator

@AskAlexSharov AskAlexSharov left a comment

Will review tomorrow

@AskAlexSharov
Collaborator

the throughput actually gets better for QPS=10k (remains kind of the same for lower QPS)

used run_perf_tests.py; full report here.

for QPS=10k

in this branch: p99 ~ 6.3s

in main: p99 ~ 100s

  • i don't think we need the level+offset optimization for code -- even with the small size, it has good cache hit ratio (for chaintip/perftest workload)
  • disabled cache for commitment
  • use smaller size for code AND rcache

In the full report I don't see 100s in main.

@sudeepdino008
Member Author

the throughput actually gets better for QPS=10k (remains kind of the same for lower QPS)
used run_perf_tests.py; full report here.
for QPS=10k
in this branch: p99 ~ 6.3s
in main: p99 ~ 100s

  • i don't think we need the level+offset optimization for code -- even with the small size, it has good cache hit ratio (for chaintip/perftest workload)
  • disabled cache for commitment
  • use smaller size for code AND rcache

In the full report I don't see 100s in main.

oh my bad; I had a bad flag set in rpcdaemon - https://gist.github.com/sudeepdino008/525df078c2566765e08c6f6a94d65378

they both show similar perf for 10k and 100k QPS

@AskAlexSharov
Collaborator

@sudeepdino008 please disable http compression in rpcd - maybe it's a bottleneck now.

@sudeepdino008
Member Author

@sudeepdino008 please disable http compression in rpcd - maybe it's a bottleneck now.

added results with http.compression=false (towards the end) in https://gist.github.com/sudeepdino008/525df078c2566765e08c6f6a94d65378

seems like similar perf again.

Contributor

@mh0lt mh0lt left a comment

I'm not sure you have this cache in the right place. From the caller of the shared domains' perspective, I think you want to hit the cache before you attempt a db and then a file look-up.

I also think we want to be able to gather metrics at the domain level for overall cache size and hit rate. Maybe we want different behavior for exec vs rpc. We actually don't know yet, which is why we need Grafana-based metrics.

Bear in mind that from an exec-time perspective we want to measure puts & gets from the application's perspective. Although macro-level cache stats are interesting, they are not necessarily the most useful.
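One possible shape for such per-domain, per-caller cache metrics (a sketch using prometheus/client_golang purely for illustration; Erigon has its own metrics wiring, and the metric names here are made up):

```go
package cachemetrics

import "github.com/prometheus/client_golang/prometheus"

// Per-domain cache counters, labeled by domain (accounts, storage, code, commitment)
// and by caller ("exec", "rpc") so exec vs RPC behavior can be compared on one graph.
var (
	cacheHits = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "domain_file_cache_hits_total", Help: "Domain file cache hits."},
		[]string{"domain", "caller"},
	)
	cacheMisses = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "domain_file_cache_misses_total", Help: "Domain file cache misses."},
		[]string{"domain", "caller"},
	)
)

func init() {
	prometheus.MustRegister(cacheHits, cacheMisses)
}

// RecordGet is called from the cache lookup path.
func RecordGet(domain, caller string, hit bool) {
	if hit {
		cacheHits.WithLabelValues(domain, caller).Inc()
		return
	}
	cacheMisses.WithLabelValues(domain, caller).Inc()
}
```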

@mh0lt
Contributor

mh0lt commented Oct 15, 2025

Also, I think if we are going down this road we should be using: https://pkg.go.dev/weak

It will help with memory management, as it means the GC can be used to clean up the cache.
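A minimal sketch of the idea (requires Go 1.24+ for the weak and runtime.AddCleanup APIs; the map-based cache and its names are illustrative only):

```go
package weakcache

import (
	"runtime"
	"sync"
	"weak"
)

// weakCache stores weak pointers, so the GC is free to reclaim values
// that nothing else references; lookups then simply miss.
type weakCache[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]weak.Pointer[V]
}

func newWeakCache[K comparable, V any]() *weakCache[K, V] {
	return &weakCache[K, V]{m: make(map[K]weak.Pointer[V])}
}

func (c *weakCache[K, V]) Add(k K, v *V) {
	c.mu.Lock()
	c.m[k] = weak.Make(v)
	c.mu.Unlock()
	// Drop the map entry once the value itself is collected (simplified:
	// ignores the case where the key was re-added with a newer value).
	runtime.AddCleanup(v, func(key K) {
		c.mu.Lock()
		delete(c.m, key)
		c.mu.Unlock()
	}, k)
}

func (c *weakCache[K, V]) Get(k K) (*V, bool) {
	c.mu.RLock()
	p, ok := c.m[k]
	c.mu.RUnlock()
	if !ok {
		return nil, false
	}
	v := p.Value() // nil if the GC already reclaimed it
	return v, v != nil
}
```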

@AskAlexSharov
Collaborator

AskAlexSharov commented Oct 16, 2025

I'm not sure you have this cache in the right place. From the caller of the shared domains' perspective, I think you want to hit the cache before you attempt a db and then a file look-up.

Such users work with biz-logic and real applications: Exec, RPC, P2P, etc... all these things need different caches (different queries). All these caches will suffer from invalidation bugs.

A cache on immutable files is a very different animal - it doesn't need invalidation (just drop the cache when new files become available).

A cache on top of MDBX usually just starves, because MDBX's PageCache is already an LRU.

So, in my head:

  • a good biz-logic-level cache: BlocksLRU, ReceiptsLRU, etc... where we store complex objects after complex calculations: reading various tables, calculating txn.Hash(), etc...
  • a good low-level cache: a cache that relies on knowledge of "how the data is stored on disk", which can store pointers into files (MMAP) encoded as a (level u8, offset u64) tuple instead of raw pointers, and do other low-level tricks, etc... Because biz-logic doesn't need all this knowledge of "have we read the value from mdbx or from files, from compressed files or from uncompressed files, etc..." (a rough sketch of such an entry follows below)
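A rough sketch of what such a low-level cache entry could look like (all names here are illustrative; the real domainGetFromFileCacheItem and the file getters differ):

```go
package lowlevelcache

// fileCacheItem is an illustrative low-level cache entry: instead of caching
// the value (or a raw pointer into the MMAP'd file), it remembers *where* the
// value lives, so a later read can jump straight to the right file and offset
// and skip the btree/bps search entirely.
type fileCacheItem struct {
	level  uint8  // which visible file (by level) holds the key
	offset uint64 // offset of the key/value pair inside that file
}

// lookup shows how a hit would be used: go directly to the file identified by
// level and read the value at offset. readAt stands in for the real
// decompressor/getter.
func lookup(cache map[string]fileCacheItem, readAt func(level uint8, offset uint64) []byte, key string) ([]byte, bool) {
	it, ok := cache[key]
	if !ok {
		return nil, false // fall back to the normal bps/btree search path
	}
	return readAt(it.level, it.offset), true
}
```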

@AskAlexSharov
Collaborator

Several levels of caching are their own evil for benchmark reproducibility. But I'm interested in what doors open for us after moving 95% of the data to immutable files.

@sudeepdino008
Member Author

  • a cache on top of files is a good idea; among all tiers, the file read is the slowest. It also transfers well to other "applications" which don't use shared domains.
  • file reads (even with this cache) still take the most time (db+cache is at the sub-1-microsecond level, while a file read is ~8 microseconds, measured while doing stage_exec of 3000 blocks).
  • if we do have a file-level cache, then it's unlikely that an sd read cache will reap much benefit, since the other part (db) is pretty fast as well.

application-specific metrics: I like the idea, but the only way to do this seems to be via running different processes. Otherwise, in a single process running both rpc and exec, for example, if we have a low-level cache like the file cache, then one app can impact the (common low-level) cache contents and the metrics aren't clean at that point.

@AskAlexSharov
Collaborator

@sudeepdino008 please disable http compression in rpcd - maybe it's a bottleneck now.

added results with http.compression=false (towards the end) in https://gist.github.com/sudeepdino008/525df078c2566765e08c6f6a94d65378

seems like similar perf again.

Does it mean: "the perf of the 2 D_LRU implementations is similar", or "the bottleneck is completely elsewhere in both cases"?

@mh0lt
Contributor

mh0lt commented Oct 20, 2025

  • a cache on top of files is a good idea; among all tiers, the file read is the slowest. It also transfers well to other "applications" which don't use shared domains.
  • file reads (even with this cache) still take the most time (db+cache is at the sub-1-microsecond level, while a file read is ~8 microseconds, measured while doing stage_exec of 3000 blocks).
  • if we do have a file-level cache, then it's unlikely that an sd read cache will reap much benefit, since the other part (db) is pretty fast as well.

application-specific metrics: I like the idea, but the only way to do this seems to be via running different processes. Otherwise, in a single process running both rpc and exec, for example, if we have a low-level cache like the file cache, then one app can impact the (common low-level) cache contents and the metrics aren't clean at that point.

This doesn't need several processes - it just needs the metrics incorporated in the right place in the app, which I'm doing for execution; we should be doing the same for the RPC layer.

I think the point here is that we need to see how each layer is impacted by the other. I personally don't see a great deal of value in running individual benchmarks: they are useful for comparative purposes in development but don't give a good picture of the overall workload.

I think your current benchmarking is a bit unrealistic. stage_exec does not really operate on its own, and what we really need to do is test the cache behaviour under realistic load conditions. The typical numbers I see when testing the db are 8-12 micros. This is significantly more than you are seeing.

What I'm saying is that I don't agree with your assertion that the db is fast; this is only the case on a well-resourced machine. I really think we should be aiming for stable behaviour on loaded machines - which means avoiding all page-accessing layers, which is both db + files. I agree that the files are generally worse than the db - but both can be bad by an order of magnitude compared to memory access.

@mh0lt
Contributor

mh0lt commented Oct 20, 2025

Several levels of caching are their own evil for benchmark reproducibility. But I'm interested in what doors open for us after moving 95% of the data to immutable files.

I'm ok with a multi-layer cache approach, but then I think we need to provide access to the lower-layer cache for status and metrics purposes - otherwise we either duplicate cached data or under-use the high-level cache.

I think the overall problem we have at the moment is that we tend to put all our optimization in one place - the DB layer. I think that is too simplistic an approach. We need to optimize in both directions.

@sudeepdino008
Member Author

@sudeepdino008 please disable http compression in rpcd - maybe it's a bottleneck now.

added results with http.compression=false (towards the end) in https://gist.github.com/sudeepdino008/525df078c2566765e08c6f6a94d65378
seems like similar perf again.

Does it mean: "the perf of the 2 D_LRU implementations is similar", or "the bottleneck is completely elsewhere in both cases"?

I was just eyeballing. Let me do a detailed post...

RPC Throughput/Latency:

  • for QPS=10k, domain_cache is better. Note that this is the highest QPS value (in the default case) we run run_perf_tests.py for.
  • for QPS=100k, we start to hit a bottleneck and latency is worse. This is probably where the ShardedLRU bottleneck starts to creep in.

Mgas/sec:
goes up 7-10%

serial exec:   361 -> 387 mgas/s
parallel exec: 407 -> 458 mgas/s

===========================================
PERFORMANCE COMPARISON: domain_cache vs main
===========================================

QPS: 10,000

Metric   domain_cache (mean)   main (mean)   Diff %    Winner
p50      226.24us              232.88us      +2.9%     domain_cache ✓
p90      467.53us              789.17us      +68.8%    domain_cache ✓
p95      846.45us              2.23ms        +163.9%   domain_cache ✓
p99      29.48ms               29.97ms       +1.6%     domain_cache ✓
max      133.98ms              164.44ms      +22.7%    domain_cache ✓

============================================
QPS: 100,000

Metric   domain_cache (mean)   main (mean)   Diff %    Winner
p50      18.493s               15.475s       -16.3%    main ✓
p90      32.591s               28.906s       -11.3%    main ✓
p95      34.059s               30.541s       -10.3%    main ✓
p99      35.289s               31.853s       -9.7%     main ✓
max      35.911s               32.489s       -9.5%     main ✓

=============================================
SUMMARY

At 10,000 QPS (lower latency scenario):
• Both branches perform similarly for p50 (~224-226us)
• domain_cache shows better p90-p95 latencies in most runs
• domain_cache has slightly better p99 and max latencies

At 100,000 QPS (high load scenario):
• main shows better p50 performance (15-18s vs 19-24s)
• main has consistently better p90-p99 latencies
• main handles high load better overall

@AskAlexSharov
Collaborator

tnx for detailed analysis
