
Conversation

@sudeepdino008
Member

@sudeepdino008 sudeepdino008 commented Oct 7, 2025

  • the domain file cache today is rotx-level: each rotx gets its own cache.
  • this change ties the domain cache to domain.visibleFiles; the same cache is shared across all rotx (a minimal sketch of the pattern follows below)
    • a new cache only needs to be created when visibleFiles change
    • even after recalcVisibleFiles runs, ongoing domainRotx keep using the older cache, which is the correct behavior.
  • this should enable greater cache reuse. The cache lives "as long as no new file is built/merged", which happens every 1 step - a significant period.
  • not using a pool here because:
    • existing rotx using the cache should stay safe; reusing caches via a pool could rug-pull an active reader and is unsafe.
    • a new cache is created rarely (every 1 step, which is approx. 16 hours on mainnet)...so we can rely on the GC to collect the old ones.
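A minimal sketch of the pattern, with simplified and assumed names (`visibleFile`, `fileCache`, `newFileCache`, `beginRotx` are illustrative, not the actual Erigon identifiers):

```go
package domaincache

import "sync"

// Illustrative stand-ins for the real types.
type visibleFile struct{ path string }

type fileCache struct{ m sync.Map } // stands in for the real LRU

func newFileCache() *fileCache { return &fileCache{} }

// domain ties one cache to one generation of visibleFiles.
type domain struct {
	mu           sync.RWMutex
	visibleFiles []visibleFile // immutable snapshot, replaced wholesale on recalc
	cache        *fileCache    // shared by every rotx opened against this snapshot
}

// recalcVisibleFiles swaps in a new file list together with a fresh cache.
// Rotx opened earlier keep the old *fileCache they captured, so they never
// observe the new files or a cleared cache - no invalidation is needed.
func (d *domain) recalcVisibleFiles(files []visibleFile) {
	d.mu.Lock()
	defer d.mu.Unlock()
	d.visibleFiles = files
	d.cache = newFileCache() // the old cache is left for the GC once old rotx finish
}

// beginRotx captures the current snapshot; no per-rotx cache allocation.
func (d *domain) beginRotx() *domainRotx {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return &domainRotx{files: d.visibleFiles, cache: d.cache}
}

type domainRotx struct {
	files []visibleFile
	cache *fileCache
}
```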

@sudeepdino008 sudeepdino008 changed the title persist domain cache across domain_rotx wip: persist domain cache across domain_rotx Oct 7, 2025
@AskAlexSharov
Collaborator

github.com/elastic/go-freelru is not thread-safe

@AskAlexSharov
Collaborator

Main problem of D_LRU:

  • size of account value is << size of Code value
  • CommitmentDomain shows near-zero hit-rate
  • if we tie 1 LRU to visibleFiles, then the LRU needs a mutex (because the LRU is mutable), and this mutex will be shared across all RPC requests (and other reads) - maybe good, maybe bad; need to measure latency/throughput

@sudeepdino008
Member Author

sudeepdino008 commented Oct 7, 2025

github.com/elastic/go-freelru is not thread-safe

DomainGetFromFileCache uses an RWMutex and so is thread-safe. But we don't even need that - github.com/elastic/go-freelru also has a ShardedLRU, which should be better than a single RWMutex lock because it uses sharding...
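For reference, a minimal sketch of how a ShardedLRU from go-freelru could be wired up (the capacity, key/value types and the xxhash-based hash function are illustrative choices, not what this PR does):

```go
package main

import (
	"fmt"

	"github.com/cespare/xxhash/v2"
	"github.com/elastic/go-freelru"
)

// hashKey maps a key to a uint32, as go-freelru's HashKeyCallback requires.
func hashKey(k string) uint32 {
	return uint32(xxhash.Sum64String(k))
}

func main() {
	// ShardedLRU splits the keyspace across internally locked shards,
	// so concurrent readers/writers contend far less than with one RWMutex.
	cache, err := freelru.NewSharded[string, []byte](8192, hashKey)
	if err != nil {
		panic(err)
	}

	cache.Add("code:0xdeadbeef", []byte("bytecode..."))
	if v, ok := cache.Get("code:0xdeadbeef"); ok {
		fmt.Printf("hit: %d bytes\n", len(v))
	}
}
```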

size of account value is << size of Code value

I agree. We do set a smaller (1/10th) limit for the code cache.
What we could also do: ONLY keep the level information in the code cache (no code value), and then go directly to that level. Though that just skips the kvei, which is not that big.
If we can store level + offset in the kv in the code cache (with some changes to the BpsTree API), we can bypass the bps lookup as well.

Maybe this cache can exist alongside a much smaller "value-storing code cache". Dunno, this needs more benchmarking. Code domain reads are the most time-consuming.

if we tie 1 LRU to visibleFiles, then the LRU needs a mutex (because the LRU is mutable), and this mutex will be shared across all RPC requests (and other reads) - maybe good, maybe bad; need to measure latency/throughput

ShardedLRU should be nice here. Agreed that we would need some RPC latency/throughput measurements with this. I can check the rpctest benches. They compare geth vs erigon, but maybe I can just take geth vs erigon1 and geth vs erigon2,
and then compare the erigon1 vs erigon2 numbers. Do you have anything else in mind? Any other existing bench I can use?

@AskAlexSharov
Collaborator

I agree. We do set a smaller (1/10th) limit for the code cache.

Problem is: 1 Code value can be 50kb.

@AskAlexSharov
Collaborator

ShardedLRU - okay. But we still need to bench? Because right now we don't have mutexes at all.

If we can store level + offset: we already store the level in domainGetFromFileCacheItem. Switching to offset - yes.

@AskAlexSharov
Collaborator

@sudeepdino008 FYI: the QA team also has an RPC throughput monitoring suite: https://monitoring.erigon.io/d/ddqiwbfvrgwlcd/erigonqa?orgId=1&from=now-30d&to=now&timezone=browser
(I don't know much about it yet)

@AskAlexSharov
Collaborator

Also: rpctest can produce a file for vegeta.

@sudeepdino008
Member Author

sudeepdino008 commented Oct 14, 2025

the throughput actually gets better for QPS=10k (remains kind of the same for lower QPS)

used run_perf_tests.py; full report here.

for QPS=10k

in this branch:
p99 ~ 6.3s

in main:
p99 ~ 100s


  • i don't think we need the level+offset optimization for code -- even with the small size, it has good cache hit ratio (for chaintip/perftest workload)
  • disabled cache for commitment
  • use smaller size for code AND rcache

@sudeepdino008 sudeepdino008 changed the title wip: persist domain cache across domain_rotx persist domain cache across domain_rotx Oct 14, 2025
Collaborator

@AskAlexSharov AskAlexSharov left a comment

Will review tomorrow

@AskAlexSharov
Collaborator

the throughput actually gets better for QPS=10k (remains kind of the same for lower QPS)

used run_perf_tests.py; full report here.

for QPS=10k

in this branch: p99 ~ 6.3s

in main: p99 ~ 100s

  • i don't think we need the level+offset optimization for code -- even with the small size, it has good cache hit ratio (for chaintip/perftest workload)
  • disabled cache for commitment
  • use smaller size for code AND rcache

In the full report I don't see 100s in main.

@sudeepdino008
Member Author

the throughput actually gets better for QPS=10k (remains kind of the same for lower QPS)
used run_perf_tests.py; full report here.
for QPS=10k
in this branch: p99 ~ 6.3s
in main: p99 ~ 100s

  • i don't think we need the level+offset optimization for code -- even with the small size, it has good cache hit ratio (for chaintip/perftest workload)
  • disabled cache for commitment
  • use smaller size for code AND rcache

In the full report I don't see 100s in main.

oh my bad; I had a bad flag set in rpcdaemon - https://gist.github.com/sudeepdino008/525df078c2566765e08c6f6a94d65378

they both show similar perf for 10k and 100k QPS

@AskAlexSharov
Collaborator

@sudeepdino008 please disable http compression in rpcd - maybe it's a bottleneck now.

@sudeepdino008
Member Author

@sudeepdino008 please disable http compression in rpcd - maybe it's a bottleneck now.

added results with http.compression=false (towards the end) in https://gist.github.com/sudeepdino008/525df078c2566765e08c6f6a94d65378

seems like similar perf again.

Contributor

@mh0lt mh0lt left a comment

I'm not sure you have this cache in the right place. From the caller of the shared domains' perspective, I think you want to hit the cache before you attempt a db and then a file look-up.

I also think we want to be able to gather metrics at the domain level for overall cache size and hit rate. Maybe we want different behavior for exec vs rpc. We actually don't know yet, which is why we need Grafana-based metrics.

Bear in mind that from an exec-time perspective we want to measure puts & gets from the application's perspective. Although macro-level cache stats are interesting, they are not necessarily the most useful.
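One possible shape for such per-domain, per-caller cache metrics (a sketch using prometheus/client_golang purely for illustration; Erigon has its own metrics wiring, and the metric names here are made up):

```go
package cachemetrics

import "github.com/prometheus/client_golang/prometheus"

// Per-domain cache counters, labeled by domain (accounts, storage, code, commitment)
// and by caller ("exec", "rpc") so exec vs RPC behavior can be compared on one graph.
var (
	cacheHits = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "domain_file_cache_hits_total", Help: "Domain file cache hits."},
		[]string{"domain", "caller"},
	)
	cacheMisses = prometheus.NewCounterVec(
		prometheus.CounterOpts{Name: "domain_file_cache_misses_total", Help: "Domain file cache misses."},
		[]string{"domain", "caller"},
	)
)

func init() {
	prometheus.MustRegister(cacheHits, cacheMisses)
}

// RecordGet is called from the cache lookup path.
func RecordGet(domain, caller string, hit bool) {
	if hit {
		cacheHits.WithLabelValues(domain, caller).Inc()
		return
	}
	cacheMisses.WithLabelValues(domain, caller).Inc()
}
```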

@mh0lt
Contributor

mh0lt commented Oct 15, 2025

Also, I think if we are going down this road we should be using: https://pkg.go.dev/weak

It will help with memory management, as it means the GC can be used to clean up the cache.
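A minimal sketch of the idea (requires Go 1.24+ for the weak and runtime.AddCleanup APIs; the map-based cache and its names are illustrative only):

```go
package weakcache

import (
	"runtime"
	"sync"
	"weak"
)

// weakCache stores weak pointers, so the GC is free to reclaim values
// that nothing else references; lookups then simply miss.
type weakCache[K comparable, V any] struct {
	mu sync.RWMutex
	m  map[K]weak.Pointer[V]
}

func newWeakCache[K comparable, V any]() *weakCache[K, V] {
	return &weakCache[K, V]{m: make(map[K]weak.Pointer[V])}
}

func (c *weakCache[K, V]) Add(k K, v *V) {
	c.mu.Lock()
	c.m[k] = weak.Make(v)
	c.mu.Unlock()
	// Drop the map entry once the value itself is collected (simplified:
	// ignores the case where the key was re-added with a newer value).
	runtime.AddCleanup(v, func(key K) {
		c.mu.Lock()
		delete(c.m, key)
		c.mu.Unlock()
	}, k)
}

func (c *weakCache[K, V]) Get(k K) (*V, bool) {
	c.mu.RLock()
	p, ok := c.m[k]
	c.mu.RUnlock()
	if !ok {
		return nil, false
	}
	v := p.Value() // nil if the GC already reclaimed it
	return v, v != nil
}
```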

@AskAlexSharov
Collaborator

AskAlexSharov commented Oct 16, 2025

I'm not sure you have this cache in the right place. From the caller of the shared domains' perspective, I think you want to hit the cache before you attempt a db and then a file look-up.

Such users work with biz-logic and real applications: Exec, RPC, P2P, etc... all these things need different caches (different queries). All these caches will suffer from invalidation bugs.

A cache on immutable files is a very different animal - it doesn't need invalidation (just drop the cache when new files become available).

A cache on top of MDBX usually just starves, because MDBX's PageCache is already an LRU.

So, in my head:

  • a good biz-logic-level cache: BlocksLRU, ReceiptsLRU, etc... where we store complex objects after complex calculations: reading various tables, calculating txn.Hash(), etc...
  • a good low-level cache: a cache that relies on knowledge of "how the data is stored on disk", which can store pointers into files (MMAP) encoded as a (level u8, offset u64) tuple instead of raw pointers, and do other low-level tricks, etc... Because biz-logic doesn't need all this knowledge of "have we read the value from mdbx or from files, from compressed files or from uncompressed files, etc..." (a rough sketch of such an entry follows below)
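A rough sketch of what such a low-level cache entry could look like (all names here are illustrative; the real domainGetFromFileCacheItem and the file getters differ):

```go
package lowlevelcache

// fileCacheItem is an illustrative low-level cache entry: instead of caching
// the value (or a raw pointer into the MMAP'd file), it remembers *where* the
// value lives, so a later read can jump straight to the right file and offset
// and skip the btree/bps search entirely.
type fileCacheItem struct {
	level  uint8  // which visible file (by level) holds the key
	offset uint64 // offset of the key/value pair inside that file
}

// lookup shows how a hit would be used: go directly to the file identified by
// level and read the value at offset. readAt stands in for the real
// decompressor/getter.
func lookup(cache map[string]fileCacheItem, readAt func(level uint8, offset uint64) []byte, key string) ([]byte, bool) {
	it, ok := cache[key]
	if !ok {
		return nil, false // fall back to the normal bps/btree search path
	}
	return readAt(it.level, it.offset), true
}
```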

@AskAlexSharov
Collaborator

Several levels of caching are their own evil for benchmark reproducibility. But I'm interested in what doors open for us after moving 95% of the data to immutable files.

@sudeepdino008
Member Author

  • a cache on top of files is a good idea; among all tiers, the file read is the slowest. It also transfers well to other "applications" which don't use shared domains.
  • file reads (even with this cache) still take the most time (db+cache is at the sub-1-microsecond level, while a file read is ~8 microseconds, measured while doing stage_exec of 3000 blocks).
  • if we do have a file-level cache, then it's unlikely that an sd read cache will reap much benefit, since the other part (db) is pretty fast as well.

application-specific metrics: I like the idea, but the only way to do this seems to be via running different processes. Otherwise, in a single process running both rpc and exec, for example, if we have a low-level cache like the file cache, then one app can impact the (common low-level) cache contents and the metrics aren't clean at that point.

@AskAlexSharov
Collaborator

@sudeepdino008 please disable http compression in rpcd - maybe it's a bottleneck now.

added results with http.compression=false (towards the end) in https://gist.github.com/sudeepdino008/525df078c2566765e08c6f6a94d65378

seems like similar perf again.

Does it mean: "the perf of the 2 D_LRU implementations is similar", or "the bottleneck is completely elsewhere in both cases"?

@mh0lt
Contributor

mh0lt commented Oct 20, 2025

  • a cache on top of files is a good idea; among all tiers, the file read is the slowest. It also transfers well to other "applications" which don't use shared domains.
  • file reads (even with this cache) still take the most time (db+cache is at the sub-1-microsecond level, while a file read is ~8 microseconds, measured while doing stage_exec of 3000 blocks).
  • if we do have a file-level cache, then it's unlikely that an sd read cache will reap much benefit, since the other part (db) is pretty fast as well.

application-specific metrics: I like the idea, but the only way to do this seems to be via running different processes. Otherwise, in a single process running both rpc and exec, for example, if we have a low-level cache like the file cache, then one app can impact the (common low-level) cache contents and the metrics aren't clean at that point.

This doesn't need several processes - it just needs the metrics incorporated in the right place in the app, which I'm doing for execution; we should be doing the same for the RPC layer.

I think the point here is that we need to see how each layer is impacted by the other. I personally don't see a great deal of value in running individual benchmarks: they are useful for comparative purposes in development but don't give a good picture of the overall workload.

I think your current benchmarking is a bit unrealistic. stage_exec does not really operate on its own, and what we really need to do is test the cache behaviour under realistic load conditions. The typical numbers I see when testing the db are 8-12 micros. This is significantly more than you are seeing.

What I'm saying is that I don't agree with your assertion that the db is fast; this is only the case on a well-resourced machine. I really think we should be aiming for stable behaviour on loaded machines - which means avoiding all page-accessing layers, which is both db + files. I agree that the files are generally worse than the db - but both can be bad by an order of magnitude compared to memory access.

@mh0lt
Contributor

mh0lt commented Oct 20, 2025

Several levels of caching are their own evil for benchmark reproducibility. But I'm interested in what doors open for us after moving 95% of the data to immutable files.

I'm ok with a multi-layer cache approach, but then I think we need to provide access to the lower-layer cache for status and metrics purposes - otherwise we either duplicate cached data or under-use the high-level cache.

I think the overall problem we have at the moment is that we tend to put all our optimization in one place - the DB layer. I think that is too simplistic an approach. We need to optimize in both directions.

@sudeepdino008
Member Author

@sudeepdino008 please disable http compression in rpcd - maybe it's a bottleneck now.

added results with http.compression=false (towards the end) in https://gist.github.com/sudeepdino008/525df078c2566765e08c6f6a94d65378
seems like similar perf again.

Does it mean: "the perf of the 2 D_LRU implementations is similar", or "the bottleneck is completely elsewhere in both cases"?

I was just eyeballing. Let me do a detailed post...

RPC Throughput/Latency:

  • for QPS=10k, domain_cache is better. Note that this is the highest QPS value (in the default case) we run run_perf_tests.py for.
  • for QPS=100k, we start to hit a bottleneck and latency is worse. This is probably where the ShardedLRU bottleneck starts to creep in.

Mgas/sec:
goes up 7-10%

serial exec:   361 -> 387 mgas/s
parallel exec: 407 -> 458 mgas/s

===========================================
PERFORMANCE COMPARISON: domain_cache vs main
===========================================

QPS: 10,000

Metric   domain_cache (mean)   main (mean)   Diff %    Winner
p50      226.24us              232.88us      +2.9%     domain_cache ✓
p90      467.53us              789.17us      +68.8%    domain_cache ✓
p95      846.45us              2.23ms        +163.9%   domain_cache ✓
p99      29.48ms               29.97ms       +1.6%     domain_cache ✓
max      133.98ms              164.44ms      +22.7%    domain_cache ✓

============================================
QPS: 100,000

Metric   domain_cache (mean)   main (mean)   Diff %    Winner
p50      18.493s               15.475s       -16.3%    main ✓
p90      32.591s               28.906s       -11.3%    main ✓
p95      34.059s               30.541s       -10.3%    main ✓
p99      35.289s               31.853s       -9.7%     main ✓
max      35.911s               32.489s       -9.5%     main ✓

=============================================
SUMMARY

At 10,000 QPS (lower latency scenario):
• Both branches perform similarly for p50 (~224-226us)
• domain_cache shows better p90-p95 latencies in most runs
• domain_cache has slightly better p99 and max latencies

At 100,000 QPS (high load scenario):
• main shows better p50 performance (15-18s vs 19-24s)
• main has consistently better p90-p99 latencies
• main handles high load better overall

@AskAlexSharov
Collaborator

tnx for detailed analysis
