This comment was marked as off-topic.
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
861368a to 46832f3
## Non-goals

- **Maintaining local filesystem cache**: Global cache is cloud storage only
For what it's worth, it's actually quite easy to support local filesystems because nf-cloudcache works out-of-the-box with them. We only disallow it in the runtime, but #6100 re-allows it to help with testing.
The only consideration is handling race conditions. I assume you could just use regular file locks.
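A minimal sketch of the file-lock idea, written in Python for illustration and assuming a POSIX host where `fcntl.flock` is available. The helper names (`cache_entry_lock`, `publish_if_absent`), the `.lock` suffix, and the one-file-per-hash layout are hypothetical and are not taken from nf-cloudcache.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def cache_entry_lock(lock_path):
    """Hold an exclusive advisory lock on lock_path while the block runs."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        fcntl.flock(fd, fcntl.LOCK_EX)  # blocks until other holders release
        yield
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)


def publish_if_absent(cache_dir, task_hash, payload):
    """Write a cache entry once; return True only for the call that created it."""
    entry = os.path.join(cache_dir, task_hash)
    with cache_entry_lock(entry + ".lock"):
        if os.path.exists(entry):
            return False  # another run already published this result
        with open(entry, "wb") as f:
            f.write(payload)
        return True
```

Advisory locks only protect against processes that take the same lock, and their behaviour on network filesystems varies, so this would need checking on whatever shared storage is used for testing.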
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
I'd like to explore an approach based on File Content Sketch + Bloom Filter. This should allow avoiding re-hashing large files by caching
Components:
Flow:
I will have a look at the file sketches. The trick with this solution is to find a sampling scheme with a very low chance of sketch collisions, since a collision would produce a false positive in the global cache.
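The Bloom filter half of the proposal has its own false-positive behaviour, separate from sketch collisions: a lookup can report "maybe present" for a key that was never added, but never "absent" for one that was. Below is a minimal sketch in Python, assuming the standard sizing formulas and double hashing over a SHA-256 digest. The class and method names are illustrative and not part of any Nextflow API.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate):
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(
            -expected_items * math.log(false_positive_rate) / (math.log(2) ** 2)))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: derive all k bit positions from one SHA-256 digest.
        digest = hashlib.sha256(key).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force a non-zero step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False means definitely absent; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))
```

Because a positive answer is only probabilistic, a hit in the filter would still need to be confirmed against the real cache entry before a task result is reused.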
Worth giving a try!
I looked deeper into the proposed solution:
Not really, the S3 API allows access to arbitrary file chunks
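To make the chunk-access point concrete, here is a hedged sketch of a sampled content sketch in Python. It assumes the storage backend is hidden behind a hypothetical `read_range(offset, length)` callable; for S3 this would presumably be a ranged GET, and for local files a `seek` plus `read`. The sample count, chunk size, and the choice of SHA-256 are illustrative assumptions, not values from the proposal.

```python
import hashlib


def content_sketch(read_range, size, num_samples=8, chunk_size=64 * 1024):
    """Hash the file size plus evenly spaced chunks instead of the whole file."""
    if num_samples < 2:
        raise ValueError("num_samples must be at least 2")
    h = hashlib.sha256()
    h.update(size.to_bytes(8, "big"))  # files of different size never collide
    if size <= num_samples * chunk_size:
        h.update(read_range(0, size))  # small file: hash everything
        return h.hexdigest()
    # Spread samples so the first starts at 0 and the last ends near the tail.
    stride = (size - chunk_size) // (num_samples - 1)
    for i in range(num_samples):
        h.update(read_range(i * stride, chunk_size))
    return h.hexdigest()


def local_reader(path):
    """read_range implementation for a local file (stand-in for a ranged GET)."""
    def read_range(offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
    return read_range
```

The weakness raised earlier in the thread shows up directly: two files of equal size that differ only in bytes outside the sampled chunks produce the same sketch, so the sampling scheme decides how likely a false positive is.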
Add ADR for Global Cache Feature
This PR introduces an Architecture Decision Record (ADR) for a global cache system that enables cross-pipeline task result sharing in Nextflow.
Overview
The global cache extends Nextflow's existing task caching mechanism to allow different pipeline executions to reuse computational results across:
Key Design Decisions
Architecture:
nf-cloudcache plugin infrastructure
Content-Addressable Hashing:
sessionId and processName from task hash computation
Concurrency Control:
Optimizations Considered:
lid) reuse to reduce checksum computations
Implementation Phases
Phase 0 (Proof of Concept - #6100):
Future Phases:
Trade-offs and Limitations
Related Issues
Status: Draft design document for discussion and feedback
Version: 1.0
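As an illustration of the content-addressable hashing decision described above, the following Python sketch leaves `sessionId` and `processName` out of the hashed input so that identical work in different runs maps to the same key. The other field names (`script`, `container`, `inputs`), the JSON canonicalisation, and SHA-256 are assumptions made for the example; Nextflow's actual task hashing works differently.

```python
import hashlib
import json

# Fields that identify a run or a process name rather than the work itself.
EXCLUDED_KEYS = {"sessionId", "processName"}


def global_task_hash(task_fields):
    """Hash only the content-defining fields of a task, in a canonical order."""
    content = {k: v for k, v in task_fields.items() if k not in EXCLUDED_KEYS}
    canonical = json.dumps(content, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = {"sessionId": "run-a", "processName": "ALIGN",
         "script": "bwa mem ref.fa reads.fq", "container": "bwa:0.7",
         "inputs": ["sha256:1111", "sha256:2222"]}
run_b = dict(run_a, sessionId="run-b", processName="ALIGN_READS")

# Same script, container and inputs: both runs resolve to one cache key.
same_key = global_task_hash(run_a) == global_task_hash(run_b)
```

Any change to a content-defining field, such as the script text or an input checksum, yields a different key, which is what keeps cross-pipeline reuse safe.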