
ADR: Global cache#6796

Draft
jorgee wants to merge 2 commits into master from 20260202-global-cache

Conversation

@jorgee
Contributor

@jorgee jorgee commented Feb 3, 2026

Add ADR for Global Cache Feature

This PR introduces an Architecture Decision Record (ADR) for a global cache system that enables cross-pipeline task result sharing in Nextflow.

Overview

The global cache extends Nextflow's existing task caching mechanism to allow different pipeline executions to reuse computational results across:

  • Different users running the same analysis
  • Development and production pipeline versions
  • Parameter sweeps with shared preprocessing steps
  • Multi-environment deployments

Key Design Decisions

Architecture:

  • Builds on existing nf-cloudcache plugin infrastructure
  • Uses cloud object storage (S3, GCS, Azure Blob) as backend
  • Leverages strong consistency guarantees for concurrent access control

Content-Addressable Hashing:

  • Removes sessionId and processName from task hash computation
  • Enables content-based file hashing instead of path-based hashing
  • Allows identical tasks to share cache regardless of pipeline or session
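A minimal Python sketch of the idea (hypothetical helper, not Nextflow's actual hasher): once sessionId and processName are dropped, the key is derived only from inputs that affect the result, so identical tasks from different pipelines map to the same cache entry.

```python
import hashlib

def task_hash(script: str, input_hashes: list[str], container: str) -> str:
    """Content-addressable task key: only inputs that affect the result
    are hashed -- no sessionId or processName, so identical tasks from
    different pipeline runs map to the same cache entry."""
    h = hashlib.sha256()
    for part in [script, container, *sorted(input_hashes)]:
        h.update(part.encode())
        h.update(b"\x00")  # separator to avoid ambiguous concatenation
    return h.hexdigest()

# Two different pipeline runs executing the same task get the same key
k1 = task_hash("echo hi", ["abc123"], "ubuntu:22.04")
k2 = task_hash("echo hi", ["abc123"], "ubuntu:22.04")
assert k1 == k2
```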

Concurrency Control:

  • Simple collision-avoidance strategy using atomic cloud storage operations
  • Conditional PUT with preconditions (If-None-Match, ifGenerationMatch=0)
  • Hash increment on collision rather than waiting/polling (trades rare cache misses for simplicity)
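The collision-avoidance strategy above can be sketched as follows; `FakeObjectStore` is an in-memory stand-in for a bucket that supports atomic create-if-absent (the effect of a conditional PUT with `If-None-Match` on S3 or `ifGenerationMatch=0` on GCS), and all names are illustrative:

```python
import hashlib

class FakeObjectStore:
    """In-memory stand-in for a cloud bucket with atomic create-if-absent."""
    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key: str, value: bytes) -> bool:
        if key in self._objects:
            return False  # precondition failed: another run holds this slot
        self._objects[key] = value
        return True

def acquire_cache_slot(store, task_hash: str, owner: bytes, max_tries: int = 3) -> str:
    """Try to claim a cache entry; on collision, derive the next hash
    instead of waiting or polling -- trading a rare cache miss for simplicity."""
    key = task_hash
    for _ in range(max_tries):
        if store.put_if_absent(key, owner):
            return key
        key = hashlib.sha256(key.encode()).hexdigest()  # "hash increment"
    raise RuntimeError("could not acquire a cache slot")
```

With this scheme the second concurrent run of an identical task simply claims a derived key and recomputes, rather than blocking on the first run.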

Optimizations Considered:

  • Lineage ID (lid) reuse to reduce checksum computations
  • Cloud storage native checksums for hash computation

Implementation Phases

Phase 0 (Proof of Concept - #6100):

  • Associate nf-cloudcache path with global cache path
  • Use constant sessionId, remove processName from hash
  • Optional deep cache mode

Future Phases:

  • Content-based file hashing implementation
  • Cloud storage atomic lock acquisition
  • Configuration options and cleanup commands

Trade-offs and Limitations

  • Performance: Content hashing overhead for large files (mitigated by proposed optimizations)
  • Concurrency: Simultaneous execution of identical tasks results in redundant work (~1% of cases)
  • Compatibility: Conflicts with planned automatic workflow cleanup feature
  • Storage: Cloud storage costs (offset by compute savings from cache hits)

Related Issues


Status: Draft design document for discussion and feedback
Version: 1.0


Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee jorgee force-pushed the 20260202-global-cache branch from 861368a to 46832f3 on February 3, 2026 at 10:56

## Non-goals

- **Maintaining local filesystem cache**: Global cache is cloud storage only
Member


For what it's worth, it's actually quite easy to support local filesystems because nf-cloudcache works out-of-the-box with them. We only disallow it in the runtime, but #6100 re-allows it to help with testing.

The only consideration is handling race conditions. I assume you could just use regular file locks.
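A sketch of what "regular file locks" could look like for a local backend (illustrative only; uses Unix advisory `flock` locks, which are held for as long as the file object stays open):

```python
import fcntl

def try_lock(path: str):
    """Open a cache-entry lock file and take an exclusive, non-blocking
    flock. Returns the open file object (keep it open to hold the lock),
    or None if another process already owns the entry."""
    f = open(path, "a+")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None
```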

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@pditommaso
Member

I'd like to explore an approach based on File Content Sketch + Bloom Filter.

This should let us avoid re-hashing large files by caching sketch → fullHash mappings, with a Bloom filter as a fast pre-check.

Components:

  • Sketch→Hash store: persistent mapping of file sketches to their full BLAKE3 hashes
  • Bloom filter: fast check to avoid unnecessary store lookups

Flow:

  1. Compute a cheap sketch: hash(size, first4KB, last4KB, middle4KB), which takes ~1 ms regardless of file size
  2. Check bloomFilter.mightContain(sketch)
    • NO → sketch definitely not in store, compute full BLAKE3 hash, save mapping, update Bloom filter
    • MAYBE → lookup sketch in store
      • Found → return cached full hash (skip expensive BLAKE3)
      • Not found (false positive) → compute full BLAKE3 hash, save mapping
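The flow above can be sketched as self-contained Python, with a toy Bloom filter and stdlib `blake2b` standing in for BLAKE3 (all names illustrative):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions in an m-bit array."""
    def __init__(self, m: int = 8192, k: int = 4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: bytes):
        for i in range(self.k):
            d = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(item))

def cheap_sketch(data: bytes) -> bytes:
    """hash(size, first 4KB, middle 4KB, last 4KB): cost is independent of size."""
    mid = len(data) // 2
    parts = [str(len(data)).encode(), data[:4096], data[mid:mid + 4096], data[-4096:]]
    return hashlib.sha256(b"".join(parts)).digest()

def full_hash(data: bytes) -> str:
    # BLAKE3 in the proposal; blake2b stands in here to stay stdlib-only
    return hashlib.blake2b(data).hexdigest()

def hash_file(data: bytes, store: dict, bloom: BloomFilter) -> str:
    sketch = cheap_sketch(data)
    if bloom.might_contain(sketch):      # MAYBE -> look up the store
        cached = store.get(sketch)
        if cached is not None:           # found: skip the expensive hash
            return cached
    h = full_hash(data)                  # definitely-new, or a false positive
    store[sketch] = h
    bloom.add(sketch)
    return h
```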

@jorgee
Contributor Author

jorgee commented Feb 5, 2026

I will have a look at the file sketches. The trick in this solution is finding a sampling scheme with a very low chance of sketch collisions, since colliding sketches would produce false positives in the global cache.

@pditommaso
Member

Worth giving it a try!

@jorgee
Contributor Author

jorgee commented Feb 5, 2026

I looked deeper into the proposed solution:

  • I see some redundancy in the solution. The Bloom filter and sketch store are good for detecting that a file has not been seen before, in which case computing the full hash is mandatory. But using them to reuse a hash is incorrect: if two different files can generate the same sketch, the store lookup (whether or not it goes through the Bloom filter) returns the same hash for both. If we accept that, why not use the sketch itself as the hash? The result would be the same, and neither the store nor the Bloom filter would be needed.
  • Regarding the computation of sketches, options with a low chance of collision (MinHash, ...) require accessing different blocks across the file. When computing them for cloud storage, we either need to download the whole file or make several calls, so it will not be fast.

@pditommaso
Member

> When computing them for cloud storage, either we need to download the whole file

Not really, the S3 API allows fetching arbitrary byte ranges of a file
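For example, the three 4 KB sketch blocks can be fetched as byte ranges (the `Range` header of S3 `GetObject`) instead of a full download. The helper below is purely illustrative and only computes the range values:

```python
def sketch_ranges(size: int, block: int = 4096) -> list[str]:
    """HTTP Range header values for the first/middle/last blocks of an
    object. With S3 GetObject these become three small reads instead of
    one full download."""
    mid = size // 2
    spans = [(0, block - 1), (mid, mid + block - 1), (size - block, size - 1)]
    out = []
    for a, b in spans:
        a, b = max(a, 0), min(b, size - 1)  # clamp to the object
        r = f"bytes={a}-{b}"
        if r not in out:                    # drop duplicates for small files
            out.append(r)
    return out
```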

