
ADR: Global cache#6796

Draft
jorgee wants to merge 2 commits into master from 20260202-global-cache

Conversation

@jorgee
Contributor

@jorgee jorgee commented Feb 3, 2026

Add ADR for Global Cache Feature

This PR introduces an Architecture Decision Record (ADR) for a global cache system that enables cross-pipeline task result sharing in Nextflow.

Overview

The global cache extends Nextflow's existing task caching mechanism to allow different pipeline executions to reuse computational results across:

  • Different users running the same analysis
  • Development and production pipeline versions
  • Parameter sweeps with shared preprocessing steps
  • Multi-environment deployments

Key Design Decisions

Architecture:

  • Builds on existing nf-cloudcache plugin infrastructure
  • Uses cloud object storage (S3, GCS, Azure Blob) as backend
  • Leverages strong consistency guarantees for concurrent access control

Content-Addressable Hashing:

  • Removes sessionId and processName from task hash computation
  • Enables content-based file hashing instead of path-based hashing
  • Allows identical tasks to share cache regardless of pipeline or session
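A minimal Python sketch of the idea (hypothetical helper, not Nextflow's actual hasher): once sessionId and processName are dropped, the key is derived only from inputs that affect the result, so identical tasks from different pipelines map to the same cache entry.

```python
import hashlib

def task_hash(script: str, input_hashes: list[str], container: str) -> str:
    """Content-addressable task key: only inputs that affect the result
    are hashed -- no sessionId or processName, so identical tasks from
    different pipeline runs map to the same cache entry."""
    h = hashlib.sha256()
    for part in [script, container, *sorted(input_hashes)]:
        h.update(part.encode())
        h.update(b"\x00")  # separator to avoid ambiguous concatenation
    return h.hexdigest()

# Two different pipeline runs executing the same task get the same key
k1 = task_hash("echo hi", ["abc123"], "ubuntu:22.04")
k2 = task_hash("echo hi", ["abc123"], "ubuntu:22.04")
assert k1 == k2
```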

Concurrency Control:

  • Simple collision-avoidance strategy using atomic cloud storage operations
  • Conditional PUT with preconditions (If-None-Match, ifGenerationMatch=0)
  • Hash increment on collision rather than waiting/polling (trades rare cache misses for simplicity)
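The collision-avoidance strategy above can be sketched as follows; `FakeObjectStore` is an in-memory stand-in for a bucket that supports atomic create-if-absent (the effect of a conditional PUT with `If-None-Match` on S3 or `ifGenerationMatch=0` on GCS), and all names are illustrative:

```python
import hashlib

class FakeObjectStore:
    """In-memory stand-in for a cloud bucket with atomic create-if-absent."""
    def __init__(self):
        self._objects = {}

    def put_if_absent(self, key: str, value: bytes) -> bool:
        if key in self._objects:
            return False  # precondition failed: another run holds this slot
        self._objects[key] = value
        return True

def acquire_cache_slot(store, task_hash: str, owner: bytes, max_tries: int = 3) -> str:
    """Try to claim a cache entry; on collision, derive the next hash
    instead of waiting or polling -- trading a rare cache miss for simplicity."""
    key = task_hash
    for _ in range(max_tries):
        if store.put_if_absent(key, owner):
            return key
        key = hashlib.sha256(key.encode()).hexdigest()  # "hash increment"
    raise RuntimeError("could not acquire a cache slot")
```

With this scheme the second concurrent run of an identical task simply claims a derived key and recomputes, rather than blocking on the first run.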

Optimizations Considered:

  • Lineage ID (lid) reuse to reduce checksum computations
  • Cloud storage native checksums for hash computation

Implementation Phases

Phase 0 (Proof of Concept - #6100):

  • Associate nf-cloudcache path with global cache path
  • Use constant sessionId, remove processName from hash
  • Optional deep cache mode

Future Phases:

  • Content-based file hashing implementation
  • Cloud storage atomic lock acquisition
  • Configuration options and cleanup commands

Trade-offs and Limitations

  • Performance: Content hashing overhead for large files (mitigated by proposed optimizations)
  • Concurrency: Simultaneous execution of identical tasks results in redundant work (~1% of cases)
  • Compatibility: Conflicts with planned automatic workflow cleanup feature
  • Storage: Cloud storage costs (offset by compute savings from cache hits)

Related Issues


Status: Draft design document for discussion and feedback
Version: 1.0


Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee jorgee force-pushed the 20260202-global-cache branch from 861368a to 46832f3 on February 3, 2026 at 10:56

## Non-goals

- **Maintaining local filesystem cache**: Global cache is cloud storage only
Member


For what it's worth, it's actually quite easy to support local filesystems because nf-cloudcache works out-of-the-box with them. We only disallow it in the runtime, but #6100 re-allows it to help with testing.

The only consideration is handling race conditions. I assume you could just use regular file locks.
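A sketch of what "regular file locks" could look like for a local backend (illustrative only; uses Unix advisory `flock` locks, which are held for as long as the file object stays open):

```python
import fcntl

def try_lock(path: str):
    """Open a cache-entry lock file and take an exclusive, non-blocking
    flock. Returns the open file object (keep it open to hold the lock),
    or None if another process already owns the entry."""
    f = open(path, "a+")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f
    except BlockingIOError:
        f.close()
        return None
```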

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@pditommaso
Member

I'd like to explore an approach based on File Content Sketch + Bloom Filter.

This should let us avoid re-hashing large files by caching sketch → fullHash mappings, with a Bloom filter as a fast pre-check.

Components:

  • Sketch→Hash store: persistent mapping of file sketches to their full BLAKE3 hashes
  • Bloom filter: fast check to avoid unnecessary store lookups

Flow:

  1. Compute a cheap sketch: hash(size, first4KB, last4KB, middle4KB), which takes ~1 ms regardless of file size
  2. Check bloomFilter.mightContain(sketch)
    • NO → sketch definitely not in store, compute full BLAKE3 hash, save mapping, update Bloom filter
    • MAYBE → lookup sketch in store
      • Found → return cached full hash (skip expensive BLAKE3)
      • Not found (false positive) → compute full BLAKE3 hash, save mapping
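The flow above can be sketched as self-contained Python, with a toy Bloom filter and stdlib `blake2b` standing in for BLAKE3 (all names illustrative):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash positions in an m-bit array."""
    def __init__(self, m: int = 8192, k: int = 4):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item: bytes):
        for i in range(self.k):
            d = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item: bytes) -> bool:
        return all(self.bits >> p & 1 for p in self._positions(item))

def cheap_sketch(data: bytes) -> bytes:
    """hash(size, first 4KB, middle 4KB, last 4KB): cost is independent of size."""
    mid = len(data) // 2
    parts = [str(len(data)).encode(), data[:4096], data[mid:mid + 4096], data[-4096:]]
    return hashlib.sha256(b"".join(parts)).digest()

def full_hash(data: bytes) -> str:
    # BLAKE3 in the proposal; blake2b stands in here to stay stdlib-only
    return hashlib.blake2b(data).hexdigest()

def hash_file(data: bytes, store: dict, bloom: BloomFilter) -> str:
    sketch = cheap_sketch(data)
    if bloom.might_contain(sketch):      # MAYBE -> look up the store
        cached = store.get(sketch)
        if cached is not None:           # found: skip the expensive hash
            return cached
    h = full_hash(data)                  # definitely-new, or a false positive
    store[sketch] = h
    bloom.add(sketch)
    return h
```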

@jorgee
Contributor Author

jorgee commented Feb 5, 2026

I will have a look at the file sketches. The trick in this solution is finding a sampling scheme with a very low chance of sketch collisions, since colliding sketches would produce false positives in the global cache.

@pditommaso
Member

Worth giving it a try!

@jorgee
Contributor Author

jorgee commented Feb 5, 2026

I looked deeper into the proposed solution:

  • I see some redundancy in the solution. The Bloom filter and sketch store are good for detecting that a file has not been seen before, in which case computing the full hash is mandatory. But using them to reuse a hash is incorrect: if two different files can generate the same sketch, the store lookup (whether or not it goes through the Bloom filter) returns the same hash for both. If we accept that, why not use the sketch itself as the hash? The result would be the same, and neither the store nor the Bloom filter would be needed.
  • Regarding the computation of sketches, options with a low chance of collision (MinHash, ...) require accessing different blocks across the file. When computing them for cloud storage, we either need to download the whole file or make several calls, so it will not be fast.

@pditommaso
Member

> When computing them for cloud storage, either we need to download the whole file

Not really, the S3 API allows fetching arbitrary byte ranges of a file
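For example, the three 4 KB sketch blocks can be fetched as byte ranges (the `Range` header of S3 `GetObject`) instead of a full download. The helper below is purely illustrative and only computes the range values:

```python
def sketch_ranges(size: int, block: int = 4096) -> list[str]:
    """HTTP Range header values for the first/middle/last blocks of an
    object. With S3 GetObject these become three small reads instead of
    one full download."""
    mid = size // 2
    spans = [(0, block - 1), (mid, mid + block - 1), (size - block, size - 1)]
    out = []
    for a, b in spans:
        a, b = max(a, 0), min(b, size - 1)  # clamp to the object
        r = f"bytes={a}-{b}"
        if r not in out:                    # drop duplicates for small files
            out.append(r)
    return out
```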

