
Add Seqera NIO filesystem for datasets and refactor TowerClient/TowerObserver split#6946

Open
jorgee wants to merge 12 commits into master from
260310-seqera-dataset-fs


@jorgee
Contributor

@jorgee jorgee commented Mar 19, 2026

Summary

  • Implements a seqera:// NIO FileSystemProvider in nf-tower, enabling Nextflow pipelines to reference Seqera Platform datasets as standard file paths (e.g. seqera://org/workspace/datasets/name)
  • Path hierarchy: root → org → workspace → resource type → dataset file (with optional @version pinning)
  • Refactors TowerClient into two classes: TowerClient (pure HTTP API client) and TowerObserver (workflow telemetry via TraceObserverV2), so the new filesystem can reuse the API client without creating an observer or pulling in its lifecycle.
  • Merges TowerCommonApi into TowerClient, which is now the natural home for shared API methods
  • Registers via META-INF/services/java.nio.file.spi.FileSystemProvider
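
To make the path hierarchy concrete, here is a minimal Java sketch of how a seqera:// string could be split into its elements. The class name `SeqeraPathSketch`, its fields, and the rule that the last `@` in the dataset element starts the version are assumptions for illustration; this is not the `SeqeraPath` class from this PR.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not the SeqeraPath class from this PR. */
final class SeqeraPathSketch {
    static final String PREFIX = "seqera://";

    final List<String> names; // up to 4 elements: org, workspace, resource type, dataset
    final String version;     // pinned version, or null meaning "latest"

    private SeqeraPathSketch(List<String> names, String version) {
        this.names = names;
        this.version = version;
    }

    static SeqeraPathSketch parse(String uri) {
        if (!uri.startsWith(PREFIX)) {
            throw new IllegalArgumentException("Not a seqera:// path: " + uri);
        }
        List<String> names = new ArrayList<>();
        for (String part : uri.substring(PREFIX.length()).split("/")) {
            if (!part.isEmpty()) names.add(part);
        }
        if (names.size() > 4) {
            throw new IllegalArgumentException("Too many path elements: " + uri);
        }
        String version = null;
        if (names.size() == 4) {
            // Assumption: the last '@' in the dataset element starts the version.
            String last = names.get(3);
            int at = last.lastIndexOf('@');
            if (at > 0 && at < last.length() - 1) {
                version = last.substring(at + 1);
                names.set(3, last.substring(0, at));
            }
        }
        return new SeqeraPathSketch(names, version);
    }

    /** Number of name elements: 0 for the root, 4 for a dataset file. */
    int depth() { return names.size(); }

    public static void main(String[] args) {
        SeqeraPathSketch p = parse("seqera://acme/research/datasets/samples@2");
        System.out.println(p.names + " version=" + p.version);
    }
}
```

An unpinned path such as seqera://acme/research/datasets/samples leaves `version` as null, which the sketch treats as "latest".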

New files

| File | Purpose |
| --- | --- |
| TowerObserver | Extracted TraceObserverV2 implementation (task events, heartbeats, workflow lifecycle) formerly in TowerClient |
| dataset/SeqeraDatasetClient | API calls: list orgs/workspaces/datasets/versions, download |
| fs/SeqeraPath | Path implementation with 0–4 depth hierarchy |
| fs/SeqeraFileSystem | FileSystem with lazy org/workspace/dataset caches |
| fs/SeqeraFileSystemProvider | FileSystemProvider SPI: read, write, list, attributes, copy |
| fs/SeqeraFileAttributes | BasicFileAttributes backed by dataset metadata |
| fs/SeqeraPathFactory | Nextflow PathFactory integration |
| fs/ResourceTypeHandler, fs/DatasetsResourceHandler | Extensibility interface for future resource types |
| fs/DatasetInputStream, fs/DatasetOutputStream | Stream wrappers for dataset read/write |
| exception/ForbiddenException, exception/NotFoundException | HTTP error types for API responses |

Changes to existing files

| File | Change |
| --- | --- |
| TowerClient | Stripped of observer logic; now a pure API client. Added public sendApiRequest() + GET support in makeRequest(). Absorbed TowerCommonApi methods. |
| TowerCommonApi (deleted) | Methods merged into TowerClient |
| TowerFactory | Creates TowerObserver and TowerClient separately. client() now also activates when accessToken is present, so seqera:// paths work without tower.enabled |
| TowerPlugin | Registers SeqeraPathFactory |
| BaseCommandImpl, AuthCommandImpl, LaunchCommandImpl | Updated to use the refactored TowerClient API |
| Tests | Split accordingly: TowerClientTest for API client, TowerObserverTest for observer; new tests for all fs/ and dataset/ classes |

Test plan

  • SeqeraPathTest — path parsing, URI round-trips, relativize/resolve, getFileName, asUri
  • SeqeraFileSystemTest — cache loading, workspace/dataset resolution, thread safety
  • SeqeraFileSystemProviderTest — newInputStream (latest + pinned version), readAttributes, newDirectoryStream, error propagation
  • SeqeraDatasetClientTest — API URL construction, response mapping, error handling
  • TowerObserverTest — extracted observer logic works identically
  • TowerClientTest — API client methods after refactor
./gradlew :plugins:nf-tower:test

jorgee added 4 commits March 19, 2026 11:42
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@netlify

netlify bot commented Mar 19, 2026

Deploy Preview for nextflow-docs-staging canceled.

Name Link
🔨 Latest commit c75686c
🔍 Latest deploy log https://app.netlify.com/projects/nextflow-docs-staging/deploys/69d79466e504900008eb9e87

@jorgee jorgee marked this pull request as draft March 19, 2026 14:09
@jorgee
Contributor Author

jorgee commented Mar 19, 2026

Some comments about current implementation:

  • Some refactoring is needed to decouple the Tower client (API calls) from the observer. One client must be initialized at filesystem initialization and another at observer initialization. Because of token refresh, the HxClient must be shared to avoid authentication problems when using different clients. I would like to do this after merging #6545 (Add platform-related metadata to WorkflowRun lineage record).
  • The Dataset API does not allow streaming the content, so read and write are done through temporary files.
  • Only csv and tsv extensions are allowed; the format is recognized by the extension.
  • Given the points above, I am considering making it read-only.
  • Every change to a dataset creates a new version. seqera://org/workspace/datasets/name accesses the latest version, and seqera://org/workspace/datasets/name@version accesses a specific version.

jorgee added 7 commits March 20, 2026 11:39
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
…et-fs

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
…et-fs

Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
@jorgee jorgee changed the title Add seqera:// NIO filesystem for Seqera Platform datasets Add seqera:// NIO filesystem and refactor TowerClient/TowerObserver split Apr 9, 2026
@jorgee jorgee changed the title Add seqera:// NIO filesystem and refactor TowerClient/TowerObserver split Add Seqera NIO filesystem for datasets and refactor TowerClient/TowerObserver split Apr 9, 2026
@jorgee jorgee marked this pull request as ready for review April 9, 2026 11:58
@jorgee
Contributor Author

jorgee commented Apr 9, 2026

Updated to the latest changes in master. Ready for review. It is implemented as a read-only FS.

  • The dataset path: seqera://<org>/<workspace>/datasets/<name>
  • You can use the nextflow fs command to browse datasets in Seqera Platform
# List orgs
$ nextflow fs ls seqera://*
seqera-academy
nf-core
seqeralabs
community

# List workspaces
$ nextflow fs ls seqera://nf-core/*
AWSmegatests

# List available items in the workspace (currently just datasets)
$ nextflow fs ls seqera://nf-core/AWSmegatests/*
datasets

# List available datasets
$ nextflow fs ls seqera://nf-core/AWSmegatests/datasets/*
GM_nascent
feb-16-test-6
methylseq_test_full
node-red-tests
proteinfamilies
rnaseq_samplesheet_full
test_rnaseq

# Show the content of a dataset
$ nextflow fs cat seqera://nf-core/AWSmegatests/datasets/test_rnaseq
sample,fastq_1,fastq_2,strandedness
WT_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357070_1.fastq.gz,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357070_2.fastq.gz,reverse
WT_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357071_1.fastq.gz,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357071_2.fastq.gz,reverse
WT_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357072_1.fastq.gz,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357073_1.fastq.gz,,reverse
RAP1_UNINDUCED_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357074_1.fastq.gz,,reverse
RAP1_UNINDUCED_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357075_1.fastq.gz,,reverse
RAP1_IAA_30M_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357076_1.fastq.gz,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357076_2.fastq.gz,reverse

# Download a dataset
$ nextflow fs cp seqera://nf-core/AWSmegatests/datasets/test_rnaseq test_rnaseq.csv
$ cat test_rnaseq.csv 
sample,fastq_1,fastq_2,strandedness
WT_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357070_1.fastq.gz,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357070_2.fastq.gz,reverse
WT_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357071_1.fastq.gz,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357071_2.fastq.gz,reverse
WT_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357072_1.fastq.gz,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357072_2.fastq.gz,reverse
RAP1_UNINDUCED_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357073_1.fastq.gz,,reverse
RAP1_UNINDUCED_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357074_1.fastq.gz,,reverse
RAP1_UNINDUCED_REP2,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357075_1.fastq.gz,,reverse
RAP1_IAA_30M_REP1,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357076_1.fastq.gz,s3://nf-core-awsmegatests/rnaseq/input_data/minimal/GSE110004/SRR6357076_2.fastq.gz,reverse
  • A dataset can be used in a pipeline as follows (read access only for now)
params.dataset = 'seqera://seqeralabs/showcase/datasets/sarek_samples'
process TEST {
        input:
                path(file)
        output:
                stdout
        script:
        """
        cat $file
        """
}

workflow {
        TEST(file(params.dataset)).view()
}

@bentsherman bentsherman requested a review from pditommaso April 9, 2026 13:55
@pditommaso pditommaso requested a review from jordeu April 9, 2026 14:13
@pditommaso
Member

Pulling in @jordeu to make sure this is aligned with Fusion

@bentsherman
Member

@jorgee can you write a small ADR describing the Seqera filesystem hierarchy? That way we can more easily make sure Fusion and Nextflow are aligned

Member

@pditommaso pditommaso left a comment


⚠️ Agent-generated review — worth looking into these points.


Critical

1. Dead getAccessToken() in TowerObserver (TowerObserver.groovy:251-255)
No accessToken field exists in TowerObserver — Groovy will resolve it as a null dynamic property, so this always throws. It's dead code left over from the split. Remove it.

2. Dead loadSchema() in TowerObserver (TowerObserver.groovy:537-546)
Identical to TowerClient.loadSchema() and never called from TowerObserver. Remove.

3. newFileSystem() violates NIO contract (SeqeraFileSystemProvider.groovy:76-88)
Per FileSystemProvider.newFileSystem(), if a FS already exists for the URI, it must throw FileSystemAlreadyExistsException. Currently it silently returns the existing instance. The getOrCreateFileSystem() method already handles the "get or create" pattern correctly.
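
A minimal sketch of the three lookup behaviours, assuming a single map keyed by a string. `FileSystemRegistrySketch` and its method names are hypothetical, and the type parameter `V` stands in for the provider's FileSystem class; only the two exception types come from java.nio.file.

```java
import java.nio.file.FileSystemAlreadyExistsException;
import java.nio.file.FileSystemNotFoundException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical registry; V stands in for the provider's FileSystem type. */
final class FileSystemRegistrySketch<V> {
    private final Map<String, V> fileSystems = new HashMap<>();

    /** Mirrors newFileSystem(): creating a duplicate is an error. */
    synchronized V create(String key, Function<String, V> factory) {
        if (fileSystems.containsKey(key)) {
            throw new FileSystemAlreadyExistsException(key);
        }
        V fs = factory.apply(key);
        fileSystems.put(key, fs);
        return fs;
    }

    /** Mirrors getFileSystem(): a missing entry is an error. Takes the same lock as the writers. */
    synchronized V get(String key) {
        V fs = fileSystems.get(key);
        if (fs == null) {
            throw new FileSystemNotFoundException(key);
        }
        return fs;
    }

    /** Get-or-create, for internal callers that do not care which case applies. */
    synchronized V getOrCreate(String key, Function<String, V> factory) {
        V fs = fileSystems.get(key);
        if (fs == null) {
            fs = factory.apply(key);
            fileSystems.put(key, fs);
        }
        return fs;
    }

    public static void main(String[] args) {
        FileSystemRegistrySketch<String> registry = new FileSystemRegistrySketch<>();
        System.out.println(registry.getOrCreate("seqera", k -> "fs-for-" + k));
    }
}
```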


Important

4. getFileSystem() is not synchronized (SeqeraFileSystemProvider.groovy:91)
newFileSystem() and getOrCreateFileSystem() are synchronized, but getFileSystem() reads from the LinkedHashMap without synchronization — data race. Either synchronize it or switch to ConcurrentHashMap.

5. resolveDataset() thread safety gap (SeqeraFileSystem.groovy:186)
Calls synchronized resolveDatasets() which returns the mutable list from datasetCache, then iterates it outside the lock. Concurrent invalidateDatasetCache() could cause ConcurrentModificationException. Either synchronize resolveDataset() or return a defensive copy from resolveDatasets().
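
The defensive-copy option could look roughly like this. `DatasetCacheSketch`, the `Supplier` standing in for the API call, and the use of plain strings for datasets are all assumptions; the real cache in SeqeraFileSystem is keyed and typed differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical cache; the Supplier stands in for the dataset-listing API call. */
final class DatasetCacheSketch {
    private final Supplier<List<String>> loader;
    private List<String> cache; // null means "not loaded yet"

    DatasetCacheSketch(Supplier<List<String>> loader) {
        this.loader = loader;
    }

    /** Returns an immutable snapshot, so callers can iterate without holding the lock. */
    synchronized List<String> resolveDatasets() {
        if (cache == null) {
            cache = new ArrayList<>(loader.get());
        }
        return Collections.unmodifiableList(new ArrayList<>(cache));
    }

    synchronized void invalidate() {
        cache = null;
    }

    /** Iterates over the snapshot; a concurrent invalidate() cannot affect this loop. */
    String resolveDataset(String name) {
        for (String dataset : resolveDatasets()) {
            if (dataset.equals(name)) return dataset;
        }
        return null;
    }

    public static void main(String[] args) {
        DatasetCacheSketch cache = new DatasetCacheSketch(() -> Arrays.asList("test_rnaseq", "GM_nascent"));
        System.out.println(cache.resolveDataset("test_rnaseq"));
    }
}
```

The copy costs one list allocation per call, which seems acceptable for dataset listings of the size shown in this PR.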

6. SeqeraPath.iterator() returns absolute paths instead of name components (SeqeraPath.groovy:386-394)
NIO contract says Path.iterator() yields relative name elements (acme, research, datasets, samples). This implementation returns increasingly deeper absolute paths. Will break code relying on standard Path.iterator() semantics.
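
As a sketch of the expected behaviour, using plain strings instead of Path objects: `NameElementsSketch` is hypothetical, and a real implementation would return an `Iterator<Path>` whose items are single-element relative paths.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical helper; strings stand in for single-element relative paths. */
final class NameElementsSketch {
    static final String PREFIX = "seqera://";

    /** Splits a seqera:// string into its name elements, dropping the scheme. */
    static List<String> nameElements(String uri) {
        String rest = uri.startsWith(PREFIX) ? uri.substring(PREFIX.length()) : uri;
        List<String> elements = new ArrayList<>();
        for (String part : rest.split("/")) {
            if (!part.isEmpty()) elements.add(part);
        }
        return elements;
    }

    /** Yields one name per level, never an accumulated absolute path. */
    static Iterator<String> iterator(String uri) {
        return nameElements(uri).iterator();
    }

    public static void main(String[] args) {
        Iterator<String> it = iterator("seqera://acme/research/datasets/samples");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```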

7. startsWith()/endsWith() use string comparison (SeqeraPath.groovy:243-259)
seqera://acme-corp.startsWith(seqera://acme) → true, which is wrong. NIO expects component-wise comparison.
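
A component-wise comparison could be sketched as below. `ComponentPrefixSketch` is a hypothetical helper that works on strings; it ignores the absolute-versus-relative rules that the full Path contract also specifies.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper comparing name elements rather than raw strings. */
final class ComponentPrefixSketch {
    static final String PREFIX = "seqera://";

    static List<String> split(String uri) {
        String rest = uri.startsWith(PREFIX) ? uri.substring(PREFIX.length()) : uri;
        List<String> elements = new ArrayList<>();
        for (String part : rest.split("/")) {
            if (!part.isEmpty()) elements.add(part);
        }
        return elements;
    }

    /** True only if every element of other matches the leading elements of path. */
    static boolean startsWith(String path, String other) {
        List<String> a = split(path);
        List<String> b = split(other);
        return b.size() <= a.size() && a.subList(0, b.size()).equals(b);
    }

    /** True only if every element of other matches the trailing elements of path. */
    static boolean endsWith(String path, String other) {
        List<String> a = split(path);
        List<String> b = split(other);
        return b.size() <= a.size() && a.subList(a.size() - b.size(), a.size()).equals(b);
    }

    public static void main(String[] args) {
        System.out.println(startsWith("seqera://acme-corp", "seqera://acme"));
        System.out.println(startsWith("seqera://acme/research", "seqera://acme"));
    }
}
```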

8. Error message copy-paste bug in AuthCommandImpl.deleteTokenViaApi

final error = response.message ?: "HTTP ${response.message}"

The fallback should use response.code, not response.message again.
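
The intended fallback, written out as a small Java sketch. `ErrorMessageSketch` and its parameters are hypothetical; the empty-string check mirrors how Groovy's `?:` treats an empty string as false.

```java
/** Hypothetical helper; message and code stand in for the response fields. */
final class ErrorMessageSketch {
    /** Prefers the server message and falls back to the HTTP status code. */
    static String describe(String message, int code) {
        return (message != null && !message.isEmpty()) ? message : "HTTP " + code;
    }

    public static void main(String[] args) {
        System.out.println(describe(null, 403));
        System.out.println(describe("Forbidden", 403));
    }
}
```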


Suggestions

9. DatasetInputStream.size() returns -1 (DatasetInputStream.groovy:60)
Not a valid SeekableByteChannel.size() return value. Also, SeqeraFileAttributes.size() always returns 0 — Files.size() will report 0 for all datasets, which could break Nextflow's file staging/progress logic.

10. fsKey() always returns just the scheme (SeqeraFileSystemProvider.groovy:292-295)
One FS per JVM is the intent, but if different endpoints/tokens are ever needed this will silently share the wrong FS. At minimum document the limitation.

11. ResourceTypeHandler / DatasetsResourceHandler are dead code
SeqeraFileSystemProvider handles datasets inline and doesn't use these strategy classes. Either wire them in or remove until needed.

12. DatasetOutputStream references unimplemented uploadDataset()
SeqeraDatasetClient.uploadDataset() throws UnsupportedOperationException. Since the FS is read-only, DatasetOutputStream is unreachable but should be removed or clearly gated.

13. Unrelated changes bundled
ScriptResolveVisitor.java (Record/Tuple input resolution) and CmdModuleCreate.groovy (namespace validation) are unrelated to this feature — ideally separate PRs for bisectability.

14. Full memory buffering on download (SeqeraDatasetClient.groovy:138)
Entire dataset loaded as String then converted to ByteArrayInputStream — ~40 MB heap per concurrent download for large datasets. Known limitation per PR description, but worth a TODO.
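
One possible shape for that TODO, assuming the HTTP client can expose the response body as an `InputStream` (not confirmed by this PR): spool the body to a temporary file and hand back a stream that deletes the file on close. `SpoolToTempSketch` and its method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper; assumes the response body is available as an InputStream. */
final class SpoolToTempSketch {
    /** Copies body to a temp file and returns a stream that deletes the file on close. */
    static InputStream spool(InputStream body) throws IOException {
        Path tmp = Files.createTempFile("seqera-dataset-", ".tmp");
        try {
            // createTempFile already created the file, so it must be replaced.
            Files.copy(body, tmp, StandardCopyOption.REPLACE_EXISTING);
            return Files.newInputStream(tmp, StandardOpenOption.DELETE_ON_CLOSE);
        } catch (IOException e) {
            Files.deleteIfExists(tmp);
            throw e;
        }
    }

    /** Convenience for demos: spools data and reads it back. */
    static byte[] roundTrip(byte[] data) {
        try (InputStream in = spool(new ByteArrayInputStream(data))) {
            return in.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] back = roundTrip("sample,fastq_1\n".getBytes(StandardCharsets.UTF_8));
        System.out.println(back.length + " bytes read back");
    }
}
```

Heap use then stays bounded by the copy buffer rather than the dataset size, at the cost of one temporary file per open stream.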

3 participants