Add Seqera NIO filesystem for datasets and refactor TowerClient/TowerObserver split#6946
Add Seqera NIO filesystem for datasets and refactor TowerClient/TowerObserver split#6946
Conversation
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
✅ Deploy Preview for nextflow-docs-staging canceled.
|
|
Some comments about current implementation:
|
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
…et-fs Signed-off-by: jorgee <jorge.ejarque@seqera.io>
…et-fs Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
Signed-off-by: jorgee <jorge.ejarque@seqera.io>
|
Updated to the latest changes in master. Ready for review. It is implemented as read-only FS
|
|
Pulling @jordeu to be sure this is aligned with Fusion |
|
@jorgee can you write a small ADR describing the seqera filesystem hierarchy? that way we can make sure Fusion and Nextflow are aligned more easily |
pditommaso
left a comment
There was a problem hiding this comment.
Critical
1. Dead getAccessToken() in TowerObserver (TowerObserver.groovy:251-255)
No accessToken field exists in TowerObserver — Groovy will resolve it as a null dynamic property, so this always throws. It's dead code left over from the split. Remove it.
2. Dead loadSchema() in TowerObserver (TowerObserver.groovy:537-546)
Identical to TowerClient.loadSchema() and never called from TowerObserver. Remove.
3. newFileSystem() violates NIO contract (SeqeraFileSystemProvider.groovy:76-88)
Per FileSystemProvider.newFileSystem(), if a FS already exists for the URI, it must throw FileSystemAlreadyExistsException. Currently it silently returns the existing instance. The getOrCreateFileSystem() method already handles the "get or create" pattern correctly.
Important
4. getFileSystem() is not synchronized (SeqeraFileSystemProvider.groovy:91)
newFileSystem() and getOrCreateFileSystem() are synchronized, but getFileSystem() reads from the LinkedHashMap without synchronization — data race. Either synchronize it or switch to ConcurrentHashMap.
5. resolveDataset() thread safety gap (SeqeraFileSystem.groovy:186)
Calls synchronized resolveDatasets() which returns the mutable list from datasetCache, then iterates it outside the lock. Concurrent invalidateDatasetCache() could cause ConcurrentModificationException. Either synchronize resolveDataset() or return a defensive copy from resolveDatasets().
6. SeqeraPath.iterator() returns absolute paths instead of name components (SeqeraPath.groovy:386-394)
NIO contract says Path.iterator() yields relative name elements (acme, research, datasets, samples). This implementation returns increasingly deeper absolute paths. Will break code relying on standard Path.iterator() semantics.
7. startsWith()/endsWith() use string comparison (SeqeraPath.groovy:243-259)
seqera://acme-corp.startsWith(seqera://acme) → true, which is wrong. NIO expects component-wise comparison.
8. Error message copy-paste bug in AuthCommandImpl.deleteTokenViaApi
final error = response.message ?: "HTTP ${response.message}"The fallback should use response.code, not response.message again.
Suggestions
9. DatasetInputStream.size() returns -1 (DatasetInputStream.groovy:60)
Not a valid SeekableByteChannel.size() return value. Also, SeqeraFileAttributes.size() always returns 0 — Files.size() will report 0 for all datasets, which could break Nextflow's file staging/progress logic.
10. fsKey() always returns just the scheme (SeqeraFileSystemProvider.groovy:292-295)
One FS per JVM is the intent, but if different endpoints/tokens are ever needed this will silently share the wrong FS. At minimum document the limitation.
11. ResourceTypeHandler / DatasetsResourceHandler are dead code
SeqeraFileSystemProvider handles datasets inline and doesn't use these strategy classes. Either wire them in or remove until needed.
12. DatasetOutputStream references unimplemented uploadDataset()
SeqeraDatasetClient.uploadDataset() throws UnsupportedOperationException. Since the FS is read-only, DatasetOutputStream is unreachable but should be removed or clearly gated.
13. Unrelated changes bundled
ScriptResolveVisitor.java (Record/Tuple input resolution) and CmdModuleCreate.groovy (namespace validation) are unrelated to this feature — ideally separate PRs for bisectability.
14. Full memory buffering on download (SeqeraDatasetClient.groovy:138)
Entire dataset loaded as String then converted to ByteArrayInputStream — ~40 MB heap per concurrent download for large datasets. Known limitation per PR description, but worth a TODO.
Summary
seqera://NIOFileSystemProviderinnf-tower, enabling Nextflow pipelines to reference Seqera Platform datasets as standard file paths (e.g.seqera://org/workspace/datasets/name)@versionpinning)TowerClientinto two classes:TowerClient(pure HTTP API client) andTowerObserver(workflow telemetry viaTraceObserverV2), so the API client can be reused by the new filesystem without pulling in observer lifecycle. This is done to allow using the FS without requiring to create an observer.TowerCommonApiintoTowerClient, which is now the natural home for shared API methodsMETA-INF/services/java.nio.file.spi.FileSystemProviderNew files
TowerObserverTraceObserverV2implementation (task events, heartbeats, workflow lifecycle) formerly inTowerClientdataset/SeqeraDatasetClientfs/SeqeraPathPathimplementation with 0–4 depth hierarchyfs/SeqeraFileSystemFileSystemwith lazy org/workspace/dataset cachesfs/SeqeraFileSystemProviderFileSystemProviderSPI: read, write, list, attributes, copyfs/SeqeraFileAttributesBasicFileAttributesbacked by dataset metadatafs/SeqeraPathFactoryPathFactoryintegrationfs/ResourceTypeHandler,fs/DatasetsResourceHandlerfs/DatasetInputStream,fs/DatasetOutputStreamexception/ForbiddenException,exception/NotFoundExceptionChanges to existing files
TowerClientsendApiRequest()+ GET support inmakeRequest(). AbsorbedTowerCommonApimethods.TowerCommonApi(deleted)TowerClientTowerFactoryTowerObserverandTowerClientseparately.client()now also activates whenaccessTokenis present, soseqera://paths work withouttower.enabledTowerPluginSeqeraPathFactoryBaseCommandImpl,AuthCommandImpl,LaunchCommandImplTowerClientAPITowerClientTestfor API client,TowerObserverTestfor observer; new tests for allfs/anddataset/classesTest plan
SeqeraPathTest— path parsing, URI round-trips, relativize/resolve,getFileName,asUriSeqeraFileSystemTest— cache loading, workspace/dataset resolution, thread safetySeqeraFileSystemProviderTest— newInputStream (latest + pinned version), readAttributes, newDirectoryStream, error propagationSeqeraDatasetClientTest— API URL construction, response mapping, error handlingTowerObserverTest— extracted observer logic works identicallyTowerClientTest— API client methods after refactor