
AlphaFold in Nextflow using Azure Batch #6842

@venkatt007

Azure Batch + nf-core/proteinfold: AlphaFold DB Files Always Staging (Even with blobfuse2 Mounts)

Hi all,

I’m running nf-core/proteinfold (v1.1.1) on Azure Batch using the azurebatch executor in a closed private network, and I’m trying to prevent the massive AlphaFold database from being staged into the Azure Blob work directory on every run.

Despite using blobfuse2 mounts and setting stageInMode = 'symlink', the pipeline continues to stage DB files into az://.../work/stage-*. The files are also correctly mounted on the Batch nodes.

I’m looking for confirmation of expected behavior and/or best practice for this architecture.

Environment

Nextflow: 25.10.x
nf-core/proteinfold: 1.1.1
Executor: azurebatch
Containers: Docker
Azure Storage: Blob storage mounted on compute nodes via blobfuse2
Database size: Multi-terabyte AlphaFold reference DB

Azure Batch Setup

Each Batch node has blobfuse2 mounts configured at:

/mnt/batch/tasks/fsmounts/input
/mnt/batch/tasks/fsmounts/results
/mnt/batch/tasks/fsmounts/work

Verified with:

blobfuse2  fuse  24G  ...  /mnt/batch/tasks/fsmounts/input
blobfuse2  fuse  24G  ...  /mnt/batch/tasks/fsmounts/results
blobfuse2  fuse  24G  ...  /mnt/batch/tasks/fsmounts/work

The AlphaFold DB is located under:

/mnt/batch/tasks/fsmounts/work/alphafolddb/alphafolddb

Goal

Avoid staging/copying the AlphaFold DB into:

az:///work/stage-/...

The DB already exists on mounted storage accessible to all nodes.

Configuration Attempt

nextflow config (simplified)

process {
  executor = 'azurebatch'
  stageInMode  = 'symlink'
  stageOutMode = 'rsync'
}

workDir = '/mnt/batch/tasks/fsmounts/work/work'

fusion.enabled = false
wave.enabled   = false
tower.enabled  = false

docker.enabled = true

params

input: "/mnt/batch/tasks/fsmounts/input/samplesheet.csv"
outdir: "/mnt/batch/tasks/fsmounts/results/test1"
alphafold2_db: "/mnt/batch/tasks/fsmounts/work/alphafolddb/alphafolddb"
bfd_path: "/mnt/batch/tasks/fsmounts/work/alphafolddb/alphafolddb/bfd/*"
...

Observed Behavior

Even with:

Local POSIX paths only (no az://)
stageInMode = 'symlink'
Mounted storage on all nodes

The log still shows:

FilePorter - Copying foreign file /mnt/batch/tasks/fsmounts/work/alphafolddb/...
to work dir: az:///work/stage-/...

And on interruption:

port 4: (value) bound ; channel: bfd/*
port 5: (value) bound ; channel: small_bfd/*
port 6: (value) bound ; channel: mgnify/*
...

So it appears that the pipeline is materializing DB glob paths as path inputs, which forces Azure Batch localization via object storage staging.

What I’ve Tried

Using blobfuse2 mounts only (no Fusion)
Using Fusion instead of blobfuse
Mounting DB inside container with containerOptions
Overriding RUN_ALPHAFOLD2 module to use DB root directly
Using both az:// and POSIX-only configurations

The staging persists as long as DB-related parameters are passed as path inputs.

My Understanding (Please Confirm)

It seems that:

Azure Batch executor requires inputs to be localized into the remote workDir.
If a process declares path inputs (e.g., path('bfd/*')), Nextflow treats them as managed inputs.
On Azure Batch, this results in uploading those files into the az://work/stage- area.
blobfuse mounts do not prevent this behavior.
The only way to avoid DB staging is:

Use Fusion with az:// paths, or
Refactor the pipeline so the DB is passed as a val string (not path inputs).

Is that correct?
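
To make the second option concrete, here is a rough sketch of what I mean by passing the DB as a val string. This is not the actual nf-core/proteinfold module: the process name RUN_ALPHAFOLD2_MOUNTED, the params.fasta parameter, and the placeholder script body are all hypothetical, and I am assuming the host blobfuse2 mount can be bind-mounted into the container through containerOptions on Azure Batch.

```nextflow
// Hypothetical module override: names and script body are illustrative only.
process RUN_ALPHAFOLD2_MOUNTED {
    // Assumption: the host blobfuse2 mount is bind-mounted read-only
    // into the container at the same path it has on the node.
    containerOptions '-v /mnt/batch/tasks/fsmounts/work/alphafolddb:/mnt/batch/tasks/fsmounts/work/alphafolddb:ro'

    input:
    tuple val(meta), path(fasta)   // small per-sample input, still staged as usual
    val db_root                    // plain string, so there is no file for Nextflow to stage

    output:
    tuple val(meta), path("${meta.id}_db_listing.txt")

    script:
    """
    # Placeholder command: the task reads the DB in place from the mount.
    ls ${db_root}/bfd > ${meta.id}_db_listing.txt
    """
}

workflow {
    // params.fasta is a hypothetical parameter used only for this sketch.
    fasta_ch = Channel.of([[id: 'test1'], file(params.fasta)])
    RUN_ALPHAFOLD2_MOUNTED(fasta_ch, params.alphafold2_db)
}
```

The trade-off, as I understand it, is that a val input is hashed only as a string, so changes to the DB contents would not invalidate cached tasks on -resume.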

Questions

Is there any supported way to prevent localization of large path inputs on Azure Batch when using mounted blob storage?
Has anyone successfully run nf-core/proteinfold on private Azure Batch with multi-TB AlphaFold DBs without massive staging overhead?

Any clarification or architectural recommendations would be greatly appreciated.
