
Conversation

@harper357
Contributor
commented Aug 5, 2025

This PR adds support for running multiple bin QC tools, fixing issue #713.

It does this by adding an extra parameter, binqc_tool_extras, where users can specify extra tools to run; their outputs are saved, but they are not used by any downstream process.

This will probably have to be rebased after PR #841.

Not sure if this needs any new tests.

I have this marked as a draft because I am not completely happy with having to add a new parameter, and it doesn't add everything into the bin summary, but I am not sure if those are deal breakers.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • Make sure your code lints (nf-core pipelines lint).
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.

@prototaxites
Contributor

Hi, thanks for the PR - the code looks really clean! I have a couple of initial thoughts about the implementation which might be good to discuss before diving into proper review of the code itself.

> It does this by adding an extra parameter, binqc_tool_extras, where users can specify extra tools to run; their outputs are saved, but they are not used by any downstream process.

I agree with you that this seems a bit clunky! I'm not 100% sure what the best option is here, but I can think of two options off the top of my head:

  1. Have --enable_busco, --enable_checkm, --enable_checkm2 parameters that turn on these tools, and --binqc_tool is explicitly defined as the tool used downstream. This might require reworking some logic to catch when they're all disabled but --skip_binqc is still on.

  2. Have a single parameter, --binqc_tools, that takes a comma-separated list. The first listed tool is the one used downstream - we can be clear about it in the docs. It's cleaner from a parameterisation perspective, but maybe not so discoverable.

.groupTuple()
}
else {
BUSCO_BUSCO.out.batch_summary
Contributor

IMO I would prefer that these are published via the publishDir directive (and, when it arrives in nf-core, the new publishing system) rather than using collectFile

Contributor Author

Doesn't publishDir need to be inside a process? I had originally tried using calls to csvtk/concat, but had some trouble getting it to actually publish the combined summary.

@prototaxites
Contributor
Aug 6, 2025

It does (well, or in nf-core, it's defined in modules.config). The way the subworkflow currently works, everything is passed to a csvtk/concat process called CONCAT_BINQC_TSV. It should be possible either to clone this code three times (along with its config in modules.config) and emit the three summaries separately (repetitive), or to mix all the outputs from each bin QC tool into a single channel with a meta.binqc_tool key, group, and then feed that into the process for up to three iterations. Then we can filter out the right bin QC summary using the meta object.
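A minimal sketch of that second (mixing) approach, assuming hypothetical per-tool summary channels (busco_summaries etc.); CONCAT_BINQC_TSV is the csvtk/concat process named above, but its invocation here is illustrative only:

// Tag each tool's summaries with a meta.binqc_tool key, then mix and group,
// so the concat process runs once per enabled tool (up to three times)
ch_binqc_summaries = busco_summaries
    .map { summary -> [ [binqc_tool: 'busco'], summary ] }
    .mix(
        checkm_summaries.map { summary -> [ [binqc_tool: 'checkm'], summary ] },
        checkm2_summaries.map { summary -> [ [binqc_tool: 'checkm2'], summary ] }
    )
    .groupTuple() // one element per tool: [ meta, [ summary files ] ]

CONCAT_BINQC_TSV(ch_binqc_summaries)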

Contributor Author

> (and its config in modules.config)

Oh! This is why it didn't work for me originally. Ok, I changed it back and got it working.

@dialvarezs
Contributor

IMHO option 2 feels more "natural", and we can clarify that the first tool is the one used to determine which bins go into taxonomic classification, and then integrate the results from all of them into the bin summary.

@prototaxites
Contributor

> IMHO option 2 feels more "natural", and we can clarify that the first tool is the one used to determine which bins go into taxonomic classification, and then integrate the results from all of them into the bin summary.

I agree - but just want to flag up that we've explicitly avoided comma-separated lists for parameters in the past, so if we're going to do it we should be mindful about the interface

@dialvarezs
Contributor

> > IMHO option 2 feels more "natural", and we can clarify that the first tool is the one used to determine which bins go into taxonomic classification, and then integrate the results from all of them into the bin summary.
>
> I agree - but just want to flag up that we've explicitly avoided comma-separated lists for parameters in the past, so if we're going to do it we should be mindful about the interface

That's fair; we should decide if we start including list parameters, and probably refactor some parameters to be consistent if we decide to (binning comes to mind, especially if we plan to include more binners in the future).

Personally, I don't see the downsides of using list parameters: they make the implementation of new tools simpler, and the user interface is good too. Other nf-core pipelines have them; for example, sarek has a tools param (and it's huge, but we'd only have 3 options).

@harper357
Contributor Author

@prototaxites Thanks for the quick comments!

I am happy to change things if there is a consensus

Option 1 is definitely the most explicit, but it does make it more difficult to add more tools in the future compared to what I did. I don't think it makes the logic too much more difficult, but maybe it will when/if Nextflow requires params to be explicitly passed to workflows.

With option 2, I worry about using an ordered list. To me, this will make things confusing for new users, since very few options are order dependent. It would require that people really read the documentation to understand how it works, which in my experience doesn't happen as often as it should.

As much as it feels clunky, my implementation does look like a good middle ground between these two options.

But again, I will change it if others think it should be done differently.

@harper357
Contributor Author

Hmm, I tried to rebase, but I think I might have messed up the commit history, as it shows some extra changes in the file view.

@prototaxites
Contributor

> Option 1 is definitely the most explicit, but it does make it more difficult to add more tools in the future compared to what I did. I don't think it makes the logic too much more difficult, but maybe it will when/if Nextflow requires params to be explicitly passed to workflows.

I think the problem with option 1 is that there's the potential for an incompatible selection of parameters - e.g. the only bin QC tool enabled is BUSCO, but --binqc_tool can still be set to "checkm2". I'd rather avoid a situation where it's possible to specify breaking parameter combinations.
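A minimal sketch of the kind of guard option 1 would need; the --enable_* parameter names are taken from the earlier suggestion, and this is illustrative rather than code from the PR:

// Hypothetical check that the downstream tool is among the enabled ones
def enabled_binqc = [busco: params.enable_busco, checkm: params.enable_checkm, checkm2: params.enable_checkm2]
if (!enabled_binqc[params.binqc_tool]) {
    error("[nf-core/mag] ERROR: --binqc_tool is set to '${params.binqc_tool}' but --enable_${params.binqc_tool} is not set. Please check input!")
}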

> With option 2, I worry about using an ordered list. To me, this will make things confusing for new users, since very few options are order dependent. It would require that people really read the documentation to understand how it works, which in my experience doesn't happen as often as it should.

While I agree that order-dependence would be a new interaction pattern with the pipeline, I don't think it in itself is particularly problematic, if it's documented well. I imagine many users would probably just want to run a single tool, in which case the option is fine and operates how it currently does, with no change to the interface - only users needing the 'advanced' feature would need to read the documentation. And hopefully if they get it wrong, the results would not be too dramatically different in any case.

My bigger problem with option 2 is that (aside from changing the schema from a string to a comma-separated list, as discussed above) it would also represent a switch from "on-by-default" tools with --skip_X parameters to "off-by-default" tools which get enabled by specific parameters.

(I actually prefer the latter, but don't think we should mix them)

@harper357
Contributor Author

@prototaxites Just to make sure I understand you, are you suggesting that neither option 1 nor option 2 should move forward? Implying that either my current implementation (for consistency, let's call it option 0) or some new option needs to be thought of?

Because I was thinking about your message over this last weekend, and I think option 0 is still a nice middle ground. --binqc_tool_additional would be analogous to run_gunc, while not messing with --skip_binqc or --binqc_tool.

I am not sure if this is what you meant, but there are several cases of "off-by-default" tools/subworkflows in this pipeline, e.g. --ancient_dna and --run_virus_identification, but I do agree that trying to stay consistent is important.

@prototaxites
Contributor

> @prototaxites Just to make sure I understand you, are you suggesting that neither option 1 nor option 2 should move forward?

To clarify, I still largely prefer option 2 - just pointing out that it would be a change compared to the way many parameters are specified. That's a good point about there already being some --enable or --run parameters, though.

@jfy133 Wonder if you have any opinions on this?

@jfy133
Member
commented Aug 15, 2025

Finally catching up - and thank you for your patience.

Actually, I personally prefer to be more up-front and declarative. I never really liked comma-separated lists: they are more prone to user typos, and validating them means writing more custom validation code. Booleans are easier to handle with options etc. (and this is what we already do).

That said, I agree that we should not mix the two approaches, and we should be considerate of not breaking 'backwards compatibility' too much.

Given we have --bin_qc_tool 'busco' as a string selection, I think it would be more user friendly to go with option 2 and just change this to allow a comma-separated list (people with existing params.yaml files won't necessarily have to change anything).

But diverging from option 2: don't worry about ordering. We can then have a second flag to pick the exact tool to use downstream. When the number of tools in --bin_qc_tool is one, the default could be to take just that for the downstream parameter; otherwise, when it is more than one, we have validation code that errors out unless the user specifies which one should go to e.g. GTDBTk.

What does everyone think?

@jfy133
Member
commented Aug 15, 2025

Pseudo code:

params.bin_qc_tool = 'busco'

def nr_binning_tools = params.bin_qc_tool.split(',').size()
if (nr_binning_tools > 1 && !params.bin_downstream_tool) {
    error("[nf-core/mag] ERROR: You specified to run multiple bin QC tools but not which one to use downstream with --bin_downstream_tool. Please check input!")
}
// Past the check above: a single tool is used directly, otherwise the explicit choice
def binner_for_downstream = nr_binning_tools > 1 ? params.bin_downstream_tool : params.bin_qc_tool

@prototaxites
Contributor
commented Aug 15, 2025

Could we mix both approaches? Have a comma-separated list of binqc tools and a downstream parameter, but we pick the first in the list (with a visible warning!) if the downstream parameter is not explicitly set?

That way, the pipeline will run if a user forgets to specify both parameters - but notify them about the default action:

params.bin_qc_tool = 'busco'

def nr_binning_tools = params.bin_qc_tool.split(',').size()
if (nr_binning_tools > 1 && !params.bin_downstream_tool) {
    log.warn("[nf-core/mag] WARN: You specified to run multiple bin QC tools but not which one to use downstream with --bin_downstream_tool. Defaulting to the first in the list, which is ${params.bin_qc_tool.split(',')[0]}")
}
// Use the explicit choice if given, otherwise default to the first listed tool
def binner_for_downstream = params.bin_downstream_tool ?: params.bin_qc_tool.split(',')[0]

In either case, it also requires a check to make sure that bin_downstream_tool is actually in bin_qc_tool.
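A minimal sketch of that membership check, in the same pseudo-code style and with the same assumed parameter names as above:

def binqc_tools = params.bin_qc_tool.split(',').collect { it.trim() }
if (params.bin_downstream_tool && !binqc_tools.contains(params.bin_downstream_tool)) {
    error("[nf-core/mag] ERROR: --bin_downstream_tool '${params.bin_downstream_tool}' is not one of the tools given in --bin_qc_tool ('${params.bin_qc_tool}'). Please check input!")
}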

@harper357
Contributor Author

> But diverging from option 2:

> Could we mix both approaches?

Doesn't option 0 already solve these with fewer changes? I am just a little confused, because you are both talking about adding a new parameter to effectively replace binqc_tool and changing the usage of binqc_tool, while option 0 keeps binqc_tool as is and just adds an optional parameter. This maintains backwards compatibility and doesn't require that people specify a tool in multiple places.

> more prone to user typos, and validating them means writing more custom validation code...

This is already taken care of by the parameter schema, which is a "simple" regex.
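(For illustration only - this is not the actual schema from the PR - a nextflow_schema.json pattern along the lines of ^(busco|checkm|checkm2)(,(busco|checkm|checkm2))*$ would validate such a comma-separated list.)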

@prototaxites
Contributor

My problem with option 0 is that it splits the same information (which bin QC tools to run) over multiple parameters. With option 2, we specify the tools with one parameter and choose a downstream processor with another.

@harper357
Contributor Author

I get that; I just think having a contextual parameter (binqc_tool being order dependent if you don't use bin_downstream_tool and not order dependent if you do) is clunkier than having one parameter for "run and use this tool's output" and another for "run these tools but don't use the outputs".

If I am in the minority in thinking that, then I'll stop trying to convince you.

@jfy133
Member
commented Aug 16, 2025

I'm still against using ordering to guide anything. It is implicit behaviour which people will not intuitively understand based on typical Unix behaviour (at least I can't think of any tool that uses argument ordering for things like this...).

My approach is that if you want to have your cake and eat it, you need to be responsible for it. I.e., if you truly want to run multiple QC tools, you need to be responsible for picking which one to use for downstream steps - this is an important choice that could have quite big ramifications (thus what I mean by liking to be more declarative and upfront).

That said, I agree with Jim that I don't like splitting the same 'functionality' across two parameters.

We should always go from broad to specific, as you are narrowing down the analysis to be more precise in downstream steps.

I also think that if you want to resume but switch the QC tool you want to use downstream, you would only have to switch the one parameter to pick another of your choices, rather than shuffle options between the two:

--binqc_tool checkm,busco,checkm2
--downstream checkm

## to

--binqc_tool checkm,busco,checkm2
--downstream busco

vs.

--binqc_tool checkm
--extra busco,checkm2

## to

--binqc_tool busco
--extra checkm,checkm2

(Assuming I am understanding option 0 correctly.)

jfy133 and others added 28 commits October 6, 2025 17:58
…dmundmiller hack to get SPADES to work on Fusion
Co-authored-by: James A. Fellows Yates <[email protected]>
@harper357
Contributor Author

@jfy133 if you have time to work on this in the next few days, I might have to hand it off to you, as I won't have time to work on it again for a few days.

In short, I added TODOs to all the spots/files I think need updating and then started from the beginning of the pipeline, updating things. I was trying to keep all the enabled tools' summaries in a single channel so it would be easier to add more down the line.

This means that GTDBTK would need to take just a single "summaries" channel, split it out, and then process each file (see the sketch below). Slightly different than what @prototaxites suggested, but hopefully you two like the idea.
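A minimal sketch of that single-channel idea; the channel and output names are assumed for illustration and are not code from the PR:

// From the mixed, per-tool summaries produced by CONCAT_BINQC_TSV,
// keep only the tool chosen for downstream steps (e.g. GTDB-Tk)
CONCAT_BINQC_TSV.out
    .filter { meta, summary -> meta.binqc_tool == params.binqc_tool }
    .map { meta, summary -> summary }
    .set { ch_summary_for_downstream }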

Files needing updating:

  • nextflow.config
  • nextflow_schema.json
  • modules.config
  • mag.nf (partial)
  • bin_qc/main.nf
  • gtdbtk/main.nf
  • bin_summary/main.nf
  • combine_tables.py
