
Conversation

@harper357
Contributor
commented Aug 5, 2025

This PR adds support for running multiple bin QC tools, fixing issue #713.

It does this by adding an extra parameter, binqc_tool_extras, where users can specify extra tools to run; their outputs are saved, but they are not used by any downstream process.

This will probably have to be rebased after PR #841.

Not sure if this needs any new tests.

I have this marked as a draft because I am not completely happy with having to add a new parameter, and it doesn't add everything into the bin summary, but I am not sure if those are deal breakers.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • Make sure your code lints (nf-core pipelines lint).
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.

@prototaxites
Contributor

Hi, thanks for the PR - the code looks really clean! I have a couple of initial thoughts about the implementation which might be good to discuss before diving into proper review of the code itself.

> It does this by adding an extra parameter, binqc_tool_extras, where users can specify extra tools to run; their outputs are saved, but they are not used by any downstream process.

I agree with you that this seems a bit clunky! I'm not 100% sure what the best option is here, but I can think of two options off the top of my head:

  1. Have --enable_busco, --enable_checkm, --enable_checkm2 parameters that turn on these tools, and --binqc_tool is explicitly defined as the tool used downstream. This might require reworking some logic to catch when they're all disabled but --skip_binqc is still on.

  2. Have a single parameter, --binqc_tools, that takes a comma-separated list. The first listed tool is the one used downstream - we can be clear about it in the docs. It's cleaner from a parameterisation perspective, but maybe not so discoverable.

.groupTuple()
}
else {
BUSCO_BUSCO.out.batch_summary
Contributor

IMO I would prefer that these are published via the publishDir directive (and, when it arrives in nf-core, the new publishing system) rather than using collectFile

Contributor Author

Doesn't publishDir need to be inside a process? I had originally tried using calls to csvtk/concat, but had some trouble getting it to actually publish the combined summary.

@prototaxites
Contributor
Aug 6, 2025

It does (well, or in nf-core, it's defined in modules.config). The way the subworkflow currently works, everything is passed to a csvtk/concat process called CONCAT_BINQC_TSV. It should be possible either to clone this code three times (along with its config in modules.config) and emit the three summaries separately (repetitive), or to mix all the outputs from each bin QC tool into a single channel with a meta.binqc_tool key, group, and then feed that into the process for up to three iterations. Then we can filter out the right bin QC summary using the meta object.
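A minimal sketch of that second (mixing) approach, assuming hypothetical per-tool summary channels (busco_summaries etc.); CONCAT_BINQC_TSV is the csvtk/concat process named above, but its invocation here is illustrative only:

// Tag each tool's summaries with a meta.binqc_tool key, then mix and group,
// so the concat process runs once per enabled tool (up to three times)
ch_binqc_summaries = busco_summaries
    .map { summary -> [ [binqc_tool: 'busco'], summary ] }
    .mix(
        checkm_summaries.map { summary -> [ [binqc_tool: 'checkm'], summary ] },
        checkm2_summaries.map { summary -> [ [binqc_tool: 'checkm2'], summary ] }
    )
    .groupTuple() // one element per tool: [ meta, [ summary files ] ]

CONCAT_BINQC_TSV(ch_binqc_summaries)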

Contributor Author

> (and its config in modules.config)

Oh! This is why it didn't work for me originally. Ok, I changed it back and got it working.

@dialvarezs
Contributor

IMHO option 2 feels more "natural", and we can clarify that the first tool is the one used to determine which bins go into taxonomic classification, and then integrate the results from all of them into the bin summary.

@prototaxites
Contributor

> IMHO option 2 feels more "natural", and we can clarify that the first tool is the one used to determine which bins go into taxonomic classification, and then integrate the results from all of them into the bin summary.

I agree - but just want to flag up that we've explicitly avoided comma-separated lists for parameters in the past, so if we're going to do it we should be mindful about the interface

@dialvarezs
Contributor

> > IMHO option 2 feels more "natural", and we can clarify that the first tool is the one used to determine which bins go into taxonomic classification, and then integrate the results from all of them into the bin summary.
>
> I agree - but just want to flag up that we've explicitly avoided comma-separated lists for parameters in the past, so if we're going to do it we should be mindful about the interface

That's fair; we should decide if we start including list parameters, and probably refactor some parameters to be consistent if we decide to (binning comes to mind, especially if we plan to include more binners in the future).

Personally, I don't see the downsides of using list parameters: they make the implementation of new tools simpler, and the user interface is good too. Other nf-core pipelines have them; for example, sarek has a tools param (and it's huge, but we'd only have 3 options).

@harper357
Contributor Author

@prototaxites Thanks for the quick comments!

I am happy to change things if there is a consensus

Option 1 is definitely the most explicit, but it does make it more difficult to add more tools in the future compared to what I did. I don't think it makes the logic too much more difficult, but maybe it will when/if Nextflow requires params to be explicitly passed to workflows.

With option 2, I worry about using an ordered list. To me, this will make things confusing for new users, since very few options are order dependent. It would require that people really read the documentation to understand how it works, which in my experience doesn't happen as often as it should.

As much as it feels clunky, my implementation does look like a good middle ground between these two options.

But again, I will change it if others think it should be done differently.

@harper357
Contributor Author

Hmm, I tried to rebase, but I think I might have messed up the commit history, as it shows some extra changes in the file view.

@prototaxites
Contributor

> Option 1 is definitely the most explicit, but it does make it more difficult to add more tools in the future compared to what I did. I don't think it makes the logic too much more difficult, but maybe it will when/if Nextflow requires params to be explicitly passed to workflows.

I think the problem with option 1 is that there's the potential for an incompatible selection of parameters - e.g. the only bin QC tool enabled is BUSCO, but --binqc_tool can still be set to "checkm2". I'd rather avoid a situation where it's possible to specify breaking parameter combinations.
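A minimal sketch of the kind of guard option 1 would need; the --enable_* parameter names are taken from the earlier suggestion, and this is illustrative rather than code from the PR:

// Hypothetical check that the downstream tool is among the enabled ones
def enabled_binqc = [busco: params.enable_busco, checkm: params.enable_checkm, checkm2: params.enable_checkm2]
if (!enabled_binqc[params.binqc_tool]) {
    error("[nf-core/mag] ERROR: --binqc_tool is set to '${params.binqc_tool}' but --enable_${params.binqc_tool} is not set. Please check input!")
}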

> With option 2, I worry about using an ordered list. To me, this will make things confusing for new users, since very few options are order dependent. It would require that people really read the documentation to understand how it works, which in my experience doesn't happen as often as it should.

While I agree that order-dependence would be a new interaction pattern with the pipeline, I don't think it in itself is particularly problematic, if it's documented well. I imagine many users would probably just want to run a single tool, in which case the option is fine and operates how it currently does, with no change to the interface - only users needing the 'advanced' feature would need to read the documentation. And hopefully if they get it wrong, the results would not be too dramatically different in any case.

My bigger problem with option 2 is that (aside from changing the schema from a string to a comma-separated list, as discussed above) it would also represent a switch from "on-by-default" tools with --skip_X parameters to "off-by-default" tools which get enabled by specific parameters.

(I actually prefer the latter, but don't think we should mix them)

@harper357
Contributor Author

@prototaxites Just to make sure I understand you, are you suggesting that neither option 1 nor option 2 should move forward? Implying that either my current implementation (for consistency, let's call it option 0) or some new option needs to be thought of?

Because I was thinking about your message over this last weekend, and I think option 0 is still a nice middle ground. --binqc_tool_additional would be analogous to run_gunc, while not messing with --skip_binqc or --binqc_tool.

I am not sure if this is what you meant, but there are several cases of "off-by-default" tools/subworkflows in this pipeline, e.g. --ancient_dna and --run_virus_identification, but I do agree that trying to stay consistent is important.

@prototaxites
Contributor

> @prototaxites Just to make sure I understand you, are you suggesting that neither option 1 nor option 2 should move forward?

To clarify, I still largely prefer option 2 - just pointing out that it would be a change compared to the way many parameters are specified. That's a good point about there already being some --enable or --run parameters, though.

@jfy133 Wonder if you have any opinions on this?

@jfy133
Member
commented Aug 15, 2025

Finally catching up - and thank you for your patience.

Actually, I personally prefer to be more up-front and declarative. I never really liked comma-separated lists: they are more prone to user typos, and validating them means writing more custom validation code. Booleans are easier to handle with options etc. (and this is what we already do).

That said, I agree that we should not mix the two approaches, and we should be considerate of not breaking 'backwards compatibility' too much.

Given we have --bin_qc_tool 'busco' as a string selection, I think it would be more user friendly to go with option 2 and just change this to allow a comma-separated list (people with existing params.yaml files won't necessarily have to change anything).

But diverging from option 2: don't worry about ordering. We can then have a second flag to pick the exact tool to use downstream. When the number of tools in --bin_qc_tool is one, the default could be to take just that for the downstream parameter; otherwise, when it is more than one, we have validation code that errors out unless the user specifies which one should go to e.g. GTDBTk.

What does everyone think?

@jfy133
Member
commented Aug 15, 2025

Pseudo code:

params.bin_qc_tool = 'busco'

def nr_binning_tools = params.bin_qc_tool.split(',').size()
if (nr_binning_tools > 1 && !params.bin_downstream_tool) {
    error("[nf-core/mag] ERROR: You specified to run multiple bin QC tools but not which one to use downstream with --bin_downstream_tool. Please check input!")
}
// Past the check above: a single tool is used directly, otherwise the explicit choice
def binner_for_downstream = nr_binning_tools > 1 ? params.bin_downstream_tool : params.bin_qc_tool

@prototaxites
Contributor
commented Aug 15, 2025

Could we mix both approaches? Have a comma-separated list of binqc tools and a downstream parameter, but we pick the first in the list (with a visible warning!) if the downstream parameter is not explicitly set?

That way, the pipeline will run if a user forgets to specify both parameters - but notify them about the default action:

params.bin_qc_tool = 'busco'

def nr_binning_tools = params.bin_qc_tool.split(',').size()
if (nr_binning_tools > 1 && !params.bin_downstream_tool) {
    log.warn("[nf-core/mag] WARN: You specified to run multiple bin QC tools but not which one to use downstream with --bin_downstream_tool. Defaulting to the first in the list, which is ${params.bin_qc_tool.split(',')[0]}")
}
// Use the explicit choice if given, otherwise default to the first listed tool
def binner_for_downstream = params.bin_downstream_tool ?: params.bin_qc_tool.split(',')[0]

In either case, it also requires a check to make sure that bin_downstream_tool is actually in bin_qc_tool.
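A minimal sketch of that membership check, in the same pseudo-code style and with the same assumed parameter names as above:

def binqc_tools = params.bin_qc_tool.split(',').collect { it.trim() }
if (params.bin_downstream_tool && !binqc_tools.contains(params.bin_downstream_tool)) {
    error("[nf-core/mag] ERROR: --bin_downstream_tool '${params.bin_downstream_tool}' is not one of the tools given in --bin_qc_tool ('${params.bin_qc_tool}'). Please check input!")
}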

@harper357
Contributor Author

> But diverging from option 2:

> Could we mix both approaches?

Doesn't option 0 already solve these with fewer changes? I am just a little confused, because you are both talking about adding a new parameter to effectively replace binqc_tool and changing the usage of binqc_tool, while option 0 keeps binqc_tool as is and just adds an optional parameter. This maintains backwards compatibility and doesn't require that people specify a tool in multiple places.

> more prone to user typos, and validating them means writing more custom validation code...

This is already taken care of by the parameter schema, which is a "simple" regex.
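(For illustration only - this is not the actual schema from the PR - a nextflow_schema.json pattern along the lines of ^(busco|checkm|checkm2)(,(busco|checkm|checkm2))*$ would validate such a comma-separated list.)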

@prototaxites
Contributor

My problem with option 0 is that it splits the same information (which bin QC tools to run) over multiple parameters. With option 2, we specify the tools with one parameter and choose a downstream processor with another.

@harper357
Contributor Author

I get that; I just think having a contextual parameter (binqc_tool being order dependent if you don't use bin_downstream_tool and not order dependent if you do) is clunkier than having one parameter for "run and use this tool's output" and another for "run these tools but don't use the outputs".

If I am in the minority in thinking that, then I'll stop trying to convince you.

@jfy133
Member
commented Aug 16, 2025

I'm still against using ordering to guide anything. It is implicit behaviour which people will not intuitively understand based on typical Unix behaviour (at least I can't think of any tool that uses argument ordering for things like this...).

My approach is that if you want to have your cake and eat it, you need to be responsible for it. I.e., if you truly want to run multiple QC tools, you need to be responsible for picking which one to use for downstream steps - this is an important choice that could have quite big ramifications (thus what I mean by liking to be more declarative and upfront).

That said, I agree with Jim that I don't like splitting the same 'functionality' across two parameters.

We should always go from broad to specific, as you are narrowing down the analysis to be more precise in downstream steps.

I also think that if you want to resume but switch the QC tool you want to use downstream, you would only have to switch the one parameter to pick another of your choices, rather than shuffle options between the two:

--binqc_tool checkm,busco,checkm2
--downstream checkm

## to

--binqc_tool checkm,busco,checkm2
--downstream busco

vs.

--binqc_tool checkm
--extra busco,checkm2

## to

--binqc_tool busco
--extra checkm,checkm2

(Assuming I am understanding option 0 correctly.)

jfy133 and others added 28 commits October 6, 2025 17:58
…dmundmiller hack to get SPADES to work on Fusion
Co-authored-by: James A. Fellows Yates <[email protected]>
@harper357
Contributor Author

@jfy133 if you have time to work on this in the next few days, I might have to hand it off to you, as I won't have time to work on it again for a few days.

In short, I added TODOs to all the spots/files I think need updating and then started from the beginning of the pipeline, updating things. I was trying to keep all the enabled tools' summaries in a single channel so it would be easier to add more down the line.

This means that GTDBTK would need to take just a single "summaries" channel, split it out, and then process each file (see the sketch below). Slightly different than what @prototaxites suggested, but hopefully you two like the idea.
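A minimal sketch of that single-channel idea; the channel and output names are assumed for illustration and are not code from the PR:

// From the mixed, per-tool summaries produced by CONCAT_BINQC_TSV,
// keep only the tool chosen for downstream steps (e.g. GTDB-Tk)
CONCAT_BINQC_TSV.out
    .filter { meta, summary -> meta.binqc_tool == params.binqc_tool }
    .map { meta, summary -> summary }
    .set { ch_summary_for_downstream }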

Files needing updating:

  • nextflow.config
  • nextflow_schema.json
  • modules.config
  • mag.nf (partial)
  • bin_qc/main.nf
  • gtdbtk/main.nf
  • bin_summary/main.nf
  • combine_tables.py
