Support for running multiple bin QC tools #842
base: dev
Conversation
Hi, thanks for the PR - the code looks really clean! I have a couple of initial thoughts about the implementation which might be good to discuss before diving into proper review of the code itself.
I agree with you that this seems a bit clunky! I'm not 100% sure what the best option is here, but I can think of two options off the top of my head:
subworkflows/local/bin_qc.nf (outdated):

```nextflow
        .groupTuple()
}
else {
    BUSCO_BUSCO.out.batch_summary
```
IMO I would prefer that these are published via the publishDir directive (and, when it arrives in nf-core, the new publishing system) rather than using collectFile.
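For context, a minimal sketch of what attaching publishDir to the concatenation process might look like (the publish path and output filename here are illustrative, not from this PR; CONCAT_BINQC_TSV and csvtk/concat are named later in the thread):

```nextflow
process CONCAT_BINQC_TSV {
    // publishDir must be attached to a process; in nf-core pipelines it is
    // usually set via conf/modules.config rather than inline like this
    publishDir "${params.outdir}/GenomeBinning/QC", mode: 'copy'

    input:
    path(summaries)

    output:
    path("binqc_summary.tsv")

    script:
    """
    csvtk concat -t ${summaries} > binqc_summary.tsv
    """
}
```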
Doesn't publishDir need to be inside a process? I had originally tried using calls to csvtk/concat, but had some trouble getting it to actually publish the combined summary.
It does (well, or in nf-core, defined in modules.conf). The way the subWF currently works, it passes everything to a csvtk/concat process called CONCAT_BINQC_TSV. It should be possible to either clone this code three times (along with its config in modules.conf) and emit the three summaries separately (repetitive), or mix all the outputs from each binQC tool into a single channel with a meta.binqc_tool, group, and then feed it into the process for up to three iterations of the process. Then we can filter out the right binQC summary using the meta object.
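The mix-and-group approach could look roughly like this. BUSCO_BUSCO.out.batch_summary appears in the diff above; the other tool output channel names (CHECKM_QA.out.output, CHECKM2_PREDICT.out.checkm2_tsv, CONCAT_BINQC_TSV.out.tsv) are assumptions for illustration:

```nextflow
ch_binqc_summaries = Channel.empty()
    .mix(
        BUSCO_BUSCO.out.batch_summary.map   { meta, tsv -> [ meta + [binqc_tool: 'busco'],   tsv ] },
        CHECKM_QA.out.output.map            { meta, tsv -> [ meta + [binqc_tool: 'checkm'],  tsv ] },
        CHECKM2_PREDICT.out.checkm2_tsv.map { meta, tsv -> [ meta + [binqc_tool: 'checkm2'], tsv ] }
    )
    // key only on the tool name so all summaries from one tool group together,
    // giving up to three invocations of CONCAT_BINQC_TSV
    .map { meta, tsv -> [ [binqc_tool: meta.binqc_tool], tsv ] }
    .groupTuple()

CONCAT_BINQC_TSV(ch_binqc_summaries)

// downstream, select a specific tool's combined summary via the meta map
ch_busco_summary = CONCAT_BINQC_TSV.out.tsv
    .filter { meta, tsv -> meta.binqc_tool == 'busco' }
```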
> (and its config in modules.conf)
Oh! This is why it didn't work for me originally. Ok, I changed it back and got it working.
IMHO option 2 feels more "natural", and we can clarify that the first tool is the one used to determine which bins get into the taxonomic classification, and then integrate the results from all of them in the bin summary.
I agree - but I just want to flag up that we've explicitly avoided comma-separated lists for parameters in the past, so if we're going to do it we should be mindful about the interface.
That's fair, we should decide if we start including list parameters and probably refactor some parameters to be consistent if we decide to (binning comes to mind, especially if we plan to include more binners in the future). Personally, I don't see the downsides of using list parameters: they make the implementation of new tools simpler, and the user interface is good too. Other nf-core pipelines have them; for example, sarek has a
@prototaxites Thanks for the quick comments! I am happy to change things if there is a consensus.

Option 1 is definitely the most explicit, but it does make it more difficult to add more tools in the future compared to what I did. I don't think it makes the logic too much more difficult, but maybe it will when/if Nextflow requires params to be explicitly passed to workflows.

For option 2, I worry about using an ordered list. To me, this will make things confusing for new users, since very few options are order dependent. It would require that people really read the documentation to understand how it works, which in my experience doesn't happen as often as it should.

As much as it feels clunky, my implementation does look like a good middle ground between these two options. But again, I will change it if others think it should be done differently.
This reverts commit 57dcd71.
Force-pushed from 2869c1a to f8a7969
Hmm, I tried to rebase, but I think I might have messed up the commit history, as it shows some extra changes in the file view.
I think the problem with option 1 is that there's the potential for an incompatible selection of parameters - e.g. the only bin QC tool enabled is BUSCO, but
While I agree that order-dependence would be a new interaction pattern for the pipeline, I don't think it is in itself particularly problematic, if it's documented well. I imagine many users would probably just want to run a single tool, in which case the option is fine and operates how it currently does, with no change to the interface - only users needing the 'advanced' feature would need to read the documentation. And hopefully, if they get it wrong, the results would not be too dramatically different in any case.

My bigger problem with option 2 is that (aside from changing the schema from a string to a comma-separated list, as discussed above) it would also represent a switch from "on-by-default" tools with

(I actually prefer the latter, but don't think we should mix them)
@prototaxites Just to make sure I understand you, are you suggesting that neither option 1 nor option 2 should move forward? Implying that either my current implementation (for consistency, let's call it option 0) or some new option needs to be thought of? Because I was thinking about your message over this last weekend, and I think option 0 is still a nice middle ground.

I am not sure if this is what you meant, but there are several cases of "off-by-default" tools/subworkflows in this pipeline, e.g.
To clarify, I still largely prefer option 2 - just pointing out that it would be a change compared to the way many parameters are specified. That's a good point about there already being some

@jfy133 Wonder if you have any opinions on this?
Finally catching up - and thank you for the patience. Actually, I personally generally prefer to be more up-front and declarative. I never really liked comma-separated lists, as user-wise they are more prone to typos, and validating them means writing more custom validation code. Booleans are easier to handle with options etc. (and this is what we already do). That said, I agree that we should not mix two approaches, and we should be considerate of not breaking 'backwards compatibility' too much. Given we have

But diverging from option 2: don't worry about ordering; we can then have the second flag to pick the exact tool to use for downstream tools. When the number of binners in

What does everyone think?
Pseudo code:
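As a rough illustration of the two-flag scheme proposed above (the parameter names binqc_tools and binqc_summary_tool are hypothetical placeholders, not from this thread):

```nextflow
// Hypothetical parameter handling for a list param plus a downstream selector:
// params.binqc_tools        - comma-separated list of bin QC tools to run
// params.binqc_summary_tool - which tool's results feed downstream steps

def binqc_tools = params.binqc_tools.tokenize(',').collect { it.trim().toLowerCase() }

// default the downstream tool to the first (or only) requested tool
def summary_tool = params.binqc_summary_tool ?: binqc_tools[0]
if (binqc_tools.size() > 1 && !params.binqc_summary_tool) {
    log.warn "Multiple bin QC tools requested but no summary tool chosen; defaulting to '${summary_tool}'"
}

// sanity check: the downstream tool must be one of the tools actually run
if (!binqc_tools.contains(summary_tool)) {
    error "Summary tool '${summary_tool}' is not in the requested tools (${binqc_tools.join(', ')})"
}
```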
Could we mix both approaches? Have a comma-separated list of binqc tools and a

That way, the pipeline will run if a user forgets to specify both parameters - but notify them about the default action:

In either case, it also requires a check to make sure that
Doesn't option 0 already solve these with fewer changes? I am just a little confused because you are both talking about adding a new parameter to effectively replace
This is already taken care of by the parameter schema, which is a "simple" regex.
My problem with option 0 is that it splits the same information (which bin QC tools to run) over multiple parameters. This way, we specify the tools with one parameter and choose a downstream processor with another.
I get that, I just think having a contextual parameter (

If I am in the minority in thinking that, then I'll stop trying to convince you.
I'm still against using ordering to guide anything. It is an implicit behaviour which people will not intuitively understand based on typical Unix behaviour (at least, I can't think of any tool that uses ordering for things like this...).

My approach is that if you want to have your cake and eat it, you need to be responsible for it. I.e. if you truly want to run multiple QC tools, you need to be responsible for picking which one to use for downstream steps - this is an important choice that could have quite big ramifications (thus what I mean by liking to be more declarative and upfront).

That said, I agree with Jim that I don't like splitting the same 'functionality' across two parameters. We should always go broad to specific, as you are narrowing down the analysis to be more precise in downstream steps.

I also think that if you want to resume but switch which QC tool you want to use downstream, you would only have to switch the one parameter to pick another of your choices, rather than shuffle options between the two:
(Assuming I am understanding option 0 correctly)
…dmundmiller hack to get SPADES to work on Fusion
…re on exit code 1 for MAG_DEPTHS_PLOTS,
…ue to summarise problem
…ent identity parameters, and why
Co-authored-by: James A. Fellows Yates <[email protected]>
@jfy133 if you have time to work on this in the next few days, I might have to hand it off to you; I won't have time to work on it again for a few days.

In short, I added TODOs to all the spots/files I think need updating and then started from the beginning of the pipeline, updating things. I was trying to keep all the enabled tools' summaries in a single channel so it would be easier to add more down the line. This means that GTDBTK would need to take just a single "summaries" channel, split it out, and then process each file. Slightly different than what @prototaxites suggested, but hopefully you two like the idea.

Files needing updating:
This PR adds support for running multiple bin QC tools, fixing issue #713.
It does this by adding an extra parameter, binqc_tool_extras, where users can specify extra tools to run; their outputs are saved, but not used for any downstream process. This will probably have to be rebased after PR #841.
Not sure if this needs any new tests.
I have this marked as a draft because I am not completely happy with having to add a new parameter, and it doesn't add everything into the bin summary, but I am not sure if those are deal breakers.
PR checklist

- Make sure your code lints (nf-core pipelines lint).
- docs/output.md is updated.
- CHANGELOG.md is updated.