Feature/taxonomy submission improvement #38

Merged
Ge94 merged 22 commits into main from feature/taxonomy_submission_improvement
Oct 8, 2025

Ge94 (Member) commented Sep 19, 2025

The taxonomy extraction script has been made into an independent module. A few taxonomic rules have been added, based on use cases I had previously found in our genomes and in the NCBI taxonomy.

This also brings into main the documentation refinements that have already been reviewed for the dev branch.

Ge94 marked this pull request as ready for review September 19, 2025 17:23

KateSakharova (Collaborator) left a comment

Nice!
I have a concern about keeping that script in this repo. We strictly limited the taxonomy in genome_uploader and the docs to the NCBI type. That script has a GTDB parser... why? Are you sure it will work for GTDB? Why do we use the gtdb_to_ncbi.py converter from the GTDB-Tk repo, then? I'm a bit confused.
I think there are 2 ways:

  • limit to NCBI and remove any parsing or mention of GTDB from the repo. Add another script that parses both to the toolkit, for example.
  • accept GTDB and convert on the fly (that is not possible, I guess)

And maybe add a CLI? (the ability to pass an input file and run it outside the repo, not only inside it)

The functions ena.query_scientific_name and ena.query_taxid are used only by taxon_finder.py. I think we should move them from ena into the script directly.

Ge94 (Member, Author) commented Sep 23, 2025

Hey @KateSakharova, thanks for your comments, and thanks for catching the GTDB comments. They are remnants from when the script had the GTDB taxonomy converter inside, before it was exported to the GGP. I should have removed all instances now.

> And maybe add cli? (ability to have input file and run not only inside repo)

I might add this one, since I had to write a wrapper anyway to use it. But I'd like to write a test too if so... I'll add it to my todo list if I have 15 minutes to spare.
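
As a rough illustration of the CLI idea (not the code that was merged), a wrapper could look something like the minimal sketch below. It assumes a two-column TSV of genome name and semicolon-separated NCBI lineage; `find_taxon`, the `-i`/`-o` flags and the file layout are all hypothetical stand-ins, not the module's actual interface.

```python
import argparse
import csv
import sys
from typing import List, Optional


def find_taxon(lineage: str) -> str:
    """Hypothetical stand-in for the module's lookup logic.

    Returns the most specific non-empty name in a ';'-separated lineage.
    The real script applies more rules and queries ENA; this does not.
    """
    names = [name.strip() for name in lineage.split(";") if name.strip()]
    return names[-1] if names else ""


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Pick a submittable taxon name for each genome (sketch)."
    )
    parser.add_argument("-i", "--input", required=True,
                        help="TSV file: genome name <TAB> NCBI lineage")
    parser.add_argument("-o", "--output", default="-",
                        help="output TSV path, or '-' for stdout (default)")
    return parser


def main(argv: Optional[List[str]] = None) -> int:
    args = build_parser().parse_args(argv)

    # Read all rows up front so the input file is closed before writing.
    with open(args.input, newline="") as handle:
        rows = list(csv.reader(handle, delimiter="\t"))

    out = sys.stdout if args.output == "-" else open(args.output, "w", newline="")
    try:
        writer = csv.writer(out, delimiter="\t", lineterminator="\n")
        for row in rows:
            if len(row) != 2:  # skip blank or malformed lines
                continue
            genome, lineage = row
            writer.writerow([genome, find_taxon(lineage)])
    finally:
        if out is not sys.stdout:
            out.close()
    return 0
```

Because `main` takes an explicit argument list, a test can call it directly, for example `main(["-i", "genomes.tsv", "-o", "taxa.tsv"])`, without spawning a subprocess; exposing it on the command line would then only need an entry point or a `__main__` guard that calls `main()`.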

> Functions ena.query_scientific_name and ena.query_taxid are used only for taxon_finder.py. I think we should move those from ena to script directly.

Not sure. I thought that, since they query ena-api, it might be better context-wise to keep them there in case we want to reuse them one day?

Ge94 requested a review from KateSakharova September 23, 2025 12:17

KateSakharova (Collaborator) left a comment

I've added some changes and a minimal test, because I had to modify the script for myself anyway while preparing course materials.
I think we can merge it now; I want to unblock you on this PR. But the way we find species for submission sounds weird to me. I didn't know the uploader works like that. We will probably need to discuss it in the future to improve it.
You can add more tests if you want; it should be very easy now. I added just the ones I have.

Ge94 merged commit 1e34c3c into main Oct 8, 2025
2 checks passed
Ge94 deleted the feature/taxonomy_submission_improvement branch October 8, 2025 04:08