Temp/fix clade names prompts #20
Merged
jvfe merged 15 commits into fix/clade-names-pending on Feb 20, 2026
added 12 commits
February 13, 2026 21:54
This reverts commit 07c58de.
Pull request overview
This PR adds new clade name data files for various taxonomic identifiers and updates the .gitignore file to exclude certain generated/temporary files from version control.
Changes:
- Addition of 200+ TSV files containing hierarchical clade name mappings for different taxonomic root identifiers
- Update to .gitignore to exclude results and test data directories
Reviewed changes
Copilot reviewed 213 out of 215 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| db_creation/data/clade_names/*.tsv (200+ files) | New data files mapping taxonomic roots to hierarchical clade names |
| db_creation/.gitignore | Added exclusions for results directory and test-related files in pending directory |
@daniloimparato I've opened a new pull request, #21, to work on those changes. Once the pull request is ready, I'll request review from you.
geneplastdb lives
Diff to check changes to clade naming: e319c36
- 1_find_problems.py: Scans the original root name files for inconsistencies (e.g., non-monophyletic taxa or repeated sequences of taxa). It splits these problematic sections into individual files within the problems/ directory and hashes them for tracking.
- 2_fix_problems_with_entropy.py: Analyzes the problem files by calculating the entropy of repeated taxa. If the entropy is low (indicating a clear dominant taxon), it automatically generates a fix (.fix.tsv). If the entropy is high (ambiguous), it leaves the problem for further processing.
- 3_format_lineages.py: For problems that were not automatically fixed, this script extracts relevant lineage information from a reference file (species_lineage.tsv) and formats it into .lineage.tsv files to provide context for the Large Language Model (LLM).
- 4_create_prompts.py: Combines the problem data (.counts.tsv) and the lineage context (.lineage.tsv) with a template (prompt.template.txt) to generate specific prompt files (prompts/*.prompt.txt) that ask the LLM to resolve the ambiguity.
- 5_run_prompts.sh: A shell script that executes the generated prompts against the LLM (Gemini) and saves the model's suggested fixes.
- 6_apply_fixes.py: Reads the fix_map.tsv and all generated fix files (from both the entropy script and the LLM output) and applies the corrections back to the original root name files, ensuring the final output is clean and formatted correctly.
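
For reviewers unfamiliar with the entropy step, the decision described for 2_fix_problems_with_entropy.py might look roughly like the sketch below. It is not taken from the script: the function names, the use of Shannon entropy in bits, the 0.5 threshold, and the example taxon counts are all assumptions made for illustration.

```python
import math


def shannon_entropy(counts):
    """Shannon entropy (in bits) of a mapping {taxon name: occurrence count}."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts.values():
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return entropy


def propose_fix(counts, threshold=0.5):
    """Return the dominant taxon when entropy is low, otherwise None.

    The threshold is a hypothetical value, not the one used in the PR.
    None means the problem is ambiguous and is left for a later step.
    """
    if not counts:
        return None
    if shannon_entropy(counts) < threshold:
        return max(counts, key=counts.get)
    return None


if __name__ == "__main__":
    # Illustrative counts only; not data from the repository.
    print(propose_fix({"Mammalia": 98, "Aves": 2}))
    print(propose_fix({"Mammalia": 5, "Aves": 5}))
```

With counts that are heavily skewed toward one taxon the sketch returns that taxon as the proposed fix; with an even split it returns None, which corresponds to leaving the problem for the lineage and prompt steps.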