Temp/fix clade names prompts #20

Merged
jvfe merged 15 commits into fix/clade-names-pending from temp/fix-clade-names-prompts on Feb 20, 2026

Conversation

@daniloimparato (Member) commented Feb 18, 2026

Diff to check changes to clade naming: e319c36


  1. 1_find_problems.py: Scans the original root name files for inconsistencies (e.g., non-monophyletic taxa or repeated sequences of taxa). It splits these problematic sections into individual files within the problems/ directory and hashes them for tracking.
  2. 2_fix_problems_with_entropy.py: Analyzes the problem files by calculating the entropy of repeated taxa. If the entropy is low (indicating a clear dominant taxon), it automatically generates a fix (.fix.tsv). If the entropy is high (ambiguous), it leaves the problem for further processing.
  3. 3_format_lineages.py: For problems that were not automatically fixed, this script extracts relevant lineage information from a reference file (species_lineage.tsv) and formats it into .lineage.tsv files to provide context for the Large Language Model (LLM).
  4. 4_create_prompts.py: Combines the problem data (.counts.tsv) and the lineage context (.lineage.tsv) with a template (prompt.template.txt) to generate specific prompt files (prompts/*.prompt.txt) that ask the LLM to resolve the ambiguity.
  5. 5_run_prompts.sh: A shell script that executes the generated prompts against the LLM (Gemini) and saves the model's suggested fixes.
  6. 6_apply_fixes.py: Reads the fix_map.tsv and all generated fix files (from both the entropy script and the LLM output) and applies the corrections back to the original root name files, ensuring the final output is clean and formatted correctly.
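
The entropy decision in step 2 can be sketched as follows. This is a minimal illustration, not the contents of 2_fix_problems_with_entropy.py: the function names, the threshold value, and the example taxon counts are all assumptions, since the PR description does not state them. It assumes Shannon entropy over the counts of each repeated taxon, with the dominant taxon chosen as the fix when entropy falls below a cutoff.

```python
import math
from collections import Counter

# Hypothetical cutoff in bits; the PR does not state the value the script uses.
ENTROPY_THRESHOLD = 0.5


def shannon_entropy(counts):
    """Return the Shannon entropy (in bits) of a taxon -> count mapping."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = 0.0
    for n in counts.values():
        if n > 0:
            p = n / total
            entropy -= p * math.log2(p)
    return entropy


def propose_fix(counts, threshold=ENTROPY_THRESHOLD):
    """Return the dominant taxon if entropy is below the threshold, else None.

    None stands for "ambiguous, leave for the later (LLM) steps".
    """
    if not counts:
        return None
    if shannon_entropy(counts) < threshold:
        return max(counts, key=counts.get)
    return None


# Illustrative counts only; these are not taken from the repository data.
clear = Counter({"Mammalia": 97, "Sauropsida": 3})
ambiguous = Counter({"Mammalia": 50, "Sauropsida": 50})

print(propose_fix(clear))      # low entropy: a dominant taxon is returned
print(propose_fix(ambiguous))  # high entropy: None, deferred to later steps
```

A single cutoff keeps the rule easy to audit: cases with one clearly dominant taxon are fixed automatically, and everything else is passed on to steps 3 to 5 for lineage context and LLM prompting.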

Copilot AI review requested due to automatic review settings February 18, 2026 21:07
@daniloimparato daniloimparato changed the base branch from main to fix/clade-names-pending February 18, 2026 21:08
Copilot AI (Contributor) left a comment

Pull request overview

This PR adds new clade name data files for various taxonomic identifiers and updates the .gitignore file to exclude certain generated/temporary files from version control.

Changes:

  • Addition of 200+ TSV files containing hierarchical clade name mappings for different taxonomic root identifiers
  • Update to .gitignore to exclude results and test data directories

Reviewed changes

Copilot reviewed 213 out of 215 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| db_creation/data/clade_names/*.tsv (200+ files) | New data files mapping taxonomic roots to hierarchical clade names |
| db_creation/.gitignore | Added exclusions for results directory and test-related files in pending directory |

Comment thread db_creation/data/clade_names/10036_root_names.tsv
Comment thread db_creation/data/clade_names/4081_root_names.tsv
Comment thread db_creation/data/clade_names/63405_root_names.tsv
Comment thread db_creation/data/clade_names/1182553_root_names.tsv
Comment thread db_creation/data/clade_names/113608_root_names.tsv
Comment thread db_creation/data/clade_names/1108849_root_names.tsv
Comment thread db_creation/data/clade_names/3702_root_names.tsv
Comment thread db_creation/data/clade_names/3055_root_names.tsv
Comment thread db_creation/data/clade_names/2903_root_names.tsv
Comment thread db_creation/data/clade_names/5722_root_names.tsv

Copilot AI commented Feb 19, 2026

@daniloimparato I've opened a new pull request, #21, to work on those changes. Once the pull request is ready, I'll request review from you.

@daniloimparato daniloimparato self-assigned this Feb 19, 2026
@daniloimparato (Member, Author) commented

geneplastdb lives

@jvfe jvfe merged commit ec6025c into fix/clade-names-pending Feb 20, 2026