-
Notifications
You must be signed in to change notification settings - Fork 619
Map dropped duplicate VIDs [VS-1757] #9292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ropped_duplicates
…into vs_1757_dropped_duplicates
|
Github actions tests reported job failures from actions build 18791251664
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR automates the mapping of dropped duplicate VIDs by introducing a new workflow GvsMapDroppedDuplicateVIDs.wdl and refactoring existing code for clarity. The changes ensure that participant mappings include all samples for variant synonyms with AC != 0, not just those whose input synonym was left-aligned.
- Adds new workflow to map dropped duplicate VIDs to participants
- Renames files and variables for clarity (
mapping_table_name→participant_mapping_table_name, script renames) - Updates documentation with instructions for the new mapping step
Reviewed Changes
Copilot reviewed 11 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| scripts/variantstore/wdl/GvsUtils.wdl | Updates Docker image version for variants container |
| scripts/variantstore/variant-annotations-table/GvsMapUnmappedVIDs.wdl | Renames mapping_table_name to participant_mapping_table_name and updates script references |
| scripts/variantstore/variant-annotations-table/GvsMapDroppedDuplicateVIDs.wdl | New workflow for mapping dropped duplicate VIDs to participants |
| scripts/variantstore/variant-annotations-table/GvsCreateParticipantMappingTable.wdl | Renames mapping_table_name to participant_mapping_table_name throughout |
| scripts/variantstore/scripts/variant_annotation_table/left_alignment_fixups/map_input_alignments_to_left_alignments.py | Updates docstring to clarify script purpose |
| scripts/variantstore/scripts/variant_annotation_table/left_alignment_fixups/generate_bcftools_searches_for_variant_synonyms.py | Updates docstring, fixes start position calculation |
| scripts/variantstore/scripts/variant_annotation_table/left_alignment_fixups/README.md | Updates script references to use new filenames |
| scripts/variantstore/scripts/variant_annotation_table/dropped_duplicates/README.md | Updates script reference to use new filename |
| scripts/variantstore/scripts/Dockerfile | Updates path to reflect directory rename |
| scripts/variantstore/docs/aou/AOU_DELIVERABLES.md | Adds documentation for new dropped duplicate VID mapping step |
| .dockstore.yml | Adds new workflow registration and updates branch references |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| search_range = 200 | ||
|
|
||
| start = position + 1 | ||
| start = position |
Copilot
AI
Oct 29, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing start = position + 1 to start = position means the search range now includes the variant's exact position. This could cause the search to match the input variant itself when it appears at that position in the VCF, potentially creating false positive matches. If the intent is to search for synonyms at different positions, the original position + 1 offset was likely correct.
| start = position | |
| start = position + 1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left aligned synonyms should not occur in unmapped VIDs, so it's true that in the unmapped VID use case we will never find any synonyms at position. This was the logic behind starting the search at position + 1 when this code was only used for searching for unmapped VID synonyms.
For the dropped duplicate use case, we absolutely want to consider synonyms that are already left-aligned as these are often present. Starting the search at position should work for both the unmapped VID and dropped duplicate use cases, even if the search never finds anything at the very first position for the unmapped VID use case.
scripts/variantstore/variant-annotations-table/GvsMapDroppedDuplicateVIDs.wdl
Outdated
Show resolved
Hide resolved
…plicateVIDs.wdl Co-authored-by: Copilot <[email protected]>
gbggrant
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. A couple of (hopefully minor Qs)
| primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsMapDroppedDuplicateVIDs.wdl | ||
| filters: | ||
| branches: | ||
| - master |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It occurs to me most of our WDLs aren't in 'master' - but I guess there's no harm in having this here? (Or maybe I'm not understanding things).
| # 13. Remove existing mappings for these duplicates from the mapping table | ||
| # | ||
| # bq query --max_rows check: ok delete |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is this commented out bq query here for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is carrying on the pattern of auditing bq querys from VS-1396 motivated by that unfortunate incident in Echo where we silently truncated output to 100 results. This is just saying that this query has been audited and is okay because it's not returning results, only performing a delete.
| # 14. Write all the person mappings for these duplicate VIDs | ||
| # | ||
| # bq query --max_rows check: ok insert |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same Q as above
| # | ||
| bq --apilog=false query --nouse_legacy_sql --project_id=~{project} --format=csv ' | ||
| DELETE `~{dataset}.~{participant_mapping_table_name}` WHERE vid IN ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not DELETE FROM ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FROM appears to be optional in BigQuery syntax https://docs.cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#delete_statement
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sad. I like the FROM
Automate the mapping of dropped duplicate VIDs, with docs. Lots of cleanup included, so the three steps of participant mapping were re-run over a narrow range: