@mcovarr
Collaborator

@mcovarr mcovarr commented Oct 20, 2025

Automate the mapping of dropped duplicate VIDs, with docs. Lots of cleanup included, so the three steps of participant mapping were re-run over a narrow range:

@mcovarr mcovarr changed the base branch from ah_var_store to vs_1686_pseudo_vid_fixup October 20, 2025 22:14
Base automatically changed from vs_1686_pseudo_vid_fixup to ah_var_store October 21, 2025 19:14
@gatk-bot

Github actions tests reported job failures from actions build 18791251664
Failures in the following jobs:

| Test Type | JDK | Job ID | Logs |
| --- | --- | --- | --- |
| conda | 17.0.6+10 | 18791251664.3 | logs |

@mcovarr mcovarr changed the title map dropped duplicates [VS-1757] Map dropped duplicate VIDs [VS-1757] Oct 29, 2025
@mcovarr mcovarr marked this pull request as ready for review October 29, 2025 18:41
@mcovarr mcovarr requested a review from Copilot October 29, 2025 18:41

Copilot AI left a comment

Pull Request Overview

This PR automates the mapping of dropped duplicate VIDs by introducing a new workflow GvsMapDroppedDuplicateVIDs.wdl and refactoring existing code for clarity. The changes ensure that participant mappings include all samples for variant synonyms with AC != 0, not just those whose input synonym was left-aligned.

  • Adds new workflow to map dropped duplicate VIDs to participants
  • Renames files and variables for clarity (mapping_table_name → participant_mapping_table_name, script renames)
  • Updates documentation with instructions for the new mapping step

Reviewed Changes

Copilot reviewed 11 out of 19 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| scripts/variantstore/wdl/GvsUtils.wdl | Updates Docker image version for variants container |
| scripts/variantstore/variant-annotations-table/GvsMapUnmappedVIDs.wdl | Renames mapping_table_name to participant_mapping_table_name and updates script references |
| scripts/variantstore/variant-annotations-table/GvsMapDroppedDuplicateVIDs.wdl | New workflow for mapping dropped duplicate VIDs to participants |
| scripts/variantstore/variant-annotations-table/GvsCreateParticipantMappingTable.wdl | Renames mapping_table_name to participant_mapping_table_name throughout |
| scripts/variantstore/scripts/variant_annotation_table/left_alignment_fixups/map_input_alignments_to_left_alignments.py | Updates docstring to clarify script purpose |
| scripts/variantstore/scripts/variant_annotation_table/left_alignment_fixups/generate_bcftools_searches_for_variant_synonyms.py | Updates docstring, fixes start position calculation |
| scripts/variantstore/scripts/variant_annotation_table/left_alignment_fixups/README.md | Updates script references to use new filenames |
| scripts/variantstore/scripts/variant_annotation_table/dropped_duplicates/README.md | Updates script reference to use new filename |
| scripts/variantstore/scripts/Dockerfile | Updates path to reflect directory rename |
| scripts/variantstore/docs/aou/AOU_DELIVERABLES.md | Adds documentation for new dropped duplicate VID mapping step |
| .dockstore.yml | Adds new workflow registration and updates branch references |

  search_range = 200

- start = position + 1
+ start = position

Copilot AI Oct 29, 2025

Changing start = position + 1 to start = position means the search range now includes the variant's exact position. This could cause the search to match the input variant itself when it appears at that position in the VCF, potentially creating false positive matches. If the intent is to search for synonyms at different positions, the original position + 1 offset was likely correct.

Suggested change
- start = position
+ start = position + 1

Collaborator Author

@mcovarr mcovarr Oct 29, 2025

Left-aligned synonyms should not occur in unmapped VIDs, so it's true that in the unmapped VID use case we will never find any synonyms at `position`. This was the logic behind starting the search at `position + 1` when this code was only used to search for unmapped VID synonyms.

For the dropped duplicate use case, we absolutely want to consider synonyms that are already left-aligned as these are often present. Starting the search at position should work for both the unmapped VID and dropped duplicate use cases, even if the search never finds anything at the very first position for the unmapped VID use case.
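
The difference between the two start values can be sketched as follows. This is a minimal illustration, not the code in generate_bcftools_searches_for_variant_synonyms.py: the function names, the assumption that the window extends `search_range` bases to the right of `position`, and the `chrom:start-end` region string are all hypothetical. Only `search_range = 200` and the two `start` expressions come from the diff.

```python
SEARCH_RANGE = 200  # value shown in the diff; everything else here is illustrative


def search_window(position, include_position=True, search_range=SEARCH_RANGE):
    """Return an inclusive (start, end) window for a synonym search.

    Assumption: the window extends `search_range` bases to the right of
    `position`. The real script may compute its end coordinate differently.
    """
    start = position if include_position else position + 1
    end = position + search_range
    return start, end


def region_string(chrom, position, include_position=True):
    """Format the window as a chrom:start-end region (hypothetical helper)."""
    start, end = search_window(position, include_position)
    return f"{chrom}:{start}-{end}"


# Dropped duplicate use case: a synonym already at `position` is in range.
print(region_string("chr1", 1000))                          # chr1:1000-1200
# Original unmapped VID behavior: the search begins one base to the right.
print(region_string("chr1", 1000, include_position=False))  # chr1:1001-1200
```

Under these assumptions, including `position` only widens the window by one base, so the unmapped VID case sees the same candidates as before while the dropped duplicate case can also match already left-aligned synonyms.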

Collaborator

@gbggrant gbggrant left a comment

LGTM. A couple of (hopefully minor) questions.

primaryDescriptorPath: /scripts/variantstore/variant-annotations-table/GvsMapDroppedDuplicateVIDs.wdl
filters:
  branches:
    - master
Collaborator

It occurs to me most of our WDLs aren't in 'master' - but I guess there's no harm in having this here? (Or maybe I'm not understanding things).

# 13. Remove existing mappings for these duplicates from the mapping table
#
# bq query --max_rows check: ok delete
Collaborator

What is this commented-out bq query here for?

Collaborator Author

This carries on the pattern of auditing bq queries from VS-1396, motivated by that unfortunate incident in Echo where we silently truncated output to 100 results. The comment just says that this query has been audited and is okay because it returns no results; it only performs a delete.
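
For illustration, an audit like this could be checked mechanically. The sketch below is hypothetical and is not tooling from VS-1396 or this PR: the function name, both regular expressions, and the rule that the marker must sit in the comment block directly above the `bq` call are assumptions. Only the marker text itself is taken from the WDL comments quoted in this thread.

```python
import re

# Marker text as it appears in the WDL comments quoted in this thread.
AUDIT_MARKER = re.compile(r"#\s*bq query --max_rows check:\s*ok\b")
# Assumption: an invocation is a line starting with `bq` that has `query` as a later word.
BQ_QUERY = re.compile(r"^\s*bq\b.*\bquery\b")


def unaudited_bq_queries(text):
    """Return 1-based line numbers of `bq ... query` calls lacking an audit marker.

    A call counts as audited when the contiguous block of comment lines
    directly above it contains the marker.
    """
    lines = text.splitlines()
    missing = []
    for i, line in enumerate(lines):
        if line.lstrip().startswith("#") or not BQ_QUERY.match(line):
            continue
        j = i - 1
        audited = False
        # Walk upward through the comment block immediately above the call.
        while j >= 0 and lines[j].lstrip().startswith("#"):
            if AUDIT_MARKER.search(lines[j]):
                audited = True
                break
            j -= 1
        if not audited:
            missing.append(i + 1)
    return missing


sample = """\
# 13. Remove existing mappings for these duplicates from the mapping table
#
# bq query --max_rows check: ok delete
#
bq --apilog=false query --nouse_legacy_sql 'DELETE ...'

bq --apilog=false query --nouse_legacy_sql 'SELECT ...'
"""
print(unaudited_bq_queries(sample))  # [7]
```

In the sample, the first call is preceded by an audit comment and passes, while the second call on line 7 has no comment block above it and is reported.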

# 14. Write all the person mappings for these duplicate VIDs
#
# bq query --max_rows check: ok insert
Collaborator

Same Q as above

#
bq --apilog=false query --nouse_legacy_sql --project_id=~{project} --format=csv '
DELETE `~{dataset}.~{participant_mapping_table_name}` WHERE vid IN (
Collaborator

Not DELETE FROM ?

Collaborator Author

Collaborator

Sad. I like the FROM

@mcovarr mcovarr merged commit fde4fb4 into ah_var_store Nov 4, 2025
21 checks passed
@mcovarr mcovarr deleted the vs_1757_dropped_duplicates branch November 4, 2025 17:36