Add HGNC Gene ID Mapping Dataset and Utilities #52

@enriquea

Problem

Gene identifier mapping is essential for integrating datasets that use different ID schemes. Currently, hvantk lacks support for HGNC (HUGO Gene Nomenclature Committee) identifiers, which are the gold standard for stable human gene identification.

Key challenges:

  • ClinGen uses hgnc_id (e.g., HGNC:1100) as its primary gene identifier
  • Ensembl uses gene_id (e.g., ENSG00000012048)
  • Many datasets use gene_symbol (e.g., BRCA1), which can change over time or have aliases

Without a unified mapping utility, users must handle ID conversion externally, leading to potential mismatches and data loss.

Proposed Solution

Add HGNC as a new data source with:

  1. HGNC Downloader - Fetch the official HGNC dataset
  2. HGNC Table Builder - Create a Hail Table with gene ID mappings
  3. Gene ID Mapper Utility - Provide bidirectional mapping between ID types
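To make the third item concrete, here is a minimal pure-Python sketch of what a bidirectional mapper could look like. It is not the hvantk implementation: the function names `build_index` and `map_gene_id` are hypothetical, the column names (`hgnc_id`, `symbol`, `ensembl_gene_id`, `alias_symbol`, `prev_symbol`) are assumed to match the HGNC complete-set TSV, and the two inline rows are illustrative and should be checked against the official file. A real version would presumably build a Hail Table rather than Python dicts.

```python
import csv
import io

# Illustrative rows only (assumption: columns follow the HGNC complete-set TSV,
# with alias_symbol / prev_symbol holding "|"-separated values).
SAMPLE_TSV = (
    "hgnc_id\tsymbol\tensembl_gene_id\talias_symbol\tprev_symbol\n"
    "HGNC:1100\tBRCA1\tENSG00000012048\tRNF53|BRCC1\t\n"
    "HGNC:1101\tBRCA2\tENSG00000139618\tFANCD1\tFANCD|FACD\n"
)

# ID types that can be used as either source or target of a mapping.
ID_TYPES = ("hgnc_id", "symbol", "ensembl_gene_id")


def build_index(tsv_text):
    """Build one lookup dict per ID type, plus an alias/previous-symbol fallback."""
    index = {id_type: {} for id_type in ID_TYPES}
    fallback = {}
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        for id_type in ID_TYPES:
            if row[id_type]:
                index[id_type][row[id_type]] = row
        # Aliases and previous symbols resolve to the same record.
        # setdefault keeps the first record seen if a name is ambiguous.
        for col in ("alias_symbol", "prev_symbol"):
            for name in filter(None, row[col].split("|")):
                fallback.setdefault(name, row)
    return index, fallback


def map_gene_id(index, fallback, value, source, target):
    """Map `value` from ID type `source` to ID type `target`; None if unknown."""
    row = index[source].get(value)
    if row is None and source == "symbol":
        # Approved symbol not found: try aliases and previous symbols.
        row = fallback.get(value)
    if row is None or not row[target]:
        return None
    return row[target]


index, fallback = build_index(SAMPLE_TSV)
print(map_gene_id(index, fallback, "HGNC:1100", "hgnc_id", "ensembl_gene_id"))
print(map_gene_id(index, fallback, "ENSG00000139618", "ensembl_gene_id", "symbol"))
print(map_gene_id(index, fallback, "RNF53", "symbol", "hgnc_id"))
```

Resolving approved symbols first and only then falling back to aliases and previous symbols is one possible policy; ambiguous aliases (one alias shared by several genes) would need an explicit rule in a real implementation.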

HGNC Data Source

URL: https://www.genenames.org/download/statistics-and-files/
