[MaveQuest] Update Source Data

Jochen Weile edited this page Jun 5, 2023 · 1 revision

Resources

Prepare repository

  1. Check out the mavequest-datasources and the mavequest-importer repositories:

      gh repo clone rothlab/mavequest-datasources
      gh repo clone rothlab/mavequest-importer

  2. Set mavequest-datasources as the working directory

Update Gene Info

  1. Go to geneInfo folder

  2. Prepare the HGNC gene set. Download the current release from http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/locus_types/gene_with_protein_product.txt.

  3. Open parseHGNC.R

  4. Make sure the Global Variables are up-to-date (access date)

    1. HGNC_DATABASE_FILE_PATH points to the HGNC complete gene set that was downloaded in step 2.

    2. OUTPUT_FILE_PATH is the name of the output file.

    3. CACHED_CANONICAL_FILE_PATHS points to the cached canonical isoforms from the Ensembl and UniProt databases.

      • We recommend against using the cached isoform files. Each time you update this data source, you should fetch fresh canonical isoforms. However, because querying both databases takes ~2 hours, using the cached isoform files is faster if you need to debug this step.
      • If you do not want to use a cached file for a certain database, set the corresponding element to NA. For example, if you do not want to use cached Ensembl results, set the variable to:

      CACHED_CANONICAL_FILE_PATHS = c("ensembl" = NA, "uniprot" = "cached_file.rds")

  5. Run parseHGNC.R

  6. Update the version and access date of HGNC in ../mavequest-importer/databaseVersions.json
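
The last step of this section, and of most sections below, edits ../mavequest-importer/databaseVersions.json by hand. As a minimal sketch, the same edit could be scripted in Python. The layout of databaseVersions.json is not described on this page, so the structure assumed here (a JSON object keyed by database name, with `version`, `releaseDate` and `accessDate` keys per entry) and the helper name `update_database_entry` are hypothetical; adapt them to the actual file.

```python
import json
from datetime import date
from pathlib import Path


def update_database_entry(json_path, database, version=None,
                          release_date=None, access_date=None):
    """Update one database entry in a versions file and return the entry.

    ASSUMPTION: the file is a JSON object keyed by database name, and each
    entry uses the keys "version", "releaseDate" and "accessDate". These key
    names are hypothetical; check them against the real databaseVersions.json.
    """
    path = Path(json_path)
    versions = json.loads(path.read_text(encoding="utf-8"))

    # Create the entry if it is missing, otherwise update it in place.
    entry = versions.setdefault(database, {})
    if version is not None:
        entry["version"] = version
    if release_date is not None:
        entry["releaseDate"] = release_date
    # Default the access date to today in ISO 8601 form (YYYY-MM-DD).
    entry["accessDate"] = access_date or date.today().isoformat()

    path.write_text(json.dumps(versions, indent=2) + "\n", encoding="utf-8")
    return entry
```

Other entries in the file are left untouched. Because `setdefault` creates a missing entry, a misspelled database name adds a new entry instead of failing, so review the diff before committing.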

Update Ambry Genetics Test Catalog

  1. Go to ambrygen folder
  2. Open scrapeAmbry.R
  3. Make sure the Global Variables are up-to-date
    1. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    2. OUTPUT_FILE_PATH is the name of the output file.
  4. Run scrapeAmbry.R
  5. Update the access date of Ambry Test Catalog in ../mavequest-importer/databaseVersions.json
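
The "Make sure the Global Variables are up-to-date" step recurs in every section. A minimal Python sketch of a pre-flight check for it follows. It assumes that the R scripts assign each path as a single string literal (`NAME <- "path"` or `NAME = "path"`), which this page does not confirm; vector-valued variables such as CACHED_CANONICAL_FILE_PATHS are not matched. The function name `missing_input_files` is hypothetical.

```python
import re
from pathlib import Path

# Matches assignments such as:  SOME_FILE_PATH <- "some/input.tsv"
# ASSUMPTION: each path is a single string literal assigned at the start of a line.
ASSIGNMENT = re.compile(
    r"""^\s*([A-Za-z][A-Za-z0-9_.]*_FILE_PATH)\s*(?:<-|=)\s*["']([^"']+)["']""",
    re.MULTILINE,
)


def missing_input_files(script_path):
    """Return (variable, path) pairs whose input file does not exist."""
    script = Path(script_path)
    text = script.read_text(encoding="utf-8")
    missing = []
    for name, value in ASSIGNMENT.findall(text):
        if name.startswith("OUTPUT"):
            continue  # output files are created by the script itself
        # Relative paths are resolved against the folder containing the script.
        if not (script.parent / value).exists():
            missing.append((name, value))
    return missing
```

Calling `missing_input_files("scrapeAmbry.R")` from the ambrygen folder would list any input variables that still point to files that are not there, for example a gene info file that has not been regenerated yet.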

Update Cancer Gene Census

  1. Go to cancer_census folder

  2. Prepare the Cancer Gene Census dataset. Download the current release from https://cancer.sanger.ac.uk/census using the "Export CSV" function (image below). You will need a COSMIC account to download data.

    [Image: the "Export CSV" function on the Cancer Gene Census page]

  3. Open parseCancerCensus.R

  4. Make sure the Global Variables are up-to-date

    1. CANCER_CENSUS_FILE_PATH points to the Cancer Census dataset that was downloaded in step 2.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parseCancerCensus.R

  6. Update the version, release date and access date of Cancer Gene Census in ../mavequest-importer/databaseVersions.json

    1. You can find the version and release date on the COSMIC front page: https://cancer.sanger.ac.uk/cosmic

      [Image: version and release date on the COSMIC front page]

Update ClinVar

  1. Go to clinvar folder
  2. Prepare the ClinVar variant set. Download and unzip the current release from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz.
  3. Open parseClinvar.R
  4. Make sure the Global Variables are up-to-date
    1. CLINVAR_FILE_PATH points to the ClinVar dataset that was downloaded in step 2.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parseClinvar.R
  6. Update the version and access date of ClinVar in ../mavequest-importer/databaseVersions.json
    1. You can find the version (for ClinVar, it’s the month and year of the release) in the release notes folder: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/release_notes/
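
Step 2 of this section (download and unzip) can also be scripted. The following Python sketch uses only the standard library. The function names `download` and `gunzip` are hypothetical helpers; the URL is the one given in step 2.

```python
import gzip
import shutil
import urllib.request
from pathlib import Path

CLINVAR_URL = "https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz"


def download(url, dest):
    """Stream the content at url into dest and return dest as a Path."""
    dest = Path(dest)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def gunzip(archive, dest=None):
    """Decompress a .gz file; by default write next to it without the .gz suffix."""
    archive = Path(archive)
    dest = Path(dest) if dest is not None else archive.with_suffix("")
    with gzip.open(archive, "rb") as src, open(dest, "wb") as out:
        shutil.copyfileobj(src, out)
    return dest


# Usage, run from the clinvar folder (performs the actual network download):
#   gunzip(download(CLINVAR_URL, "variant_summary.txt.gz"))
```

The compressed file is kept after decompression, which makes it easy to confirm later which release was used.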

Update GeneDx Test Catalog

  1. Go to genedx folder
  2. Open scrapeGeneDx.R
  3. Make sure the Global Variables are up-to-date
    1. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    2. OUTPUT_FILE_PATH is the name of the output file.
  4. Run scrapeGeneDx.R
  5. Update the access date of GeneDx Test Catalog in ../mavequest-importer/databaseVersions.json

Update GenomeCRISPR

As of Jan 2022, GenomeCRISPR is no longer updated by its maintainer and the download link is broken, so we use a previously downloaded version. We still need to run the update script to match genes with the new indices (generated in the Update Gene Info step).

Please still check the GenomeCRISPR website to see if there is a new version: http://genomecrispr.dkfz.de/

If there is no new version, we do not need to update databaseVersions.json.

  1. Go to genome_crispr folder
  2. Open parseGenomeCRISPRDB.R
  3. Make sure the Global Variables are up-to-date
    1. GENOMECRISPR_FULL_FILE_PATH points to the full GenomeCRISPR database that was released in May 2017.
    2. GENOMECRISPR_ADDITIONAL_FILE_PATH points to the additional dataset that was shared to us by the authors in Sept 2019.
    3. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    4. OUTPUT_HIT_FILE_PATH and OUTPUT_HITSUM_FILE_PATH are the names of the output files.
  4. Run parseGenomeCRISPRDB.R

Update GenomeRNAi

As of Jan 2022, GenomeRNAi is no longer updated by its maintainer and the download link is broken, so we use a previously downloaded version. We still need to run the update script to match genes with the new indices (generated in the Update Gene Info step).

Please still check the GenomeRNAi website to see if there is a new version: http://www.genomernai.org/

If there is no new version, we do not need to update databaseVersions.json.

  1. Go to genome_rnai folder
  2. Open parseGenomeRNAi.R
  3. Make sure the Global Variables are up-to-date
    1. RNAi_FILE_PATH points to the full GenomeRNAi database that was released in May 2017.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  4. Run parseGenomeRNAi.R

Update Human Protein-Protein Interactions

We are using human protein-protein interactions from the Human Interactome Project (HIP, http://www.interactome-atlas.org/). The only exception is the literature curated set Lit-BM. Because the Lit-BM data on the HIP website do not include essential metadata (e.g. type of interaction, source, discovery method), we used a Lit-BM file (Lit-BM-17) that was provided to us by HIP maintainers.

Since the HuRI dataset was released in 2021, the maintainers have not added any new interactions. Please still check the HuRI website to see if new data has been released: http://www.interactome-atlas.org/

If there is no new data, we do not need to update databaseVersions.json.

  1. Go to huri folder
  2. Prepare the HIP dataset. Download HuRI.psi and HI-union.psi from http://www.interactome-atlas.org/download.
  3. Open parseHuRI.R
  4. Make sure the Global Variables are up-to-date
    1. HORF71_FILE_PATH and HORF81_FILE_PATH point to the human ORFeome datasets which are required to map ORF IDs to Gene Symbol for some datasets (e.g. HuRI.psi).
    2. MISSING_ORF_IDS_FILE_PATH and MISSING_GENE_SYMBOLS_FILE_PATH point to manually mapped missing ORF IDs and Gene Symbols. These files help to map ORF IDs and Gene Symbols that cannot be mapped automatically.
    3. HURI_FILE_PATH, HI_UNION_FILE_PATH and LITMB_FILE_PATH point to the HIP datasets: HuRI.psi and HI-union.psi downloaded in step 2, and the Lit-BM file provided by the HIP maintainers.
    4. PUBMED_API_ACCESS_KEY is set to the API access key from PubMed. The API key is required to query the NCBI API. See the documentation here: https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us
    5. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    6. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parseHuRI.R

Update InterPro database

  1. Go to interpro folder
  2. Open getInterpro.R
  3. Make sure the Global Variables are up-to-date
    1. API_GENE and API_ENTRY_INFO point to the InterPro API endpoints.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  4. Run getInterpro.R
  5. Update version, release and access dates of InterPro in ../mavequest-importer/databaseVersions.json

Update Invitae Testing Catalog

  1. Go to invitae folder
  2. Open scrapeInvitae.R
  3. Make sure the Global Variables are up-to-date
    1. INVITAE_CATALOGS, INVITAE_PREFIX and INVITAE_TEST_PREFIX point to Invitae website pages that we will crawl.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  4. Run scrapeInvitae.R
  5. Update the access date of Invitae Testing Catalog in ../mavequest-importer/databaseVersions.json

Update MaveDB entries

  1. Go to mavedb folder
  2. Open curateMaveDB.R
  3. Make sure the Global Variables are up-to-date
    1. API_MAVEDB points to the MaveDB API.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  4. Run curateMaveDB.R
  5. Update the version, release date and access date of MaveDB in ../mavequest-importer/databaseVersions.json

Update OGEE (Online GEne Essentiality) database

As of Jan 2022, the most recent version of OGEE (version 3) no longer provides a breakdown of human essential genes with associated studies. However, because version 3 does not include any new human essentiality studies, we can simply use the cached version 2 OGEE data dump.

You should still check the OGEE website in case new data is released: https://v3.ogee.info/#/home

If there is no new data, there is no need to update databaseVersions.json.

  1. Go to ogee folder
  2. Open parseOGEE.R
  3. Make sure the Global Variables are up-to-date
    1. OGEE_GENES_FILE_PATH and OGEE_STUDIES_FILE_PATH point to the OGEE data dump.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_GENES_FILE_PATH and OUTPUT_STUDIES_FILE_PATH point to the output files.
  4. Run parseOGEE.R

Update OMIM (Online Mendelian Inheritance in Man) database

As of Jan 2022, OMIM requires an account to download data. Make sure you submit your data request at https://omim.org/downloads. Once your data request is approved, you will receive an email from OMIM with the link to download the dataset (genemap2.txt).

  1. Go to omim folder
  2. Prepare the OMIM dataset by downloading genemap2.txt using the personalized link sent to you by the OMIM team.
  3. Open parseOMIM.R
  4. Make sure the Global Variables are up-to-date
    1. OMIM_FILE_PATH points to the OMIM dataset downloaded in Step 2.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parseOMIM.R
  6. Update the version, release date and access date of OMIM in ../mavequest-importer/databaseVersions.json

Update Orphanet

  1. Go to orphanet folder
  2. Prepare the Orphanet rare disease set. Download:
    1. genes associated with rare diseases (en_product6.xml) from http://www.orphadata.org/data/xml/en_product6.xml.
    2. rare disease prevalence (en_product9_prev.xml) from http://www.orphadata.org/data/xml/en_product9_prev.xml.
  3. Open parseOrphanet.R
  4. Make sure the Global Variables are up-to-date
    1. ORPHANET_GENES_FILE_PATH and ORPHANET_DISEASES_FILE_PATH point to the Orphanet dataset downloaded in Step 2.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parseOrphanet.R
  6. Update the version, release date and access date of Orphanet in ../mavequest-importer/databaseVersions.json

Update AGR (Alliance of Genome Resources)

  1. Go to orthology/agr folder
  2. Prepare AGR dataset. Download:
    1. Human gene descriptions (tsv format) from https://www.alliancegenome.org/downloads#gene-descriptions.

      [Image: human gene descriptions download on the AGR downloads page]

    2. Orthology (tsv format) from https://www.alliancegenome.org/downloads#orthology.

  3. Open parseAGR.R
  4. Make sure the Global Variables are up-to-date
    1. AGR_GENE_DESCRIPTION_FILE_PATH and AGR_ORTHOLOGY_FILE_PATH point to the AGR dataset downloaded in Step 2.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parseAGR.R
  6. Update the version, release date and access date of AGR in ../mavequest-importer/databaseVersions.json

Update InParanoid dataset

As of Jan 2022, InParanoid is no longer updated, so we do not need to download new data. You should still check the InParanoid website in case new data is released: https://inparanoid.sbc.su.se/cgi-bin/index.cgi

If there is no new data, there is no need to update databaseVersions.json.

  1. Go to orthology/inparanoid folder
  2. Open parseInparanoid.R
  3. Make sure the Global Variables are up-to-date
    1. INPARANOID_SCEREVISIAE_FILE_PATH and INPARANOID_SPOMBE_FILE_PATH point to the InParanoid datasets.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  4. Run parseInparanoid.R

Manually add homologs from papers

As of Jan 2022, we add homologs from two papers: Yang et al., 2017 and Hamza et al., 2020. Two other papers (Kachroo et al., 2015 and Sun et al., 2016) were considered but not added because they are already in SGD. Note: make sure OpenJDK is installed and that the JAVA_HOME environment variable points to the JDK directory.
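
The JAVA_HOME requirement can be checked before running the script. A minimal Python sketch follows, assuming a standard JDK layout in which the `java` executable lives in the `bin` subfolder of JAVA_HOME; the function name `check_java_home` is hypothetical.

```python
import os
from pathlib import Path


def check_java_home(environ=os.environ):
    """Return JAVA_HOME as a Path, or raise RuntimeError if it looks wrong."""
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        raise RuntimeError("JAVA_HOME is not set")
    bin_dir = Path(java_home) / "bin"
    # ASSUMPTION: the executable is bin/java (Linux, macOS) or bin/java.exe (Windows).
    if not any((bin_dir / name).is_file() for name in ("java", "java.exe")):
        raise RuntimeError(f"No java executable found in {bin_dir}")
    return Path(java_home)
```

Passing the environment as an argument keeps the check easy to try against a different JDK location without changing the shell configuration.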

  1. Go to orthology/papers folder
  2. Open parseCompFromPapers.R
  3. Make sure the Global Variables are up-to-date
    1. PAPERS_FILE_PATHS points to homologs from papers.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  4. Run parseCompFromPapers.R

Update PomBase

  1. Go to orthology/pombase folder
  2. Prepare the PomBase ortholog set. Download and unzip the current release from https://www.pombase.org/data/orthologs/human-orthologs.txt.gz.
  3. Open parsePombase.R
  4. Make sure the Global Variables are up-to-date
    1. POMBASE_FILE_PATH points to the PomBase dataset that was downloaded in step 2.
    2. API_POMBASE points to the PomBase API endpoint.
    3. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    4. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parsePombase.R
  6. Update the release and access dates of PomBase in ../mavequest-importer/databaseVersions.json

Update P-POD (Princeton Protein Orthology Database)

As of Jan 2022, P-POD is no longer updated, so we do not need to download new data. You should still check the P-POD website in case new data is released: http://ppod.princeton.edu/

If there is no new data, there is no need to update databaseVersions.json.

  1. Go to orthology/ppod folder
  2. Open parsePpod.R
  3. Make sure the Global Variables are up-to-date
    1. PPOD_FILE_PATH points to the P-POD dataset.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  4. Run parsePpod.R

Update SGD (Saccharomyces Genome Database)

  1. Go to orthology/sgd folder
  2. Prepare the SGD ortholog set. Download the current release from http://sgd-archive.yeastgenome.org/curation/literature/functional_complementation.tab.
  3. Open parseSGD.R
  4. Make sure the Global Variables are up-to-date
    1. SGD_FILE_PATH points to the SGD dataset that was downloaded in step 2.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parseSGD.R

Combine homology data sources

  1. Go to orthology folder
  2. Open mergeOrthology.R
  3. Make sure the Global Variables are up-to-date
    1. ORTHO_FILE_PATHs point to homology data sources.
    2. OUTPUT_FILE_PATH is the name of the output file.
  4. Run mergeOrthology.R

Manually add over-expression datasets from papers

As of Jan 2022, we add over-expression datasets from five papers: Liu et al., 2007, Gilbert et al., 2014, Konermann et al., 2015, Horlbeck et al., 2016 and Duffy et al., 2016.

  1. Go to overexpression folder
  2. Open parseOverexpression.R
  3. Make sure the Global Variables are up-to-date
    1. OVEREXPRESSION_FILE_PATH points to the over-expression dataset.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  4. Run parseOverexpression.R

Update PharmGKB database

  1. Go to pharmgkb folder
  2. Prepare PharmGKB dataset. Download:
    1. Variant annotation summary from https://api.pharmgkb.org/v1/download/file/data/variantAnnotations.zip. Unzip and copy the var_pheno_ann.tsv file to the folder.
    2. Clinical variant data from https://api.pharmgkb.org/v1/download/file/data/clinicalVariants.zip. Unzip and copy the clinicalVariants.tsv file to the folder.
    3. Drug label annotations from https://api.pharmgkb.org/v1/download/file/data/drugLabels.zip. Unzip and copy the drugLabels.byGene.tsv file to the folder.
  3. Open parsePharmGKB.R
  4. Make sure the Global Variables are up-to-date
    1. PGKB_VAR_ANNOTATIONS_FILE_PATH, PGKB_CLIN_VARIANTS_FILE_PATH and PGKB_DRUG_LABELS_FILE_PATH point to the PharmGKB dataset downloaded in Step 2.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_FILE_PATH is the name of the output file.
  5. Run parsePharmGKB.R
  6. Update the release and access dates of PharmGKB in ../mavequest-importer/databaseVersions.json
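
Step 2 of this section repeats the same unzip-and-copy action for three archives. A minimal Python sketch of that action follows. The helper name `extract_member` is hypothetical; the archive and file names are those listed in step 2. It matches files by base name, on the assumption that each .tsv may sit either at the top level of its archive or inside a subfolder.

```python
import shutil
import zipfile
from pathlib import Path

# Archive name -> file to keep, as listed in step 2.
PHARMGKB_FILES = {
    "variantAnnotations.zip": "var_pheno_ann.tsv",
    "clinicalVariants.zip": "clinicalVariants.tsv",
    "drugLabels.zip": "drugLabels.byGene.tsv",
}


def extract_member(zip_path, file_name, dest_dir="."):
    """Copy the archive member whose base name is file_name into dest_dir."""
    dest = Path(dest_dir) / file_name
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if not info.is_dir() and Path(info.filename).name == file_name:
                with archive.open(info) as src, open(dest, "wb") as out:
                    shutil.copyfileobj(src, out)
                return dest
    raise FileNotFoundError(f"{file_name} not found in {zip_path}")


# Usage, run from the pharmgkb folder after downloading the three archives:
#   for zip_name, tsv_name in PHARMGKB_FILES.items():
#       extract_member(zip_name, tsv_name)
```

Only the named file is written, so the other contents of each archive do not clutter the working folder.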

Update Secondary Structures from UniProt

  1. Go to secondary_structure folder
  2. Open getSecondaryStructureFromUniprot.R
  3. Make sure the Global Variables are up-to-date
    1. API_ENDPOINT points to the UniProt API.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_DIR_PATH and OUTPUT_FILE_PATH are the output directory and the name of the output file.
  4. Run getSecondaryStructureFromUniprot.R
  5. Update version, release and access dates of UniProt in ../mavequest-importer/databaseVersions.json

Update BioGRID ORCS

  1. Go to biogrid_orcs folder
  2. Download the latest BioGRID ORCS data (BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz) from https://downloads.thebiogrid.org/File/BioGRID-ORCS/Latest-Release/BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz. Extract the archive and copy the screen folder to the working folder.
  3. Open processBioGridORCS.R
  4. Make sure the Global Variables are up-to-date
    1. INPUT_DIR_PATH and SCREEN_INDEX_FILE_PATH point to the input files.
    2. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    3. OUTPUT_GENE_FILE_PATH and OUTPUT_STUDY_FILE_PATH are the names of the output files.
  5. Run processBioGridORCS.R
  6. Update version, release and access dates of BioGRID ORCS in ../mavequest-importer/databaseVersions.json
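
The unpacking in step 2 of this section can be scripted with Python's tarfile module. This is a minimal sketch; the function name `extract_tar_gz` is hypothetical. The name of the extracted folder is not assumed: the function returns the top-level names found in the archive so that the screen folder can be located and copied.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive_path, dest_dir="."):
    """Extract a .tar.gz archive into dest_dir and return its top-level names."""
    dest = Path(dest_dir)
    top_level = set()
    with tarfile.open(archive_path, "r:gz") as archive:
        if hasattr(tarfile, "data_filter"):
            # Use the safer "data" extraction filter where Python supports it.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
        for member in archive.getmembers():
            parts = Path(member.name).parts
            if parts:
                top_level.add(parts[0])
    return sorted(top_level)


# Usage, run from the biogrid_orcs folder after the download:
#   extract_tar_gz("BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz")
```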

Update Priority Genes

  1. Go to prioritization folder
  2. Open fetchACMGList.R
  3. Make sure the Global Variables are up-to-date
    1. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    2. OUTPUT_FILE_PATH is the name of the output file.
  4. Run fetchACMGList.R
  5. Open processDAISList.R
  6. Make sure the Global Variables are up-to-date
    1. GENE_INFO_FILE_PATH points to the gene info file that contains the unique gene ID that will be attached to the output file.
    2. OUTPUT_FILE_PATH is the name of the output file.
  7. Run processDAISList.R