[MaveQuest] Update Source Data
- Repository: https://github.com/rothlab/mavequest-datasources
- Check out the `mavequest-datasources` and the `mavequest-importer` repositories:
  `gh repo clone rothlab/mavequest-datasources`
  `gh repo clone rothlab/mavequest-importer`
- Set `mavequest-datasources` as the working directory.

1. Go to the `geneInfo` folder.
2. Prepare the HGNC gene set. Download the current release from http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/locus_types/gene_with_protein_product.txt.
3. Open `parseHGNC.R`.
4. Make sure the Global Variables are up-to-date (access date):
   - `HGNC_DATABASE_FILE_PATH` points to the HGNC complete gene set that was downloaded in step 2.
   - `OUTPUT_FILE_PATH` is the name of the output file.
   - `CACHED_CANONICAL_FILE_PATHS` points to the cached canonical isoforms from the Ensembl and UniProt databases.
     - We recommend against using the cached isoform files. Each time you update this data source, you should fetch fresh canonical isoforms. However, because querying both databases takes ~2 hours, using the cached isoform files is faster when you need to debug this step.
     - If you do not want to use the cached file for a certain database, set the corresponding element to `NA`. For example, if you do not want to use cached Ensembl results, set the variable to:
       `CACHED_CANONICAL_FILE_PATHS = c("ensembl" = NA, "uniprot" = "cached_file.rds")`
5. Run `parseHGNC.R`.
6. Update the version and access date of HGNC in `../mavequest-importer/databaseVersions.json`.

1. Go to the `ambrygen` folder.
2. Open `scrapeAmbry.R`.
3. Make sure the Global Variables are up-to-date:
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `scrapeAmbry.R`.
5. Update the access date of Ambry Test Catalog in `../mavequest-importer/databaseVersions.json`.

1. Go to the `cancer_census` folder.
2. Prepare the Cancer Gene Census dataset. Download the current release from https://cancer.sanger.ac.uk/census using the "Export CSV" function. You will need a COSMIC account to download data.
3. Open `parseCancerCensus.R`.
4. Make sure the Global Variables are up-to-date:
   - `CANCER_CENSUS_FILE_PATH` points to the Cancer Census dataset that was downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseCancerCensus.R`.
6. Update the version, release date and access date of Cancer Gene Census in `../mavequest-importer/databaseVersions.json`.
   - You can find the version and release date on the COSMIC front page: https://cancer.sanger.ac.uk/cosmic

1. Go to the `clinvar` folder.
2. Prepare the ClinVar variant set. Download and unzip the current release from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz.
3. Open `parseClinvar.R`.
4. Make sure the Global Variables are up-to-date:
   - `CLINVAR_FILE_PATH` points to the ClinVar dataset that was downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseClinvar.R`.
6. Update the version and access date of ClinVar in `../mavequest-importer/databaseVersions.json`.
   - You can find the version (for ClinVar, it's the month and year of the release) in the release notes folder: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/release_notes/
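
The "download and unzip" step for ClinVar can be scripted. The following is a minimal sketch and not part of the repository: the function name `fetch_gz` is hypothetical, and it assumes `curl` and `gunzip` are installed.

```shell
# Hypothetical helper (not part of mavequest-datasources): download a
# gzipped file into a folder and decompress it there.
# Usage: fetch_gz URL DEST_DIR
fetch_gz() {
  local url="$1" dest_dir="$2"
  local archive="${dest_dir}/${url##*/}"   # file name = last URL segment
  mkdir -p "$dest_dir" || return 1
  # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
  curl -fsSL -o "$archive" "$url" || return 1
  gunzip -f "$archive" || return 1         # replaces X.gz with X
  printf '%s\n' "${archive%.gz}"           # print the path of the unpacked file
}
```

From the `mavequest-datasources` directory, `fetch_gz https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz clinvar` would leave `variant_summary.txt` in the `clinvar` folder; `CLINVAR_FILE_PATH` then needs to point at that file.
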

1. Go to the `genedx` folder.
2. Open `scrapeGeneDx.R`.
3. Make sure the Global Variables are up-to-date:
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `scrapeGeneDx.R`.
5. Update the access date of GeneDx Test Catalog in `../mavequest-importer/databaseVersions.json`.

As of Jan 2022, GenomeCRISPR is no longer updated by the maintainer and the download link is broken, so we use a previously downloaded version. We still need to run the update script to match genes with the new indices (generated in the Update Gene Info step).

Please still check the GenomeCRISPR website to see if there is a new version: http://genomecrispr.dkfz.de/

If there is no new version, we do not need to update `databaseVersions.json`.

1. Go to the `genome_crispr` folder.
2. Open `parseGenomeCRISPRDB.R`.
3. Make sure the Global Variables are up-to-date:
   - `GENOMECRISPR_FULL_FILE_PATH` points to the full GenomeCRISPR database that was released in May 2017.
   - `GENOMECRISPR_ADDITIONAL_FILE_PATH` points to the additional dataset that was shared with us by the authors in Sept 2019.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_HIT_FILE_PATH` and `OUTPUT_HITSUM_FILE_PATH` are the names of the output files.
4. Run `parseGenomeCRISPRDB.R`.

As of Jan 2022, GenomeRNAi is no longer updated by the maintainer and the download link is broken, so we use a previously downloaded version. We still need to run the update script to match genes with the new indices (generated in the Update Gene Info step).

Please still check the GenomeRNAi website to see if there is a new version: http://www.genomernai.org/

If there is no new version, we do not need to update `databaseVersions.json`.

1. Go to the `genome_rnai` folder.
2. Open `parseGenomeRNAi.R`.
3. Make sure the Global Variables are up-to-date:
   - `RNAi_FILE_PATH` points to the full GenomeRNAi database that was released in May 2017.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parseGenomeRNAi.R`.

We are using human protein-protein interactions from the Human Interactome Project (HIP, http://www.interactome-atlas.org/). The only exception is the literature-curated set Lit-BM. Because the Lit-BM data on the HIP website do not include essential metadata (e.g. type of interaction, source, discovery method), we use a Lit-BM file (Lit-BM-17) that was provided to us by the HIP maintainers.

Since the HuRI dataset was released in 2021, the maintainers have not added any new interactions. Please still check the HuRI website to see if new data has been released: http://www.interactome-atlas.org/

If there is no new data, we do not need to update `databaseVersions.json`.

1. Go to the `huri` folder.
2. Prepare the HIP dataset. Download HuRI.psi and HI-union.psi from http://www.interactome-atlas.org/download.
3. Open `parseHuRI.R`.
4. Make sure the Global Variables are up-to-date:
   - `HORF71_FILE_PATH` and `HORF81_FILE_PATH` point to the human ORFeome datasets, which are required to map ORF IDs to Gene Symbols for some datasets (e.g. HuRI.psi).
   - `MISSING_ORF_IDS_FILE_PATH` and `MISSING_GENE_SYMBOLS_FILE_PATH` point to manually mapped missing ORF IDs and Gene Symbols. These files help to map ORF IDs and Gene Symbols that cannot be mapped automatically.
   - `HURI_FILE_PATH`, `HI_UNION_FILE_PATH` and `LITMB_FILE_PATH` point to the HIP datasets downloaded in step 2.
   - `PUBMED_API_ACCESS_KEY` is set to the API access key from PubMed. The API key is required to query the NCBI API. See the documentation here: https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseHuRI.R`.

1. Go to the `interpro` folder.
2. Open `getInterpro.R`.
3. Make sure the Global Variables are up-to-date:
   - `API_GENE` and `API_ENTRY_INFO` point to the InterPro API endpoints.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `getInterpro.R`.
5. Update the version, release date and access date of InterPro in `../mavequest-importer/databaseVersions.json`.

1. Go to the `invitae` folder.
2. Open `scrapeInvitae.R`.
3. Make sure the Global Variables are up-to-date:
   - `INVITAE_CATALOGS`, `INVITAE_PREFIX` and `INVITAE_TEST_PREFIX` point to the Invitae website pages that we will crawl.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `scrapeInvitae.R`.
5. Update the access date of Invitae Testing Catalog in `../mavequest-importer/databaseVersions.json`.

1. Go to the `mavedb` folder.
2. Open `curateMaveDB.R`.
3. Make sure the Global Variables are up-to-date:
   - `API_MAVEDB` points to the MaveDB API.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `curateMaveDB.R`.
5. Update the version, release date and access date of MaveDB in `../mavequest-importer/databaseVersions.json`.

As of Jan 2022, the most recent version of OGEE (version 3) no longer provides a breakdown of human essential genes with associated studies. However, because version 3 does not include any new human essentiality studies, we can simply use the cached version 2 OGEE data dump.

You should still check the OGEE website in case new data is released: https://v3.ogee.info/#/home

If there is no new data, there is no need to update `databaseVersions.json`.

1. Go to the `ogee` folder.
2. Open `parseOGEE.R`.
3. Make sure the Global Variables are up-to-date:
   - `OGEE_GENES_FILE_PATH` and `OGEE_STUDIES_FILE_PATH` point to the OGEE data dump.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_GENES_FILE_PATH` and `OUTPUT_STUDIES_FILE_PATH` point to the output files.
4. Run `parseOGEE.R`.

As of Jan 2022, OMIM requires an account to download data. Make sure you submit your data request at https://omim.org/downloads. Once your data request is approved, you will receive an email from OMIM with the link to download the dataset (genemap2.txt).

1. Go to the `omim` folder.
2. Prepare the OMIM dataset by downloading `genemap2.txt` using the personalized link sent to you by the OMIM team.
3. Open `parseOMIM.R`.
4. Make sure the Global Variables are up-to-date:
   - `OMIM_FILE_PATH` points to the OMIM dataset downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseOMIM.R`.
6. Update the version, release date and access date of OMIM in `../mavequest-importer/databaseVersions.json`.

1. Go to the `orphanet` folder.
2. Prepare the Orphanet rare disease set. Download:
   - genes associated with rare diseases (en_product6.xml) from http://www.orphadata.org/data/xml/en_product6.xml.
   - rare disease prevalence (en_product9_prev.xml) from http://www.orphadata.org/data/xml/en_product9_prev.xml.
3. Open `parseOrphanet.R`.
4. Make sure the Global Variables are up-to-date:
   - `ORPHANET_GENES_FILE_PATH` and `ORPHANET_DISEASES_FILE_PATH` point to the Orphanet dataset downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseOrphanet.R`.
6. Update the version, release date and access date of Orphanet in `../mavequest-importer/databaseVersions.json`.
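
When a data source consists of several files, as with the two Orphanet XML files, the downloads can be grouped in one call. This is only a sketch: the helper name `fetch_all` is hypothetical and it assumes `curl` is installed.

```shell
# Hypothetical helper: download every URL given after DEST_DIR into
# DEST_DIR, keeping each file's original name. Stops at the first failure.
# Usage: fetch_all DEST_DIR URL [URL...]
fetch_all() {
  local dest_dir="$1" url
  shift
  mkdir -p "$dest_dir" || return 1
  for url in "$@"; do
    # -f: fail on HTTP errors, -sS: quiet but show errors, -L: follow redirects
    curl -fsSL -o "${dest_dir}/${url##*/}" "$url" || return 1
    echo "downloaded ${url##*/}"
  done
}

# Example call for Orphanet (run from the mavequest-datasources directory):
#   fetch_all orphanet \
#     http://www.orphadata.org/data/xml/en_product6.xml \
#     http://www.orphadata.org/data/xml/en_product9_prev.xml
```

Because the function returns on the first failed download, a broken link shows up immediately instead of leaving a partial dataset for the parser.
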

1. Go to the `orthology/agr` folder.
2. Prepare the AGR dataset. Download:
   - Human gene descriptions (tsv format) from https://www.alliancegenome.org/downloads#gene-descriptions.
   - Orthology (tsv format) from https://www.alliancegenome.org/downloads#orthology.
3. Open `parseAGR.R`.
4. Make sure the Global Variables are up-to-date:
   - `AGR_GENE_DESCRIPTION_FILE_PATH` and `AGR_ORTHOLOGY_FILE_PATH` point to the AGR dataset downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseAGR.R`.
6. Update the version, release date and access date of AGR in `../mavequest-importer/databaseVersions.json`.

As of Jan 2022, InParanoid is no longer updated, so we do not need to download new data. You should still check the InParanoid website in case new data is released: https://inparanoid.sbc.su.se/cgi-bin/index.cgi

If there is no new data, there is no need to update `databaseVersions.json`.

1. Go to the `orthology/inparanoid` folder.
2. Open `parseInparanoid.R`.
3. Make sure the Global Variables are up-to-date:
   - `INPARANOID_SCEREVISIAE_FILE_PATH` and `INPARANOID_SPOMBE_FILE_PATH` point to the InParanoid dataset.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parseInparanoid.R`.

As of Jan 2022, we add homologs from two papers: Yang et al., 2017 and Hamza et al., 2020. Two other papers (Kachroo et al., 2015 and Sun et al., 2016) were considered but not added because they are already in SGD.

Note: make sure you install OpenJDK and point the `JAVA_HOME` environment variable to the JDK directory.
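
Setting `JAVA_HOME` might look like the sketch below. The helper name `use_jdk` is hypothetical, and the JDK location depends on your operating system and on how OpenJDK was installed, so the path in the usage line is only an example.

```shell
# Hypothetical helper: point JAVA_HOME at a JDK directory after checking
# that the directory really contains a java executable.
# Usage: use_jdk /path/to/jdk
use_jdk() {
  local jdk_dir="$1"
  if [ ! -x "${jdk_dir}/bin/java" ]; then
    echo "No java executable under ${jdk_dir}/bin" >&2
    return 1
  fi
  export JAVA_HOME="$jdk_dir"
  export PATH="${JAVA_HOME}/bin:${PATH}"   # make this JDK's java the default
}
```

For example, `use_jdk /usr/lib/jvm/default-java` (adjust the path to your system). Start R from the same shell session afterwards so that the script inherits the variable.
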

1. Go to the `orthology/papers` folder.
2. Open `parseCompFromPapers.R`.
3. Make sure the Global Variables are up-to-date:
   - `PAPERS_FILE_PATHS` points to homologs from papers.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parseCompFromPapers.R`.

1. Go to the `orthology/pombase` folder.
2. Prepare the PomBase ortholog set. Download and unzip the current release from https://www.pombase.org/data/orthologs/human-orthologs.txt.gz.
3. Open `parsePombase.R`.
4. Make sure the Global Variables are up-to-date:
   - `POMBASE_FILE_PATH` points to the PomBase dataset that was downloaded in step 2.
   - `API_POMBASE` points to the PomBase API endpoint.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parsePombase.R`.
6. Update the release and access dates of PomBase in `../mavequest-importer/databaseVersions.json`.

As of Jan 2022, P-POD is no longer updated, so we do not need to download new data. You should still check the P-POD website in case new data is released: http://ppod.princeton.edu/

If there is no new data, there is no need to update `databaseVersions.json`.

1. Go to the `orthology/ppod` folder.
2. Open `parsePpod.R`.
3. Make sure the Global Variables are up-to-date:
   - `PPOD_FILE_PATH` points to the P-POD dataset.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parsePpod.R`.

1. Go to the `orthology/sgd` folder.
2. Prepare the SGD ortholog set. Download the current release from http://sgd-archive.yeastgenome.org/curation/literature/functional_complementation.tab.
3. Open `parseSGD.R`.
4. Make sure the Global Variables are up-to-date:
   - `SGD_FILE_PATH` points to the SGD dataset that was downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseSGD.R`.

1. Go to the `orthology` folder.
2. Open `mergeOrthology.R`.
3. Make sure the Global Variables are up-to-date:
   - `ORTHO_FILE_PATHs` point to the homology data sources.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `mergeOrthology.R`.

As of Jan 2022, we include over-expression data from five papers: Liu et al., 2007, Gilbert et al., 2014, Konermann et al., 2015, Horlbeck et al., 2016 and Duffy et al., 2016.

1. Go to the `overexpression` folder.
2. Open `parseOverexpression.R`.
3. Make sure the Global Variables are up-to-date:
   - `OVEREXPRESSION_FILE_PATH` points to the over-expression dataset.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parseOverexpression.R`.

1. Go to the `pharmgkb` folder.
2. Prepare the PharmGKB dataset. Download:
   - Variant annotation summary from https://api.pharmgkb.org/v1/download/file/data/variantAnnotations.zip. Unzip and copy the `var_pheno_ann.tsv` file to the folder.
   - Clinical variant data from https://api.pharmgkb.org/v1/download/file/data/clinicalVariants.zip. Unzip and copy the `clinicalVariants.tsv` file to the folder.
   - Drug label annotations from https://api.pharmgkb.org/v1/download/file/data/drugLabels.zip. Unzip and copy the `drugLabels.byGene.tsv` file to the folder.
3. Open `parsePharmGKB.R`.
4. Make sure the Global Variables are up-to-date:
   - `PGKB_VAR_ANNOTATIONS_FILE_PATH`, `PGKB_CLIN_VARIANTS_FILE_PATH` and `PGKB_DRUG_LABELS_FILE_PATH` point to the PharmGKB dataset downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parsePharmGKB.R`.
6. Update the release and access dates of PharmGKB in `../mavequest-importer/databaseVersions.json`.

1. Go to the `secondary_structure` folder.
2. Open `getSecondaryStructureFromUniprot.R`.
3. Make sure the Global Variables are up-to-date:
   - `API_ENDPOINT` points to the UniProt API.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_DIR_PATH` and `OUTPUT_FILE_PATH` are the names of the output directory and output file.
4. Run `getSecondaryStructureFromUniprot.R`.
5. Update the version, release date and access date of UniProt in `../mavequest-importer/databaseVersions.json`.

1. Go to the `biogrid_orcs` folder.
2. Download the latest BioGRID ORCS data (BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz) from: https://downloads.thebiogrid.org/File/BioGRID-ORCS/Latest-Release/BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz. Unzip and copy the screen folder to the working folder.
3. Open `processBioGridORCS.R`.
4. Make sure the Global Variables are up-to-date:
   - `INPUT_DIR_PATH` and `SCREEN_INDEX_FILE_PATH` point to the input files.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_GENE_FILE_PATH` and `OUTPUT_STUDY_FILE_PATH` are the names of the output files.
5. Run `processBioGridORCS.R`.
6. Update the version, release date and access date of BioGRID ORCS in `../mavequest-importer/databaseVersions.json`.
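
Unpacking the BioGRID ORCS archive could look like the sketch below. The function name `unpack_screens` and the destination folder are assumptions, and the layout inside the archive is not described here, so inspect the extracted files before setting `INPUT_DIR_PATH` and `SCREEN_INDEX_FILE_PATH`.

```shell
# Hypothetical helper: unpack a .tar.gz archive into a target folder and
# print how many files were extracted as a quick sanity check.
# Usage: unpack_screens ARCHIVE DEST_DIR
unpack_screens() {
  local archive="$1" dest_dir="$2"
  mkdir -p "$dest_dir" || return 1
  # -x: extract, -z: gunzip on the fly, -f: archive file, -C: target folder
  tar -xzf "$archive" -C "$dest_dir" || return 1
  find "$dest_dir" -type f | wc -l | tr -d ' '
}
```

For example, `unpack_screens BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz screens` run inside the `biogrid_orcs` folder, where `screens` is a placeholder folder name. A count of zero usually means the download was incomplete.
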

1. Go to the `prioritization` folder.
2. Open `fetchACMGList.R`.
3. Make sure the Global Variables are up-to-date:
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `fetchACMGList.R`.
5. Open `processDAISList.R`.
6. Make sure the Global Variables are up-to-date:
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
7. Run `processDAISList.R`.