[MaveQuest] Update Source Data
- Repository: https://github.com/rothlab/mavequest-datasources
- Check out the `mavequest-datasources` and the `mavequest-importer` repositories:
gh repo clone rothlab/mavequest-datasources
gh repo clone rothlab/mavequest-importer
- Set `mavequest-datasources` as the working directory
1. Go to `geneInfo` folder
2. Prepare the HGNC gene set. Download the current release from http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/locus_types/gene_with_protein_product.txt.
3. Open `parseHGNC.R`
4. Make sure the Global Variables are up-to-date (access date)
   - `HGNC_DATABASE_FILE_PATH` points to the HGNC complete gene set that was downloaded in step 2.
   - `OUTPUT_FILE_PATH` is the name of the output file.
   - `CACHED_CANONICAL_FILE_PATHS` points to the cached canonical isoforms from the Ensembl and UniProt databases.
     - We recommend against using the cached isoform files. Each time you update this data source, you should fetch fresh canonical isoforms. However, because it takes ~2 hours to query both databases, it is faster to use the cached isoform files if you have to debug this step.
     - If you do not want to use the cached file for a certain database, set the corresponding element to NA. For example, if you do not want to use cached Ensembl results, set the variable to:
       CACHED_CANONICAL_FILE_PATHS = c("ensembl" = NA, "uniprot" = "cached_file.rds")
5. Run `parseHGNC.R`
6. Update the version and access date of HGNC in `../mavequest-importer/databaseVersions.json`
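The download in step 2 can also be scripted. The following is a minimal Python sketch, not part of the repository: the helper name `download_release` is hypothetical, and it only fetches the file and reports today's date as the access date to record afterwards.

```python
import datetime
import pathlib
import shutil
import urllib.request

# URL taken from step 2 above.
HGNC_URL = (
    "http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/"
    "locus_types/gene_with_protein_product.txt"
)


def download_release(url, dest_dir="."):
    """Download `url` into `dest_dir`; return (local_path, access_date)."""
    dest_dir = pathlib.Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Keep the remote file name so the path variable in the R script stays predictable.
    local_path = dest_dir / url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as response, open(local_path, "wb") as out:
        shutil.copyfileobj(response, out)
    # ISO date (YYYY-MM-DD) to note down as the access date.
    access_date = datetime.date.today().isoformat()
    return local_path, access_date


# Example call (needs network access), run from mavequest-datasources:
#   path, accessed = download_release(HGNC_URL, "geneInfo")
```

The returned path is what `HGNC_DATABASE_FILE_PATH` would then need to point to.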
1. Go to `ambrygen` folder
2. Open `scrapeAmbry.R`
3. Make sure the Global Variables are up-to-date
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `scrapeAmbry.R`
5. Update the access date of Ambry Test Catalog in `../mavequest-importer/databaseVersions.json`
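Updating `databaseVersions.json` by hand is easy to get wrong, so a small helper may be useful. The sketch below is hedged: the layout of `databaseVersions.json` is not shown in this issue, so the structure used here (one object per source, with keys such as `access_date` and `version`) and the source key `"ambry"` are assumptions that must be adapted to the real file.

```python
import datetime
import json


def update_database_version(json_path, source, access_date=None, **fields):
    """Update one source entry and write the file back.

    ASSUMPTION: the file is a JSON object keyed by source name, and each
    entry is an object with fields like "version" or "access_date".
    """
    with open(json_path, encoding="utf-8") as handle:
        versions = json.load(handle)
    entry = versions.setdefault(source, {})
    entry.update(fields)  # e.g. version="...", release_date="..."
    entry["access_date"] = access_date or datetime.date.today().isoformat()
    with open(json_path, "w", encoding="utf-8") as handle:
        json.dump(versions, handle, indent=2)
        handle.write("\n")
    return entry


# Example call with a hypothetical source key:
#   update_database_version("../mavequest-importer/databaseVersions.json", "ambry")
```

Only the named entry is touched; all other sources are written back unchanged.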
1. Go to `cancer_census` folder
2. Prepare the Cancer Gene Census dataset. Download the current release from https://cancer.sanger.ac.uk/census using the "Export CSV" function. Note: you will need a COSMIC account to download data.
3. Open `parseCancerCensus.R`
4. Make sure the Global Variables are up-to-date
   - `CANCER_CENSUS_FILE_PATH` points to the Cancer Census dataset that was downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseCancerCensus.R`
6. Update the version, release date and access date of Cancer Gene Census in `../mavequest-importer/databaseVersions.json`
   - You can find the version and release date on the COSMIC front page: https://cancer.sanger.ac.uk/cosmic
1. Go to `clinvar` folder
2. Prepare the ClinVar variant set. Download and unzip the current release from https://ftp.ncbi.nlm.nih.gov/pub/clinvar/tab_delimited/variant_summary.txt.gz.
3. Open `parseClinvar.R`
4. Make sure the Global Variables are up-to-date
   - `CLINVAR_FILE_PATH` points to the ClinVar dataset that was downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseClinvar.R`
6. Update the version and access date of ClinVar in `../mavequest-importer/databaseVersions.json`
   - You can find the version (for ClinVar, it's the month and year of the release) in the release notes folder: https://ftp.ncbi.nlm.nih.gov/pub/clinvar/release_notes/
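The "download and unzip" step for `variant_summary.txt.gz` can be done without a separate tool. This is a minimal Python sketch under the assumption that the `.gz` file has already been downloaded into the `clinvar` folder; the helper name `gunzip` is hypothetical.

```python
import gzip
import shutil


def gunzip(gz_path, out_path=None):
    """Decompress a .gz file; by default write next to it without the suffix."""
    gz_path = str(gz_path)
    if out_path is None:
        if not gz_path.endswith(".gz"):
            raise ValueError("expected a .gz file: " + gz_path)
        out_path = gz_path[:-3]
    # Stream the content so the large ClinVar table is never held in memory.
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return out_path


# Example call:
#   gunzip("clinvar/variant_summary.txt.gz")  # writes clinvar/variant_summary.txt
```

The returned path is the file that `CLINVAR_FILE_PATH` should reference.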
1. Go to `genedx` folder
2. Open `scrapeGeneDx.R`
3. Make sure the Global Variables are up-to-date
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `scrapeGeneDx.R`
5. Update the access date of GeneDx Test Catalog in `../mavequest-importer/databaseVersions.json`
As of Jan 2022, GenomeCRISPR is no longer updated by its maintainer and the download link is broken, so we use a previously downloaded version. Note: we still need to run the update script to match genes with the new indices (generated in the Update Gene Info step).
Please still check the GenomeCRISPR website to see if there is a new version: http://genomecrispr.dkfz.de/
If there is no new version, we do not need to update `databaseVersions.json`.
1. Go to `genome_crispr` folder
2. Open `parseGenomeCRISPRDB.R`
3. Make sure the Global Variables are up-to-date
   - `GENOMECRISPR_FULL_FILE_PATH` points to the full GenomeCRISPR database that was released in May 2017.
   - `GENOMECRISPR_ADDITIONAL_FILE_PATH` points to the additional dataset that was shared with us by the authors in Sept 2019.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_HIT_FILE_PATH` and `OUTPUT_HITSUM_FILE_PATH` are the names of the output files.
4. Run `parseGenomeCRISPRDB.R`
As of Jan 2022, GenomeRNAi is no longer updated by its maintainer and the download link is broken, so we use a previously downloaded version. Note: we still need to run the update script to match genes with the new indices (generated in the Update Gene Info step).
Please still check the GenomeRNAi website to see if there is a new version: http://www.genomernai.org/
If there is no new version, we do not need to update `databaseVersions.json`.
1. Go to `genome_rnai` folder
2. Open `parseGenomeRNAi.R`
3. Make sure the Global Variables are up-to-date
   - `RNAi_FILE_PATH` points to the full GenomeRNAi database that was released in May 2017.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parseGenomeRNAi.R`
We are using human protein-protein interactions from the Human Interactome Project (HIP, http://www.interactome-atlas.org/). The only exception is the literature-curated set Lit-BM. Because the Lit-BM data on the HIP website do not include essential metadata (e.g. type of interaction, source, discovery method), we use a Lit-BM file (Lit-BM-17) that was provided to us by the HIP maintainers.
Since the HuRI dataset was released in 2021, the maintainers have not added any new interactions. Please still check the HuRI website to see if new data have been released: http://www.interactome-atlas.org/
If there are no new data, we do not need to update `databaseVersions.json`.
1. Go to `huri` folder
2. Prepare the HIP interaction set. Download HuRI.psi and HI-union.psi from http://www.interactome-atlas.org/download.
3. Open `parseHuRI.R`
4. Make sure the Global Variables are up-to-date
   - `HORF71_FILE_PATH` and `HORF81_FILE_PATH` point to the human ORFeome datasets, which are required to map ORF IDs to gene symbols for some datasets (e.g. HuRI.psi).
   - `MISSING_ORF_IDS_FILE_PATH` and `MISSING_GENE_SYMBOLS_FILE_PATH` point to manually mapped missing ORF IDs and gene symbols. These files help to map ORF IDs and gene symbols that cannot be mapped automatically.
   - `HURI_FILE_PATH`, `HI_UNION_FILE_PATH` and `LITMB_FILE_PATH` point to the HIP datasets downloaded in step 2.
   - `PUBMED_API_ACCESS_KEY` is set to the API access key from PubMed. The API key is required to query the NCBI API. See the documentation here: https://support.nlm.nih.gov/knowledgebase/article/KA-05317/en-us
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseHuRI.R`
1. Go to `interpro` folder
2. Open `getInterpro.R`
3. Make sure the Global Variables are up-to-date
   - `API_GENE` and `API_ENTRY_INFO` point to the InterPro API endpoints.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `getInterpro.R`
5. Update the version, release date and access date of InterPro in `../mavequest-importer/databaseVersions.json`
1. Go to `invitae` folder
2. Open `scrapeInvitae.R`
3. Make sure the Global Variables are up-to-date
   - `INVITAE_CATALOGS`, `INVITAE_PREFIX` and `INVITAE_TEST_PREFIX` point to the Invitae website pages that we will crawl.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `scrapeInvitae.R`
5. Update the access date of Invitae Testing Catalog in `../mavequest-importer/databaseVersions.json`
1. Go to `mavedb` folder
2. Open `curateMaveDB.R`
3. Make sure the Global Variables are up-to-date
   - `API_MAVEDB` points to the MaveDB API.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `curateMaveDB.R`
5. Update the version, release date and access date of MaveDB in `../mavequest-importer/databaseVersions.json`
As of Jan 2022, the most recent version of OGEE (version 3) no longer provides a breakdown of human essential genes with associated studies. However, because version 3 does not include any new human essentiality studies, we can simply use the cached version 2 OGEE data dump.
You should still check the OGEE website in case new data have been released: https://v3.ogee.info/#/home
If there are no new data, there is no need to update `databaseVersions.json`.
1. Go to `ogee` folder
2. Open `parseOGEE.R`
3. Make sure the Global Variables are up-to-date
   - `OGEE_GENES_FILE_PATH` and `OGEE_STUDIES_FILE_PATH` point to the OGEE data dump.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_GENES_FILE_PATH` and `OUTPUT_STUDIES_FILE_PATH` point to the output files.
4. Run `parseOGEE.R`
As of Jan 2022, OMIM requires an account to download data. Make sure you submit your data request at https://omim.org/downloads. Once your data request is approved, you will receive an email from OMIM with the link to download the dataset (genemap2.txt).
1. Go to `omim` folder
2. Prepare the OMIM dataset by downloading `genemap2.txt` using the personalized link sent to you by the OMIM team.
3. Open `parseOMIM.R`
4. Make sure the Global Variables are up-to-date
   - `OMIM_FILE_PATH` points to the OMIM dataset downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseOMIM.R`
6. Update the version, release date and access date of OMIM in `../mavequest-importer/databaseVersions.json`
1. Go to `orphanet` folder
2. Prepare the Orphanet rare disease set. Download:
   - genes associated with rare diseases (en_product6.xml) from http://www.orphadata.org/data/xml/en_product6.xml.
   - rare disease prevalence (en_product9_prev.xml) from http://www.orphadata.org/data/xml/en_product9_prev.xml.
3. Open `parseOrphanet.R`
4. Make sure the Global Variables are up-to-date
   - `ORPHANET_GENES_FILE_PATH` and `ORPHANET_DISEASES_FILE_PATH` point to the Orphanet datasets downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseOrphanet.R`
6. Update the version, release date and access date of Orphanet in `../mavequest-importer/databaseVersions.json`
1. Go to `orthology/agr` folder
2. Prepare the AGR dataset. Download:
   - Human gene descriptions (tsv format) from https://www.alliancegenome.org/downloads#gene-descriptions.
   - Orthology (tsv format) from https://www.alliancegenome.org/downloads#orthology.
3. Open `parseAGR.R`
4. Make sure the Global Variables are up-to-date
   - `AGR_GENE_DESCRIPTION_FILE_PATH` and `AGR_ORTHOLOGY_FILE_PATH` point to the AGR datasets downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseAGR.R`
6. Update the version, release date and access date of AGR in `../mavequest-importer/databaseVersions.json`
As of Jan 2022, InParanoid is no longer updated, so we do not need to download new data. You should still check the InParanoid website in case new data have been released: https://inparanoid.sbc.su.se/cgi-bin/index.cgi
If there are no new data, there is no need to update `databaseVersions.json`.
1. Go to `orthology/inparanoid` folder
2. Open `parseInparanoid.R`
3. Make sure the Global Variables are up-to-date
   - `INPARANOID_SCEREVISIAE_FILE_PATH` and `INPARANOID_SPOMBE_FILE_PATH` point to the InParanoid datasets.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parseInparanoid.R`
As of Jan 2022, we add homologs from two papers: Yang et al., 2017 and Hamza et al., 2020. Two other papers (Kachroo et al., 2015 and Sun et al., 2016) were considered but not added because they are already in SGD.
Note: make sure you install OpenJDK and point the `JAVA_HOME` environment variable to the JDK directory.
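Before running the script, it can help to confirm that `JAVA_HOME` is actually set and points at a JDK. The following is a minimal Python sketch; the helper name `check_java_home` is hypothetical, and the check only looks for a `java` executable under `bin`, which is an assumption about a typical JDK layout.

```python
import os
import pathlib


def check_java_home(environ=None):
    """Return (ok, detail) describing whether JAVA_HOME looks usable."""
    environ = os.environ if environ is None else environ
    java_home = environ.get("JAVA_HOME")
    if not java_home:
        return False, "JAVA_HOME is not set"
    # A JDK normally ships its launcher at <JAVA_HOME>/bin/java (java.exe on Windows).
    exe = "java.exe" if os.name == "nt" else "java"
    java_bin = pathlib.Path(java_home) / "bin" / exe
    if not java_bin.is_file():
        return False, f"no java executable at {java_bin}"
    return True, str(java_bin)


# Example call:
#   ok, detail = check_java_home()
#   print("OK" if ok else "Problem", detail)
```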
1. Go to `orthology/papers` folder
2. Open `parseCompFromPapers.R`
3. Make sure the Global Variables are up-to-date
   - `PAPERS_FILE_PATHS` points to the homologs from papers.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parseCompFromPapers.R`
1. Go to `orthology/pombase` folder
2. Prepare the PomBase ortholog set. Download and unzip the current release from https://www.pombase.org/data/orthologs/human-orthologs.txt.gz.
3. Open `parsePombase.R`
4. Make sure the Global Variables are up-to-date
   - `POMBASE_FILE_PATH` points to the PomBase dataset that was downloaded in step 2.
   - `API_POMBASE` points to the PomBase API endpoint.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parsePombase.R`
6. Update the release and access dates of PomBase in `../mavequest-importer/databaseVersions.json`
As of Jan 2022, P-POD is no longer updated, so we do not need to download new data. You should still check the P-POD website in case new data have been released: http://ppod.princeton.edu/
If there are no new data, there is no need to update `databaseVersions.json`.
1. Go to `orthology/ppod` folder
2. Open `parsePpod.R`
3. Make sure the Global Variables are up-to-date
   - `PPOD_FILE_PATH` points to the P-POD dataset.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parsePpod.R`
1. Go to `orthology/sgd` folder
2. Prepare the SGD ortholog set. Download the current release from http://sgd-archive.yeastgenome.org/curation/literature/functional_complementation.tab.
3. Open `parseSGD.R`
4. Make sure the Global Variables are up-to-date
   - `SGD_FILE_PATH` points to the SGD dataset that was downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parseSGD.R`
1. Go to `orthology` folder
2. Open `mergeOrthology.R`
3. Make sure the Global Variables are up-to-date
   - `ORTHO_FILE_PATHs` points to the homology data sources.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `mergeOrthology.R`
As of Jan 2022, we add over-expression data from five papers: Liu et al., 2007, Gilbert et al., 2014, Konermann et al., 2015, Horlbeck et al., 2016 and Duffy et al., 2016.
1. Go to `overexpression` folder
2. Open `parseOverexpression.R`
3. Make sure the Global Variables are up-to-date
   - `OVEREXPRESSION_FILE_PATH` points to the over-expression dataset.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `parseOverexpression.R`
1. Go to `pharmgkb` folder
2. Prepare the PharmGKB dataset. Download:
   - Variant annotation summary from https://api.pharmgkb.org/v1/download/file/data/variantAnnotations.zip. Unzip and copy the `var_pheno_ann.tsv` file to the folder.
   - Clinical variant data from https://api.pharmgkb.org/v1/download/file/data/clinicalVariants.zip. Unzip and copy the `clinicalVariants.tsv` file to the folder.
   - Drug label annotations from https://api.pharmgkb.org/v1/download/file/data/drugLabels.zip. Unzip and copy the `drugLabels.byGene.tsv` file to the folder.
3. Open `parsePharmGKB.R`
4. Make sure the Global Variables are up-to-date
   - `PGKB_VAR_ANNOTATIONS_FILE_PATH`, `PGKB_CLIN_VARIANTS_FILE_PATH` and `PGKB_DRUG_LABELS_FILE_PATH` point to the PharmGKB datasets downloaded in step 2.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
5. Run `parsePharmGKB.R`
6. Update the release and access dates of PharmGKB in `../mavequest-importer/databaseVersions.json`
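Each PharmGKB archive in step 2 contains several files, but only one is needed. The sketch below, written in Python as an illustration, copies a single named file out of a downloaded zip; the helper name `extract_member` is hypothetical, and it assumes the archives were already saved locally.

```python
import pathlib
import shutil
import zipfile


def extract_member(zip_path, member_name, dest_dir="."):
    """Copy the archive entry whose base name is `member_name` into `dest_dir`."""
    dest_dir = pathlib.Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / member_name
    with zipfile.ZipFile(zip_path) as archive:
        # Match on the base name so the entry is found even inside a subfolder.
        matches = [n for n in archive.namelist() if n.rsplit("/", 1)[-1] == member_name]
        if not matches:
            raise FileNotFoundError(f"{member_name} not found in {zip_path}")
        with archive.open(matches[0]) as src, open(dest, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return dest


# Example calls, run from mavequest-datasources:
#   extract_member("variantAnnotations.zip", "var_pheno_ann.tsv", "pharmgkb")
#   extract_member("clinicalVariants.zip", "clinicalVariants.tsv", "pharmgkb")
#   extract_member("drugLabels.zip", "drugLabels.byGene.tsv", "pharmgkb")
```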
1. Go to `secondary_structure` folder
2. Open `getSecondaryStructureFromUniprot.R`
3. Make sure the Global Variables are up-to-date
   - `API_ENDPOINT` points to the UniProt API.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_DIR_PATH` and `OUTPUT_FILE_PATH` are the output directory and the name of the output file.
4. Run `getSecondaryStructureFromUniprot.R`
5. Update the version, release date and access date of UniProt in `../mavequest-importer/databaseVersions.json`
1. Go to `biogrid_orcs` folder
2. Download the latest BioGRID ORCS data (BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz) from: https://downloads.thebiogrid.org/File/BioGRID-ORCS/Latest-Release/BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz. Unzip and copy the screen folder to the working folder.
3. Open `processBioGridORCS.R`
4. Make sure the Global Variables are up-to-date
   - `INPUT_DIR_PATH` and `SCREEN_INDEX_FILE_PATH` point to the input files.
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_GENE_FILE_PATH` and `OUTPUT_STUDY_FILE_PATH` are the names of the output files.
5. Run `processBioGridORCS.R`
6. Update the version, release date and access date of BioGRID ORCS in `../mavequest-importer/databaseVersions.json`
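The BioGRID ORCS download is a `.tar.gz` archive rather than a zip. As a hedged illustration, the Python sketch below unpacks such an archive into a target folder and lists the files it contained; the helper name `extract_screens` is hypothetical, and the internal layout of the BioGRID archive is not assumed.

```python
import pathlib
import tarfile


def extract_screens(archive_path, dest_dir):
    """Unpack a .tar.gz archive into `dest_dir`; return the sorted file names."""
    dest = pathlib.Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        members = archive.getmembers()
        # Refuse entries that would be written outside the destination folder.
        for member in members:
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest)
    return sorted(m.name for m in members if m.isfile())


# Example call, run from mavequest-datasources:
#   files = extract_screens(
#       "BIOGRID-ORCS-ALL-homo_sapiens-LATEST.screens.tar.gz", "biogrid_orcs")
#   print(len(files), "files extracted")
```

Comparing the number of extracted files between releases is a quick sanity check before pointing `INPUT_DIR_PATH` at the new folder.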
1. Go to `prioritization` folder
2. Open `fetchACMGList.R`
3. Make sure the Global Variables are up-to-date
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
4. Run `fetchACMGList.R`
5. Open `processDAISList.R`
6. Make sure the Global Variables are up-to-date
   - `GENE_INFO_FILE_PATH` points to the gene info file that contains the unique gene ID that will be attached to the output file.
   - `OUTPUT_FILE_PATH` is the name of the output file.
7. Run `processDAISList.R`
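Once every script's Global Variables have been reviewed, the "Run" steps above could be chained. The following Python sketch is only an illustration: it assumes each script can be run non-interactively with `Rscript` from inside its own folder (not confirmed by this issue), and the names `STEPS`, `build_commands` and `run_all` are hypothetical. The order mirrors this issue: gene info first, and `mergeOrthology.R` after the individual orthology sources.

```python
import os
import subprocess

# (folder, script) pairs in the order they appear in this issue.
STEPS = [
    ("geneInfo", "parseHGNC.R"),
    ("ambrygen", "scrapeAmbry.R"),
    ("cancer_census", "parseCancerCensus.R"),
    ("clinvar", "parseClinvar.R"),
    ("genedx", "scrapeGeneDx.R"),
    ("genome_crispr", "parseGenomeCRISPRDB.R"),
    ("genome_rnai", "parseGenomeRNAi.R"),
    ("huri", "parseHuRI.R"),
    ("interpro", "getInterpro.R"),
    ("invitae", "scrapeInvitae.R"),
    ("mavedb", "curateMaveDB.R"),
    ("ogee", "parseOGEE.R"),
    ("omim", "parseOMIM.R"),
    ("orphanet", "parseOrphanet.R"),
    ("orthology/agr", "parseAGR.R"),
    ("orthology/inparanoid", "parseInparanoid.R"),
    ("orthology/papers", "parseCompFromPapers.R"),
    ("orthology/pombase", "parsePombase.R"),
    ("orthology/ppod", "parsePpod.R"),
    ("orthology/sgd", "parseSGD.R"),
    ("orthology", "mergeOrthology.R"),
    ("overexpression", "parseOverexpression.R"),
    ("pharmgkb", "parsePharmGKB.R"),
    ("secondary_structure", "getSecondaryStructureFromUniprot.R"),
    ("biogrid_orcs", "processBioGridORCS.R"),
    ("prioritization", "fetchACMGList.R"),
    ("prioritization", "processDAISList.R"),
]


def build_commands(steps=STEPS):
    """Return (folder, argv) pairs without running anything."""
    return [(folder, ["Rscript", script]) for folder, script in steps]


def run_all(root=".", steps=STEPS):
    """Run each script from its own folder; stop at the first failure."""
    for folder, argv in build_commands(steps):
        subprocess.run(argv, cwd=os.path.join(root, folder), check=True)


# Example call, run from mavequest-datasources:
#   run_all(".")
```

Because `check=True` aborts on the first non-zero exit status, a failed source does not silently feed stale output into later steps.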