Execute src/app/run.py
This app was developed under the supervision of Dr. Nicolas Ruffini as a side
project of his multi-omics meta-analyses of neurodegenerative diseases (Ruffini et al., 2020;
Ruffini et al., 2022).
The objective is to collect transcriptomics,
proteomics, and genomics data on neurodegenerative diseases from various studies and
databases, and to standardize the structure across all datasets to ensure uniformity and,
consequently, to facilitate further data analysis. The MIND NDD app serves as an interactive
tool for visualizing, searching, filtering and downloading that data.
Ruffini N, Klingenberg S, Heese R, Schweiger S, Gerber S. The Big Picture of Neurodegeneration: A Meta Study to Extract the Essential Evidence on Neurodegenerative Diseases in a Network-Based Approach. Front Aging Neurosci. 2022;14:866886. doi: 10.3389/fnagi.2022.866886. PMID: 35832065; PMCID: PMC9271745.
Ruffini N, Klingenberg S, Schweiger S, Gerber S. Common Factors in Neurodegeneration: A Meta-Study revealing Shared Patterns on a Multi-Omics Scale. Cells. 2020;9(12):2642. doi: 10.3390/cells9122642.
The following information is found on the "Information"-page of the App:
Please note that this code has not undergone a formal review. Errors may have occurred during data collection and processing.
All data can be directly retrieved from the NDDs.db of this repository.
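
As a rough illustration of retrieving data directly, the sketch below lists the tables of a database and previews a few rows of one table. It assumes that NDDs.db is a SQLite file, which is not stated here. The helper names `list_tables` and `preview_table` are made up for this example, and no table names of the real database are assumed.

```python
import sqlite3


def list_tables(conn: sqlite3.Connection) -> list[str]:
    """Return the names of all tables in the connected SQLite database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [name for (name,) in rows]


def preview_table(conn: sqlite3.Connection, table: str, limit: int = 5) -> list[tuple]:
    """Return the first `limit` rows of `table`."""
    # Table names cannot be bound as SQL parameters, so only accept known names.
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    return conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,)).fetchall()
```

With the repository checked out, `conn = sqlite3.connect("NDDs.db")` followed by `list_tables(conn)` would show which tables are available. The path to the file is an assumption and may need adjusting.
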
All data-handling steps described below (except data collection) were performed with Python 3.12, mainly using the pandas, sqlalchemy, and dash packages. Details can be found in the repository.
An online literature search was performed to obtain multi-omics data on neurodegenerative diseases. Only human samples were considered. Keywords such as “Alzheimer”, “Parkinson”, and “Huntington” were used to find suitable studies. The data sources of the meta-analysis by Ruffini et al. (2020) served as a starting point and were included if the data was publicly available. Different search tools / databases were used:
The data-processing workflow depends on the omics level.
- Extract the metadata on the dataset / study
  - Manual process; the metadata must be worked out from the dataset / study information provided along with the data
- Read the data
  - Necessary attributes: gene symbol, UniProt accession ID, p-value, fold change
- Drop data points if
  - the gene symbol, p-value or fold change is NaN
  - the p-value is not below 0.05 (only data points with p < 0.05 are kept)
  - the gene symbol cannot be found in the HGNC database
    - If the gene symbol is an “alias” or a “previous” symbol, it is replaced by the approved symbol. The database for this was manually downloaded and is not updated automatically.
- Look up the UBERON ID of the sample tissue(s) on
  - OLS website
  - This could be done automatically; however, for efficiency reasons, the manual lookup was preferred.
- Look up the CL ID of the cell type(s) on
  - OLS website
  - Only needed for single-cell / single-nucleus sequencing experiments
- Only HGNC-approved genes end up in the transcriptomics table. Of those genes, only the ones for which the MyGenes package finds an entry end up in the genes table.
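
The filtering steps for a transcriptomics dataset can be sketched with pandas roughly as follows. This is an illustration under assumptions, not the code used in the repository: the column names (`gene_symbol`, `p_value`, `log2_fold_change`), the layout of the HGNC table (`symbol`, `alias_symbol`, `prev_symbol`, with `|` as separator) and the gene symbols are placeholders, and the sketch assumes that only data points with a p-value below 0.05 are meant to be kept.

```python
import pandas as pd

# Hypothetical column names; a real dataset would be renamed to match first.
REQUIRED = ["gene_symbol", "p_value", "log2_fold_change"]


def build_symbol_map(hgnc: pd.DataFrame) -> dict[str, str]:
    """Map alias, previous and approved symbols to the approved symbol."""
    mapping: dict[str, str] = {}
    # Aliases and previous symbols go in first so that approved symbols
    # overwrite them if the same string appears in both roles.
    for column in ("alias_symbol", "prev_symbol"):
        for approved, values in zip(hgnc["symbol"], hgnc[column]):
            if isinstance(values, str):  # skips missing entries
                for value in values.split("|"):
                    mapping[value.strip()] = approved
    for approved in hgnc["symbol"]:
        mapping[approved] = approved
    return mapping


def clean_dataset(
    data: pd.DataFrame, symbol_map: dict[str, str], alpha: float = 0.05
) -> pd.DataFrame:
    """Apply the drop rules: missing values, p-value cutoff, unknown symbols."""
    cleaned = data.dropna(subset=REQUIRED)
    cleaned = cleaned[cleaned["p_value"] < alpha].copy()
    # Symbols missing from the map become NaN and are dropped afterwards.
    cleaned["gene_symbol"] = cleaned["gene_symbol"].map(symbol_map)
    return cleaned.dropna(subset=["gene_symbol"]).reset_index(drop=True)


# Placeholder HGNC excerpt and dataset (not real gene symbols).
hgnc = pd.DataFrame(
    {
        "symbol": ["GENE1", "GENE2"],
        "alias_symbol": ["G1A|G1B", None],
        "prev_symbol": [None, "OLD2"],
    }
)
data = pd.DataFrame(
    {
        "gene_symbol": ["GENE1", "G1A", "OLD2", "UNKNOWN", None],
        "p_value": [0.01, 0.2, 0.03, 0.001, 0.01],
        "log2_fold_change": [1.5, -0.3, -2.0, 0.8, 1.1],
    }
)

symbol_map = build_symbol_map(hgnc)
cleaned = clean_dataset(data, symbol_map)
print(cleaned)
```

In this toy input, the row with the previous symbol `OLD2` is kept under the approved symbol `GENE2`, while the rows with a missing symbol, an unknown symbol, or a p-value of 0.2 are dropped.
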
- Extract the metadata on the dataset / study
  - Manual process; the metadata must be worked out from the dataset / study information provided along with the data
- Read the data
  - Necessary attributes: gene symbol, UniProt accession ID, p-value, fold change
- Wrangle the data so that
  - each data point refers to exactly one protein
    - Some proteomics datasets list multiple genes / proteins for one data point. However, the data model of this database expects one unique foreign key in the proteomics table that references one gene / protein only. Such a data point is therefore duplicated for each gene-protein pair.
- Drop data points if
  - the gene symbol, p-value or fold change is NaN
  - the p-value is not below 0.05 (only data points with p < 0.05 are kept)
  - the gene symbol cannot be found in the HGNC database
    - If the gene symbol is an “alias” or a “previous” symbol, it is replaced by the approved symbol. The database for this was manually downloaded and is not updated automatically.
- If the base of the log fold change is not already 2, convert it (e.g. from log10 FC to log2 FC)
- Only HGNC-approved genes end up in the proteomics table. Of those genes, only the ones for which the MyGenes package finds an entry end up in the genes table.
- All proteins within the dataset end up in the proteomics table. The proteins table records the status of each protein (“reviewed”, “not found”, “unreviewed” or “inactive”) and whether the protein is an isoform; both can be used for customized filtering.
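
The two proteomics-specific steps, one protein per data point and conversion to log2 fold change, might look like this in pandas. The column names, the `;` separator and the identifiers are placeholders, and the sketch assumes that genes and proteins are listed pairwise in the same order within a row. The conversion uses the identity log2(x) = log_b(x) · log2(b).

```python
import math

import pandas as pd


def one_protein_per_row(data: pd.DataFrame, sep: str = ";") -> pd.DataFrame:
    """Duplicate rows listing several gene/protein pairs, one pair per row."""
    out = data.copy()
    out["gene_symbol"] = out["gene_symbol"].str.split(sep)
    out["uniprot_id"] = out["uniprot_id"].str.split(sep)
    # Exploding several columns at once (pandas >= 1.3) keeps the pairs
    # aligned; it requires the same number of entries in both columns per row.
    out = out.explode(["gene_symbol", "uniprot_id"], ignore_index=True)
    out["gene_symbol"] = out["gene_symbol"].str.strip()
    out["uniprot_id"] = out["uniprot_id"].str.strip()
    return out


def to_log2(log_fc: pd.Series, base: float) -> pd.Series:
    """Convert log fold changes from an arbitrary base to base 2."""
    return log_fc * math.log2(base)


# Placeholder data (not real gene symbols or accession IDs).
data = pd.DataFrame(
    {
        "gene_symbol": ["GENE1;GENE2", "GENE3"],
        "uniprot_id": ["ACC1;ACC2", "ACC3"],
        "log10_fold_change": [1.0, -0.5],
        "p_value": [0.01, 0.02],
    }
)

rows = one_protein_per_row(data)
rows["log2_fold_change"] = to_log2(rows["log10_fold_change"], base=10)
print(rows)
```

The first input row becomes two rows that share the same p-value and fold change, which matches the duplication described above.
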
- The GWAS data for various neurodegenerative diseases is downloaded from the website.
  - This is a manual process and not updated automatically.
- The same is done for the Open Targets Platform.
  - This is a manual process and not updated automatically.
- The data are merged so that the association scores can be used for filtering the genomics data.
- Drop data points if
  - the gene symbol cannot be found in the HGNC database
    - If the gene symbol is an “alias” or a “previous” symbol, it is replaced by the approved symbol. The database for this was manually downloaded and is not updated automatically.
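
A minimal sketch of the merge step, assuming both downloads have already been loaded into data frames. The join keys (`gene_symbol`, `disease`), the other column names, the example values and the score cutoff are all hypothetical; the keys actually used in the repository may differ.

```python
import pandas as pd


def merge_association_scores(gwas: pd.DataFrame, scores: pd.DataFrame) -> pd.DataFrame:
    """Attach association scores to GWAS rows, keeping every GWAS row."""
    # A left join keeps GWAS rows without a score; their score is NaN.
    return gwas.merge(scores, on=["gene_symbol", "disease"], how="left")


# Placeholder data (not real variants, genes or scores).
gwas = pd.DataFrame(
    {
        "gene_symbol": ["GENE1", "GENE2", "GENE3"],
        "variant_id": ["variant_1", "variant_2", "variant_3"],
        "disease": ["disease A", "disease A", "disease B"],
    }
)
scores = pd.DataFrame(
    {
        "gene_symbol": ["GENE1", "GENE2"],
        "disease": ["disease A", "disease A"],
        "association_score": [0.8, 0.2],
    }
)

merged = merge_association_scores(gwas, scores)
# Filtering by score; rows without a score compare as False and drop out.
filtered = merged[merged["association_score"] >= 0.5]
print(filtered)
```

A left join is used here so that variants without an association score stay in the merged table and are only excluded once a score filter is applied.
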
- Single Tissue eQTL
  - All gene variants collected from the genomics data are used to create GTEx request JSON files. The corresponding eQTL values for each tissue are then returned.
- Single Tissue sQTL
  - All gene variants collected from the genomics data are used to create GTEx request JSON files. The corresponding sQTL values for each tissue are then returned.
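
Creating the request files could be sketched as below: the collected variant IDs are deduplicated, split into batches and written to one JSON file per batch. The payload layout (`{"variant_ids": [...]}`), the file naming and the batch size are hypothetical and do not reproduce the GTEx request schema; the sketch only shows the batching and file-writing part and sends no requests.

```python
import json
import tempfile
from pathlib import Path


def chunked(items: list[str], size: int):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def write_request_files(variant_ids: list[str], out_dir, batch_size: int = 2) -> list[Path]:
    """Write one JSON request file per batch of unique variant IDs."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    unique_ids = sorted(set(variant_ids))  # drop duplicates, stable order
    paths = []
    for number, batch in enumerate(chunked(unique_ids, batch_size), start=1):
        path = out_dir / f"request_{number:04d}.json"
        # Hypothetical payload layout, not the real GTEx schema.
        path.write_text(json.dumps({"variant_ids": batch}, indent=2))
        paths.append(path)
    return paths


# Placeholder variant IDs, written to a temporary directory for the demo.
with tempfile.TemporaryDirectory() as tmp:
    files = write_request_files(
        ["variant_3", "variant_1", "variant_2", "variant_1"], tmp
    )
    file_names = [path.name for path in files]
    first_batch = json.loads(files[0].read_text())["variant_ids"]

print(file_names, first_batch)
```
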
- For both tissues and cell types, ontology IDs (UBERON for tissues, CL for cell types) are requested, and the corresponding data is saved in the tissue and cell type tables.
