This repository includes code and a pre-trained model of scHiGex for single-cell gene expression prediction.
The code was tested on Python 3.10.4. The conda environment is shared via env/environment.yml; a separate environment for dnabert2 is shared via env/environment_dnabert2.yml.
The dataset used for training is from the HiRES experiment. The dataset is available at https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE223917.
Files to be placed in the assets directory are as follows:
- assets/
  - gencode.vM23.annotation.gtf
  - mm10.fa
  - mm10_100_segments_browser.bed
  - rna_umicount.tsv - embryo
  - metadata.xlsx - embryo
  - pairs/ (only for training)
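Before running anything, it can help to confirm that the assets directory is complete. The following is a minimal sketch, not part of the repository: it assumes the file names are exactly those listed above (reading "- embryo" as an annotation rather than part of the name), and the helper `missing_assets` is a hypothetical name introduced only for illustration.

```python
from pathlib import Path

# File names copied from the list above (assumed to be the exact names on disk).
REQUIRED_FILES = [
    "gencode.vM23.annotation.gtf",
    "mm10.fa",
    "mm10_100_segments_browser.bed",
    "rna_umicount.tsv",
    "metadata.xlsx",
]


def missing_assets(assets_dir, training=False):
    """Return the expected entries that are absent from assets_dir.

    Hypothetical helper: the pairs/ subdirectory is only checked when
    training=True, since it is listed as required for training only.
    """
    root = Path(assets_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if training and not (root / "pairs").is_dir():
        missing.append("pairs/")
    return missing


if __name__ == "__main__":
    gaps = missing_assets("assets", training=True)
    print("All assets found." if not gaps else f"Missing: {', '.join(gaps)}")
```

For prediction only, calling `missing_assets("assets")` skips the check for the pairs directory.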
To train the scHiGex model from scratch for mm10,
- Download and place the required files in the `assets` directory.
- Run the python scripts inside the `scripts` directory in the order of the numbers prefixed to the file names. These scripts will generate the required data files for training the model.
- Run `./train.sh` to train the model.
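Running the numbered scripts in order is easy to get wrong with a plain alphabetical sort (for example, a prefix of 10 would sort before 2). The sketch below shows one way to order files by their numeric prefix. It assumes prefixes look like `1.1_` or `2.`, as in the script names mentioned in this README; `prefix_key`, `ordered_scripts`, and `run_in_order` are hypothetical helpers, not functions shipped with scHiGex.

```python
import re
import subprocess
import sys
from pathlib import Path


def prefix_key(filename):
    """Turn a leading prefix such as '1.2_' into a tuple of ints, e.g. (1, 2)."""
    match = re.match(r"(\d+(?:\.\d+)*)", filename)
    if match is None:
        raise ValueError(f"no numeric prefix in {filename!r}")
    return tuple(int(part) for part in match.group(1).split("."))


def ordered_scripts(scripts_dir):
    """List the .py files in scripts_dir that start with a digit, in prefix order."""
    names = [p.name for p in Path(scripts_dir).glob("*.py") if re.match(r"\d", p.name)]
    return sorted(names, key=prefix_key)


def run_in_order(scripts_dir):
    """Run each numbered script with the current interpreter, stopping on failure."""
    for name in ordered_scripts(scripts_dir):
        print(f"Running {name}")
        subprocess.run([sys.executable, name], cwd=scripts_dir, check=True)
```

Calling `run_in_order("scripts")` would then execute the scripts one after another; whether every numbered script should be run unattended is something to confirm against the scripts themselves.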
To predict gene expression levels using the trained model for mm10,
- Download and place the required files in the `assets` directory (apart from the pairs files, since no training is involved).
- Run the following python scripts inside the `scripts` directory (the goal is to create the chromosome definitions inside the `scripts` directory):
  - `1.1_run_gtfparse.py`
  - `1.2_generate_metadata.py`
- Place the .pairs files in the `predict` directory:
  - Put the group of Hi-C .pairs files whose gene expression you want to predict inside the directory `predict/pairs/`. At least 20 .pairs files per cell type are required to create the meta-cell.
  - Example:
    - predict/pairs/
      - cell_type_1/
        - cell_type_1_1.pairs
        - cell_type_1_2.pairs
        - ...
      - cell_type_2/
        - cell_type_2_1.pairs
        - cell_type_2_2.pairs
        - ...
      - ...
- Run `python 1.data_prep.py` to generate the required data files for prediction.
- Run `python 2.predict.py` to predict gene expression levels.
- The predicted gene expression levels will be saved in the `predict` directory under the file name `predictions.csv`.
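Since each cell type needs at least 20 .pairs files to form a meta-cell, a quick count before data preparation can catch an incomplete layout early. This is a minimal sketch under stated assumptions: one subdirectory per cell type under `predict/pairs/`, and uncompressed files ending in `.pairs` as in the example above. The helpers `count_pairs_files` and `underpopulated` are hypothetical and not part of the repository.

```python
from pathlib import Path

MIN_FILES_PER_CELL_TYPE = 20  # threshold stated in the prediction steps


def count_pairs_files(pairs_root):
    """Map each cell-type subdirectory name to its number of *.pairs files."""
    root = Path(pairs_root)
    return {
        sub.name: sum(1 for _ in sub.glob("*.pairs"))
        for sub in sorted(root.iterdir())
        if sub.is_dir()
    }


def underpopulated(pairs_root, minimum=MIN_FILES_PER_CELL_TYPE):
    """Return the cell types that have fewer than `minimum` .pairs files."""
    counts = count_pairs_files(pairs_root)
    return {name: n for name, n in counts.items() if n < minimum}
```

For example, `underpopulated("predict/pairs")` returns an empty dictionary when every cell type meets the threshold. Compressed files (such as `.pairs.gz`) would not be counted by this sketch.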
If you want to use your own model trained with the scHiGex architecture, you need to point to the correct model file and node_embeddings.
The scripts were designed to be compatible with the HiRES data used in the experiment. The code can be easily modified to suit the user's purpose.
Bishal Shrestha, Andrew Jordan Siciliano, Hao Zhu, Tong Liu, Zheng Wang, scHiGex: predicting single-cell gene expression based on single-cell Hi-C data, NAR Genomics and Bioinformatics, Volume 7, Issue 1, March 2025, lqaf002, 10.1093/nargab/lqaf002
