xVERSE (X-Verse) is a transcriptomics-native foundation model designed to learn robust, batch-invariant biological representations and synthesize high-fidelity virtual cells. By coupling representation learning with probabilistic gene expression generation, xVERSE enables advanced downstream applications in single-cell and spatial transcriptomics.
## Key Features

- Universal Representation Learning: Extract biological embeddings (`z_bio`) that are robust to batch effects and noise.
- Spatial Gene Imputation: Accurately impute unmeasured genes in spatial transcriptomics data using single-cell references.
- Virtual Cell Synthesis: Generate realistic, high-fidelity virtual cells to augment small datasets or serve as a data-augmentation engine.
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/jichunxie/xVERSE_code.git
  cd xVERSE_code
  ```

- Create a virtual environment (recommended):

  ```bash
  conda create -n xverse
  conda activate xverse
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the model weights: download the pretrained checkpoint (`xVERSE_384.pth`) from our Hugging Face repository. After downloading, verify the path and pass it to the `--base_model` argument in the CLI. For example:

  ```bash
  # If you saved it to ./checkpoints/xVERSE_384.pth
  --base_model ./checkpoints/xVERSE_384.pth
  ```
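Before invoking the CLI, it can help to confirm the checkpoint path is valid. A minimal sketch (not part of the repository; the path below is illustrative and the check does not validate file contents):

```python
import os

def check_checkpoint(path):
    """Return True if the checkpoint file exists and is non-empty.

    A quick sanity check before passing the path to --base_model;
    it does not inspect or validate the weights themselves.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0

# Illustrative path; adjust to wherever you saved the weights.
print(check_checkpoint("./checkpoints/xVERSE_384.pth"))
```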
## Usage

xVERSE provides a unified CLI, `main.cli_xverse`, for all core tasks.
| Argument | Description | Required |
|---|---|---|
| `--input_dir` | Input directory or file path. | Yes |
| `--output_dir` | Output directory. | Yes |
| `--base_model` | Local path to the downloaded model checkpoint (`xVERSE_384.pth`). | Yes |
| `--task` | `embedding` or `generation` (see Outputs below). | Yes |
| `--tissue_name` | Tissue label (e.g., `liver`). | Yes |
| `--mode` | `0shot` (pretrained) or `ft` (fine-tune). | No (default: `0shot`) |
| `--gpu` | GPU ID (e.g., `0`). | No |
| `--num_samples_gen` | Number of Poisson samples to generate (generation task only). | No (default: 5) |
| `--epochs` | Number of fine-tuning epochs. | No (default: 20) |
## Outputs

The script generates `.h5ad` files in the `output_dir`.

**Embedding task**
- File: saves a copy of the input `.h5ad` to the `output_dir`.
- Content: `adata.obsm['xVerse']` holds the biological embedding matrix (`z_bio`) of size `(n_cells, 384)`. Use this for clustering, UMAP, and integration.
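As a quick illustration of downstream use, the sketch below runs a cosine nearest-neighbor query on a stand-in embedding matrix. In real use you would take `adata.obsm['xVerse']` from the output `.h5ad`; the random matrix here is a placeholder, not model output:

```python
import numpy as np

# Stand-in for the (n_cells, 384) matrix stored in adata.obsm['xVerse'];
# in practice, load it from the output .h5ad (e.g., via anndata.read_h5ad).
rng = np.random.default_rng(0)
z_bio = rng.normal(size=(100, 384))

# L2-normalize rows so dot products become cosine similarities.
z = z_bio / np.linalg.norm(z_bio, axis=1, keepdims=True)

# Nearest neighbor of cell 0 in embedding space (excluding itself).
sims = z @ z[0]
sims[0] = -np.inf
nearest = int(np.argmax(sims))
print(nearest)
```

The same matrix can be handed to standard toolkits (neighbors/UMAP/clustering) as a drop-in replacement for PCA coordinates.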
**Generation task**
- File: creates a new file `*_mu_bio.h5ad` in `output_dir`.
- Content:
  - `adata.X`: the denoised gene expression (`mu_bio`).
  - `adata.layers['mu_bio']`: same as `X`.
  - `adata.layers['sample_0']`, `sample_1`, ...: sparse count matrices sampled from `mu_bio`.
- Genes: strictly align with the order in `ensg_keys_high_quality.txt`.
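The layer naming and the `--num_samples_gen` description ("Poisson samples") suggest each `sample_k` layer is an independent Poisson draw with rate matrix `mu_bio`. A sketch under that assumption, with synthetic placeholder values:

```python
import numpy as np

# Synthetic stand-in for the denoised expression matrix (n_cells, n_genes);
# real values would come from adata.X / adata.layers['mu_bio'].
rng = np.random.default_rng(1)
mu_bio = rng.gamma(shape=2.0, scale=1.5, size=(50, 200))

# One Poisson count draw per requested sample, mirroring --num_samples_gen 5.
num_samples_gen = 5
layers = {f"sample_{k}": rng.poisson(mu_bio) for k in range(num_samples_gen)}

# Each draw has the same shape as mu_bio and non-negative integer counts.
print(layers["sample_0"].shape)
```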
## Examples

### 1. Zero-Shot Embedding Extraction

Extract biological embeddings (`z_bio`) using the pretrained model directly.
```bash
python -m main.cli_xverse \
  --input_dir ./data/liver_samples \
  --output_dir ./results/embeddings \
  --base_model /path/to/your/xVERSE_384.pth \
  --tissue_name liver \
  --mode 0shot \
  --task embedding
```

### 2. Zero-Shot Generation

Perform gene imputation or virtual cell synthesis using the pretrained model.
```bash
python -m main.cli_xverse \
  --input_dir ./data/liver_samples \
  --output_dir ./results/zeroshot_imputation \
  --base_model /path/to/your/xVERSE_384.pth \
  --tissue_name liver \
  --mode 0shot \
  --task generation \
  --num_samples_gen 5
```

### 3. Fine-Tuned Generation

Fine-tune on your specific dataset to generate denoised expression (`mu_bio`) or virtual cells.
```bash
python -m main.cli_xverse \
  --input_dir ./data/liver_samples \
  --output_dir ./results/imputation \
  --base_model /path/to/your/xVERSE_384.pth \
  --tissue_name liver \
  --mode ft \
  --task generation \
  --num_samples_gen 5
```

### 4. Fine-Tuned Embedding Extraction

Adapt the model to your specific dataset (e.g., to handle strong batch effects) before extracting embeddings.
```bash
python -m main.cli_xverse \
  --input_dir ./data/liver_samples \
  --output_dir ./results/ft_embeddings \
  --base_model /path/to/your/xVERSE_384.pth \
  --tissue_name liver \
  --mode ft \
  --task embedding \
  --epochs 20
```

## Repository Structure

```
xVERSE_code/
├── main/                        # Core xVERSE source code
│   ├── cli_xverse.py            # Main CLI entry point
│   ├── utils_model.py           # Model architecture definitions
│   └── ...
├── reproduce_manuscript/        # Scripts to reproduce paper figures
│   ├── fig1_overview/               # Figure 1: Model Overview
│   ├── fig2_biology_signal/         # Figure 2: Biological Signal & Benchmarking
│   ├── fig3_check_score_for_panel/  # Figure 3: Panel Analysis
│   ├── fig4_generate_single_cell/   # Figure 4: SC Generation
│   ├── fig5_imputation_spatial/     # Figure 5: Spatial Imputation
│   ├── fig6_small_sample/           # Figure 6: Small Sample Learning
│   └── fig7_cross_modality/         # Figure 7: Cross-Modality Prediction
├── bashfiles/                   # HPC Slurm/Bash scripts
├── requirements.txt             # Python dependencies
├── LICENSE                      # GNU GPL-3.0 License
└── README.md                    # Project documentation
```
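The `bashfiles/` directory holds the HPC Slurm/Bash scripts used by the authors. For local batch runs, a hypothetical wrapper like the following (not from the repository; paths and tissue names are illustrative) loops the CLI over several tissues:

```shell
#!/usr/bin/env bash
# Hypothetical batch wrapper: run zero-shot embedding extraction for
# several tissues. Set DRY_RUN=0 to actually execute the commands.
set -euo pipefail

DRY_RUN=${DRY_RUN:-1}
BASE_MODEL=./checkpoints/xVERSE_384.pth

for tissue in liver lung kidney; do
  # Build the CLI invocation as an argument array for safe quoting.
  args=(python -m main.cli_xverse
        --input_dir "./data/${tissue}_samples"
        --output_dir "./results/${tissue}_embeddings"
        --base_model "${BASE_MODEL}"
        --tissue_name "${tissue}"
        --mode 0shot
        --task embedding)
  if [ "${DRY_RUN}" -eq 1 ]; then
    echo "${args[*]}"   # preview the command without running it
  else
    "${args[@]}"
  fi
done
```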
## License

This project is open source under the GNU General Public License v3.0 (GPL-3.0); see the LICENSE file for details.

**Note (Commercial Use):** This software is free for non-commercial use. For commercial use, please contact the authors to obtain a separate license:
- Jichun Xie: jichun.xie@duke.edu
- Xiaohui Jiang: x.jiang@duke.edu
