xVERSE: Transcriptomics-Native Foundation Model

License: GPL v3 | Python 3.9+

xVERSE (X-Verse) is a transcriptomics-native foundation model designed to learn robust, batch-invariant biological representations and synthesize high-fidelity virtual cells. By coupling representation learning with probabilistic gene expression generation, xVERSE enables advanced downstream applications in single-cell and spatial transcriptomics.

🚀 Key Capabilities

  • Universal Representation Learning: Extract biological embeddings (z_bio) that are robust to batch effects and noise.
  • Spatial Gene Imputation: Accurately impute unmeasured genes in spatial transcriptomics data using single-cell references.
  • Virtual Cell Synthesis: Generate realistic, high-fidelity virtual cells to augment small datasets.

📦 Installation

  1. Clone the repository:

    git clone https://github.com/jichunxie/xVERSE_code.git
    cd xVERSE_code
  2. Create a virtual environment (Recommended):

    conda create -n xverse python=3.9
    conda activate xverse
  3. Install dependencies:

    pip install -r requirements.txt
  4. Download Model Weights: Download the pretrained model weights (xVERSE_384.pth) from our Hugging Face repository:

    🤗 Xie-Lab/xVERSE

    After downloading, note where the file is saved and pass that path to the --base_model argument of the CLI. For example:

    # If you saved it to ./checkpoints/xVERSE_384.pth
    --base_model ./checkpoints/xVERSE_384.pth
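
Before launching a long run, it can help to confirm that the checkpoint path actually points at a file. The following is a minimal sketch using only the Python standard library; the helper name resolve_checkpoint is hypothetical and is not part of xVERSE.

```python
from pathlib import Path


def resolve_checkpoint(path):
    """Return the absolute checkpoint path, or raise if it is unusable.

    Hypothetical helper (not part of xVERSE): it only checks that the
    file exists and carries the .pth extension used by xVERSE_384.pth.
    """
    ckpt = Path(path).expanduser().resolve()
    if not ckpt.is_file():
        raise FileNotFoundError(f"Checkpoint not found: {ckpt}")
    if ckpt.suffix != ".pth":
        raise ValueError(f"Expected a .pth file, got: {ckpt.name}")
    return ckpt


# Example (assumes the file was saved to ./checkpoints/):
# base_model = resolve_checkpoint("./checkpoints/xVERSE_384.pth")
```

The returned path can then be passed as the value of --base_model.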

🛠️ Usage

xVERSE provides a unified CLI, main.cli_xverse, for all core tasks.

1. Arguments

| Argument | Description | Required |
|---|---|---|
| --input_dir | Input directory or file path. | Yes |
| --output_dir | Output directory. | Yes |
| --base_model | Local path to the downloaded model checkpoint (xVERSE_384.pth). | Yes |
| --task | embedding or generation (see Output Details below). | Yes |
| --tissue_name | Tissue label (e.g., 'liver'). | Yes |
| --mode | 0shot (Pretrained) or ft (Fine-tune). | No (Default: 0shot) |
| --gpu | GPU ID (e.g., 0). | No |
| --num_samples_gen | Number of Poisson samples to generate (Generation task only). | No (Default: 5) |
| --epochs | Number of fine-tuning epochs. | No (Default: 20) |
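
When the CLI is driven from a script (for example, to loop over several tissues), the documented flags can be assembled programmatically. The sketch below is illustrative only: build_xverse_command is a hypothetical helper, not part of xVERSE, and it uses only the flags and allowed values listed above. Optional flags are omitted when unset so that the CLI defaults apply.

```python
import shlex
import sys


def build_xverse_command(input_dir, output_dir, base_model, task, tissue_name,
                         mode="0shot", gpu=None, num_samples_gen=None, epochs=None):
    """Assemble an argument list for `python -m main.cli_xverse`.

    Hypothetical helper: it validates --task and --mode against the
    documented values and appends optional flags only when they are set.
    """
    if task not in {"embedding", "generation"}:
        raise ValueError(f"--task must be 'embedding' or 'generation', got {task!r}")
    if mode not in {"0shot", "ft"}:
        raise ValueError(f"--mode must be '0shot' or 'ft', got {mode!r}")

    cmd = [
        sys.executable, "-m", "main.cli_xverse",
        "--input_dir", str(input_dir),
        "--output_dir", str(output_dir),
        "--base_model", str(base_model),
        "--tissue_name", str(tissue_name),
        "--mode", mode,
        "--task", task,
    ]
    if gpu is not None:
        cmd += ["--gpu", str(gpu)]
    if num_samples_gen is not None:
        cmd += ["--num_samples_gen", str(num_samples_gen)]
    if epochs is not None:
        cmd += ["--epochs", str(epochs)]
    return cmd


# Paths below are placeholders taken from the examples in this README.
cmd = build_xverse_command(
    input_dir="./data/liver_samples",
    output_dir="./results/embeddings",
    base_model="./checkpoints/xVERSE_384.pth",
    task="embedding",
    tissue_name="liver",
)
print(shlex.join(cmd))  # shell-quoted command line, for logging or copy-paste
```

The resulting list can be handed to subprocess.run(cmd, check=True); because the CLI is invoked as the module main.cli_xverse, the working directory would presumably need to be the repository root.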

2. Output Details

The script generates .h5ad files in the output_dir.

Task: embedding

  • File: Saves a copy of the input .h5ad to the output_dir.
  • Content:
    • adata.obsm['xVerse']: The biological embedding matrix (z_bio), size (n_cells, 384). Use this for clustering, UMAP, and integration.
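
The embedding can be treated like any other cells-by-features matrix. As a rough illustration of downstream use, the sketch below finds each cell's nearest neighbours by cosine similarity using NumPy only. The random array is a stand-in for adata.obsm['xVerse'], and cosine_knn is a hypothetical helper, not an xVERSE function.

```python
import numpy as np


def cosine_knn(z, k=5):
    """Return, for each row of z, the indices of its k most similar rows
    under cosine similarity (the row itself is excluded)."""
    z = np.asarray(z, dtype=float)
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    unit = z / np.clip(norms, 1e-12, None)   # guard against zero vectors
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)           # exclude self-matches
    return np.argsort(-sim, axis=1)[:, :k]   # most similar first


# Stand-in for adata.obsm['xVerse']: 100 cells, 384 dimensions.
rng = np.random.default_rng(0)
z_bio = rng.normal(size=(100, 384))
neighbors = cosine_knn(z_bio, k=5)           # shape (100, 5)
```

With Scanpy installed, sc.pp.neighbors(adata, use_rep='xVerse') followed by sc.tl.umap(adata) is the more usual route for UMAP and clustering on an embedding stored in obsm.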

Task: generation

  • File: Creates a new file *_mu_bio.h5ad in output_dir.
  • Content:
    • adata.X: The denoised gene expression (mu_bio).
    • adata.layers['mu_bio']: Same as X.
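    • adata.layers['sample_0'], adata.layers['sample_1'], ...: Sparse count matrices sampled from mu_bio, one layer per Poisson sample (the number of layers is set by --num_samples_gen).
    • Genes: The gene order strictly follows ensg_keys_high_quality.txt.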
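
To make the relationship between mu_bio and the sample_* layers concrete, the sketch below draws Poisson counts whose mean is a small, made-up mu_bio matrix. It is an illustration under stated assumptions, not the xVERSE implementation: sample_counts is a hypothetical helper, the values are invented, and dense NumPy arrays are used here even though the real layers are sparse.

```python
import numpy as np


def sample_counts(mu_bio, num_samples=5, seed=0):
    """Draw independent Poisson count matrices with mean mu_bio.

    Returns a dict keyed 'sample_0', 'sample_1', ... to mirror the layer
    names described above. Hypothetical helper, not part of xVERSE.
    """
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu_bio, dtype=float)
    return {f"sample_{i}": rng.poisson(mu) for i in range(num_samples)}


# Made-up denoised expression for 2 cells x 3 genes.
mu_bio = np.array([[0.0, 1.5, 10.0],
                   [0.2, 0.0, 3.0]])
layers = sample_counts(mu_bio, num_samples=5)

for name, counts in layers.items():
    print(name, counts.tolist())
```

Each draw is an integer matrix of the same shape as mu_bio, and entries whose mean is exactly zero are always sampled as zero, so averaging many such draws would approach mu_bio.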

3. Examples

Scenario A: Zero-Shot Embedding Extraction

Extract biological embeddings (z_bio) using the pretrained model directly.

python -m main.cli_xverse \
    --input_dir ./data/liver_samples \
    --output_dir ./results/embeddings \
    --base_model /path/to/your/xVERSE_384.pth \
    --tissue_name liver \
    --mode 0shot \
    --task embedding

Scenario B: Zero-Shot Generation / Imputation

Perform gene imputation or virtual cell synthesis using the pretrained model.

python -m main.cli_xverse \
    --input_dir ./data/liver_samples \
    --output_dir ./results/zeroshot_imputation \
    --base_model /path/to/your/xVERSE_384.pth \
    --tissue_name liver \
    --mode 0shot \
    --task generation \
    --num_samples_gen 5

Scenario C: Fine-Tuning & Imputation

Fine-tune on your specific dataset to generate denoised expression (mu_bio) or virtual cells.

python -m main.cli_xverse \
    --input_dir ./data/liver_samples \
    --output_dir ./results/imputation \
    --base_model /path/to/your/xVERSE_384.pth \
    --tissue_name liver \
    --mode ft \
    --task generation \
    --num_samples_gen 5

Scenario D: Fine-Tuning followed by Embedding Extraction

Adapt the model to your specific dataset (e.g., to handle strong batch effects) before extracting embeddings.

python -m main.cli_xverse \
    --input_dir ./data/liver_samples \
    --output_dir ./results/ft_embeddings \
    --base_model /path/to/your/xVERSE_384.pth \
    --tissue_name liver \
    --mode ft \
    --task embedding \
    --epochs 20

📂 Repository Structure

xVERSE_code/
├── main/                           # Core xVERSE source code
│   ├── cli_xverse.py               # Main CLI entry point
│   ├── utils_model.py              # Model architecture definitions
│   └── ...
├── reproduce_manuscript/           # Scripts to reproduce paper figures
│   ├── fig1_overview/              # Figure 1: Model Overview
│   ├── fig2_biology_signal/        # Figure 2: Biological Signal & Benchmarking
│   ├── fig3_check_score_for_panel/ # Figure 3: Panel Analysis
│   ├── fig4_generate_single_cell/  # Figure 4: SC Generation
│   ├── fig5_imputation_spatial/    # Figure 5: Spatial Imputation
│   ├── fig6_small_sample/          # Figure 6: Small Sample Learning
│   └── fig7_cross_modality/        # Figure 7: Cross-Modality Prediction
├── bashfiles/                      # HPC Slurm/Bash scripts
├── requirements.txt                # Python dependencies
├── LICENSE                         # GNU GPL-3.0 License
└── README.md                       # Project documentation

⚖️ License

This project is open source under the GNU General Public License v3.0 (GPL-3.0) - see the LICENSE file for details.

Note

Commercial Use: This software is free for non-commercial use. For commercial use, please contact the authors to obtain a separate license:
