xGATE is a topology-aware pathway scoring framework that reads pathway activity from the topological fingerprints of gene co-expression graphs in single-cell RNA-seq (scRNA-seq) and spatial transcriptomics data. It combines graph theory, deep learning, and hypothesis testing to produce accurate pathway activity estimates that are robust to batch effects.
- Biologically-Informed & Flexible: Integrates established pathway databases (KEGG, Reactome, WikiPathways) while supporting custom gene sets tailored to your research questions
- High Accuracy: Leverages graph topology to capture biologically relevant co-expression patterns, delivering superior pathway activity identification across diverse cell types
- Batch-Effect Resilient: Relies on co-expression relationships rather than absolute expression levels, making it robust to batch effects common in multi-sample scRNA-seq studies
- Topologically Aware: Uses graph embeddings that capture meaningful network structures rather than treating pathways as unordered lists of genes
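The intuition behind the batch-effect claim can be illustrated with a small sketch. It uses plain Pearson correlation on simulated data and models a batch effect as a per-gene positive scale and offset; both are simplifying assumptions for illustration and are not xGATE's actual co-expression estimator or a realistic batch model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 cells, two genes whose expression is correlated.
gene_a = rng.normal(loc=5.0, scale=1.0, size=200)
gene_b = 0.8 * gene_a + rng.normal(scale=0.5, size=200)

# Hypothetical batch effect: each gene gets its own positive scale and offset.
gene_a_batch = 2.5 * gene_a + 10.0
gene_b_batch = 0.4 * gene_b - 3.0

r_original = np.corrcoef(gene_a, gene_b)[0, 1]
r_batch = np.corrcoef(gene_a_batch, gene_b_batch)[0, 1]

# Absolute expression moves a lot, while the correlation stays the same.
mean_shift = abs(gene_a_batch.mean() - gene_a.mean())
print(f"mean shift in gene A: {mean_shift:.2f}")
print(f"correlation before: {r_original:.4f}, after: {r_batch:.4f}")
```

Pearson correlation is invariant to positive affine rescaling of either variable, which is why the two correlations agree here. Real batch effects can be more complex than this.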
Input: Count matrix + Gene set/pathway of interest → Output: Pathway p-value + Effect size
- Network Construction: Builds a gene co-expression graph from your expression data
- Pathway Extraction: Identifies the pathway subgraph within the co-expression network
- Graph Embeddings: Derives embeddings capturing topology features (centrality, walk properties, structural metrics)
- Null Distribution: Uses a graph-based Variational Autoencoder (VAE) to establish a null distribution
- Hypothesis Testing: Performs statistical testing to assess pathway significance and compute effect sizes
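The last two steps compare an observed pathway statistic against a null distribution. The sketch below shows one generic way to turn such a comparison into a z-score and an empirical p-value. The function name `empirical_test` is hypothetical, the null samples are simulated rather than produced by a VAE, and xGATE's own test statistic may differ.

```python
import random
import statistics

def empirical_test(observed, null_samples):
    """Return (z_score, p_value) of `observed` against a sampled null."""
    mu = statistics.mean(null_samples)
    sigma = statistics.stdev(null_samples)
    z = (observed - mu) / sigma
    # One-sided empirical p-value; the add-one correction keeps it above 0.
    exceed = sum(1 for s in null_samples if s >= observed)
    p = (exceed + 1) / (len(null_samples) + 1)
    return z, p

# Simulated null of 100 values (stand-in for statistics of random pathways).
random.seed(0)
null = [random.gauss(0.0, 1.0) for _ in range(100)]

z, p = empirical_test(5.0, null)
print(f"z-score: {z:.2f}, empirical p-value: {p:.4f}")
```

With a null of size 100, the smallest attainable empirical p-value is 1/101, so a larger null gives finer p-value resolution at the cost of more computation.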
- Python ≥ 3.8
- A working count matrix from scRNA-seq or spatial transcriptomics (preferably normalized via scTransform or similar)
# Clone the repository
git clone https://github.com/jichunxie/xGATE.git
cd xGATE
# Install dependencies
pip install -r requirements.txt

All required packages are specified in requirements.txt. The installation includes PyTorch (for VAE computations), networkx/igraph (for graph operations), karateclub (for graph features), and biological tools for pathway integration.
from utilities import (create_sifinet_object, quantile_thres2, cal_coexp,
create_network, filter_lowexp, convert_gene_ids,
create_network_from_adj_matrix, get_categorized_pathways,
analyze_pathways, embedding_recon)
import pandas as pd
import numpy as np
# 1. Load and preprocess your data
count_matrix = pd.read_csv("your_count_matrix.csv", index_col=0)
# Filter genes (rows): keep genes detected in more than 5% of cells
count_matrix = count_matrix.loc[(count_matrix > 0).sum(axis=1) > 0.05 * count_matrix.shape[1], :]
# Filter cells (columns): keep cells with more than 5% of the remaining genes detected
count_matrix = count_matrix.loc[:, (count_matrix > 0).sum(axis=0) > 0.05 * count_matrix.shape[0]]
# Remove genes whose mean expression is exactly 0 or exactly 1
row_means = np.mean(count_matrix, axis=1)
count_matrix = count_matrix[~((row_means == 0) | (row_means == 1))]
# 2. Create co-expression network
df = pd.DataFrame(count_matrix)
so = create_sifinet_object(df, rowfeature=True) # Create data object
so = quantile_thres2(so) # Calculate thresholds
so = cal_coexp(so, X=so.data_thres['dt'], X_full=so.data_thres['dt']) # Compute co-expression
so = create_network(so, alpha=0.05, manual=False, least_edge_prop=0.01) # Build network
so = filter_lowexp(so, t1=10, t2=0.9, t3=0.9) # Filter low-expression edges
# 3. Prepare gene ID mapping and network
adj_matrix = pd.DataFrame(np.where(
np.abs(so.coexp - so.est_ms['mean']) > so.thres,
np.abs(so.coexp),
0
))
adj_matrix.index = df.index
adj_matrix.columns = df.index
adj_matrix = convert_gene_ids(adj_matrix, "ensembl") # Convert to Entrez IDs
# 4. Create igraph network and load pathways
G = create_network_from_adj_matrix(adj_matrix)
categorized_pathways = get_categorized_pathways() # Fetch KEGG pathways
# 5. Analyze pathways of interest
test_pathways = ["Cellular senescence", "Cell cycle", "Apoptosis"]
results = analyze_pathways(G, test_pathways, categorized_pathways,
num_walks=200, max_walk_length=200)
# Results include p-values and effect sizes for each pathway
print(results)

The analyze_pathways() function returns a DataFrame with columns:
- pathway: Name of the pathway
- p-value: Statistical significance of pathway activity
- z-score: Magnitude of pathway activity (normalized reconstruction error)
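As a usage sketch, the returned table can be filtered and ranked with ordinary pandas operations. The DataFrame below is a mock with the documented column names; its values are invented for illustration and are not real xGATE output, and the 0.05 cutoff is only an example.

```python
import pandas as pd

# Mock results with the documented columns; values are made up.
results = pd.DataFrame({
    "pathway": ["Cellular senescence", "Cell cycle", "Apoptosis"],
    "p-value": [0.20, 0.001, 0.03],
    "z-score": [0.9, 3.4, 2.1],
})

# Keep pathways below the chosen significance cutoff, most significant first.
significant = (
    results[results["p-value"] < 0.05]
    .sort_values("p-value")
    .reset_index(drop=True)
)
print(significant)
```

Because the column names contain hyphens, they need bracket access (`results["p-value"]`) rather than attribute access.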
Instead of predefined KEGG pathways, you can analyze any custom gene set:
# Define your custom pathway
custom_pathway = {
"Your_Pathway_Name": ["GENE1", "GENE2", "GENE3", "GENE4"]
}
# Analyze single custom pathway
results = embedding_recon(G, categorized_pathways,
custom_pathway["Your_Pathway_Name"],
num_walks=200, max_walk_length=200,
null_dist_size=100)

Provides core data filtering, matrix thresholding, and normalization functions required to prepare raw single-cell transcriptomic count matrices for co-expression analysis.
Handles network construction from processed adjacency matrices, implements statistical thresholds for edge significance, and serves as the main entry point for testing against established pathway databases (KEGG, Reactome).
Extracts graph topological metrics. It calculates metrics such as harmonic centrality, eccentricity, and longest random walk lengths to build node-level structural features that inform the pathway topology.
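To make these metrics concrete, here is a dependency-free sketch that computes them on a toy graph. The function names are hypothetical and are not xGATE's API. In particular, the "longest random walk" is interpreted here as the longest of several random walks that never revisit a node, which is an assumption about what the metric means.

```python
import random
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path hop counts from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def harmonic_centrality(adj, node):
    """Sum of 1/distance to every other reachable node."""
    return sum(1.0 / d for n, d in bfs_distances(adj, node).items() if n != node)

def eccentricity(adj, node):
    """Largest shortest-path distance from `node` within its component."""
    return max(bfs_distances(adj, node).values())

def longest_random_walk(adj, start, num_walks, max_walk_length, rng):
    """Longest of `num_walks` non-revisiting random walks, in edges traversed."""
    best = 0
    for _ in range(num_walks):
        visited, current, steps = {start}, start, 0
        while steps < max_walk_length:
            options = [nb for nb in adj[current] if nb not in visited]
            if not options:
                break  # dead end: every neighbour was already visited
            current = rng.choice(options)
            visited.add(current)
            steps += 1
        best = max(best, steps)
    return best

# Toy undirected path graph A - B - C - D stored as adjacency lists.
adj = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
rng = random.Random(0)
features = {
    n: (harmonic_centrality(adj, n), eccentricity(adj, n),
        longest_random_walk(adj, n, num_walks=20, max_walk_length=10, rng=rng))
    for n in adj
}
print(features)
```

Each node ends up with a small feature vector. Stacking these vectors over the pathway's nodes gives one possible node-level structural description of the subgraph.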
Performs competitive testing between target and alternate pathways. Uses a Variational Autoencoder to generate null distributions and compute competitive T-statistics and p-values for pathway significance.
Implements the underlying Variational Autoencoder (VAE) architecture in PyTorch. Used to generate synthetic background distributions for graph features to evaluate the statistical significance of observed pathway structures.
- Rows: Genes (or use rowfeature=True to transpose)
- Columns: Cells/samples
- Values: Raw or normalized counts
- Index/Columns: Gene identifiers (Ensembl or Gene Symbol recommended)
Note: Input data should be pre-normalized using appropriate methods (scTransform, log-CPM, etc.). xGATE works with the processed matrix without additional normalization.
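A quick sanity check before network construction can catch common formatting problems. The helper below is a hypothetical sketch, not part of xGATE. It assumes a genes-by-cells pandas DataFrame and checks only for duplicate identifiers and missing or non-numeric values. It deliberately does not reject negative values, since some normalizations can produce them.

```python
import numpy as np
import pandas as pd

def check_count_matrix(df):
    """Return a list of problems found in a genes-by-cells matrix (empty if none)."""
    problems = []
    if not df.index.is_unique:
        problems.append("duplicate gene identifiers in the row index")
    if not df.columns.is_unique:
        problems.append("duplicate cell identifiers in the columns")
    # Coerce every column to numeric; anything unparseable becomes NaN.
    numeric = df.apply(pd.to_numeric, errors="coerce")
    if numeric.isna().any().any():
        problems.append("missing or non-numeric values")
    return problems

# Toy matrix with placeholder IDs, one duplicated gene and one missing value.
toy = pd.DataFrame(
    [[0, 3, 1], [2, 0, 0], [1, 1, np.nan]],
    index=["ENSG01", "ENSG02", "ENSG02"],
    columns=["cell_1", "cell_2", "cell_3"],
)
problems = check_count_matrix(toy)
print(problems)
```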
xGATE supports multiple gene ID formats and can convert between:
- Ensembl IDs
- Gene Symbols
- Entrez Gene IDs
- HGNC symbols
Use convert_gene_ids() to map identifiers before network construction.
- alpha: Significance level for co-expression threshold (default: 0.05)
- least_edge_prop: Minimum edge weight proportion (default: 0.01)
- manual: Whether to manually specify threshold (default: False)
- num_walks: Number of random walks per node (default: 200)
- max_walk_length: Maximum walk length (default: 200)
- null_dist_size: Number of random pathways for null distribution (default: 100)
For a complete worked example with sample data, see xGATE_Tutorial.ipynb in this repository.
Memory Error with Large Graphs
- Reduce the null_dist_size parameter in pathway analysis
- Filter genes more stringently before network construction
No Pathways Found
- Verify gene IDs match between count matrix and pathway database
- Check that genes in your pathways are present in the co-expression network
- Use get_genes_in_pathway() to debug specific pathway membership
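One way to run the checks above is to compare the pathway's gene IDs with the network's node IDs directly. The helper below is a hypothetical sketch that works on any two collections of identifiers and does not call xGATE. It casts IDs to strings because a mismatch between integer and string Entrez IDs is a common cause of an empty overlap.

```python
def pathway_overlap(pathway_genes, network_genes):
    """Return (present, missing, coverage) for a pathway against network nodes."""
    # Normalise to strings so 1017 and "1017" are treated as the same ID.
    pathway = {str(g) for g in pathway_genes}
    network = {str(g) for g in network_genes}
    present = sorted(pathway & network)
    missing = sorted(pathway - network)
    coverage = len(present) / len(pathway) if pathway else 0.0
    return present, missing, coverage

# Example Entrez-style IDs; integers on one side, strings on the other.
pathway_ids = [1017, 1019, 7157]
network_ids = ["1017", "7157", "672"]

present, missing, coverage = pathway_overlap(pathway_ids, network_ids)
print(f"present: {present}, missing: {missing}, coverage: {coverage:.2f}")
```

A coverage near zero usually points to an ID-format mismatch, while partial coverage suggests that some pathway genes were removed during filtering.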
Gene ID Conversion Fails
- Ensure input IDs are in a supported format (Ensembl, Symbol, Entrez)
- Check for typos or outdated gene names
- Use MyGene.info API directly to troubleshoot mappings
- Network size: Works efficiently with 5,000-25,000 genes
- Cell count: Scales well up to 100,000+ cells
- Runtime: Typically 5-30 minutes depending on data size and parameters
- Memory: ~8-16 GB RAM recommended for large datasets
If you use xGATE in your research, please cite:
[Citation information to be added upon publication]
This project is licensed under CC BY-NC-ND 4.0 - see LICENSE.md for details.
For questions, issues, or feature requests, please open an issue on GitHub or contact the maintainers.
- KEGG Pathway Database: https://www.kegg.jp/
- Reactome: https://reactome.org/
- WikiPathways: https://www.wikipathways.org/
- PyTorch: https://pytorch.org/
- igraph: https://igraph.org/
