The preprocessing module provides the foundational tools for loading, cleaning, and preparing single-cell RNA-seq data for analysis.
The SingleCellData class is the central data structure for single-cell analysis, providing an AnnData-like interface:
class SingleCellData:
"""
Container for single-cell expression data and metadata.
Attributes:
X: Expression matrix (cells × genes), can be sparse or dense
obs: Cell metadata (DataFrame with cell_id as index)
var: Gene metadata (DataFrame with gene_id as index)
obsm: Multi-dimensional cell annotations (dict of arrays)
varm: Multi-dimensional gene annotations (dict of arrays)
uns: Unstructured annotations (dict)
n_obs: Number of cells
n_vars: Number of genes
"""from metainformant.singlecell.preprocessing import SingleCellData
import numpy as np
import pandas as pd
# Create from arrays
X = np.random.poisson(2, (100, 1000)) # 100 cells, 1000 genes
obs = pd.DataFrame({'cell_type': ['TypeA'] * 50 + ['TypeB'] * 50})
var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(1000)]})
data = SingleCellData(X=X, obs=obs, var=var)Load count matrices from various file formats:
from metainformant.singlecell.preprocessing import load_count_matrix
# From CSV (genes as rows, cells as columns)
data = load_count_matrix("counts.csv", format="csv")
# From TSV with transposition (cells as rows, genes as columns)
data = load_count_matrix("counts.tsv", format="tsv", transpose=True)
# From Matrix Market format
data = load_count_matrix("matrix.mtx", format="mtx")
# From HDF5 (requires h5py)
data = load_count_matrix("counts.h5", format="h5", h5_key="matrix")Parameters:
file_path: Path to the count matrix fileformat: File format ("csv", "tsv", "mtx", "h5")transpose: Whether to transpose the matrix (default: False)h5_key: HDF5 dataset key (for H5 format)sep: Delimiter for CSV/TSV files
Returns: SingleCellData object with loaded expression matrix
Calculate standard quality control metrics for cells:
from metainformant.singlecell.preprocessing import calculate_qc_metrics
# Calculate QC metrics
data = calculate_qc_metrics(data)
# Access calculated metrics
print(data.obs.columns) # ['total_counts', 'n_genes', 'pct_mt', 'pct_ribo']Calculated metrics:
total_counts: Total UMI/read counts per celln_genes: Number of detected genes per cellpct_mt: Percentage of mitochondrial gene expressionpct_ribo: Percentage of ribosomal gene expression
Mitochondrial genes: Identified by gene names starting with "MT-", "mt-", or "Mt-" Ribosomal genes: Identified by gene names starting with "RPS", "RPL", "rps", or "rpl"
Filter cells based on QC metrics:
from metainformant.singlecell.preprocessing import filter_cells
# Filter cells with basic thresholds
data_filtered = filter_cells(
data,
min_genes=200, # Minimum genes detected
max_genes=5000, # Maximum genes detected
min_counts=1000, # Minimum total counts
max_counts=50000, # Maximum total counts
max_pct_mt=20, # Maximum mitochondrial percentage
max_pct_ribo=None # Maximum ribosomal percentage (optional)
)Parameters:
min_genes: Minimum number of genes per cellmax_genes: Maximum number of genes per cellmin_counts: Minimum total counts per cellmax_counts: Maximum total counts per cellmax_pct_mt: Maximum mitochondrial gene percentagemax_pct_ribo: Maximum ribosomal gene percentage
Filter genes based on detection across cells:
from metainformant.singlecell.preprocessing import filter_genes
# Filter rarely expressed genes
data_filtered = filter_genes(
data,
min_cells=3, # Gene must be expressed in at least 3 cells
min_counts=None # Optional minimum total counts per gene
)Normalize count data to account for differences in sequencing depth:
from metainformant.singlecell.preprocessing import normalize_counts
# Total count normalization (most common)
data_norm = normalize_counts(
data,
target_sum=1e4, # Scale to 10,000 counts per cell
method='total_count' # Normalization method
)
# Median normalization
data_norm = normalize_counts(data, method='median')
# Custom scaling factors
scaling_factors = data.obs['total_counts'].median() / data.obs['total_counts']
data_norm = normalize_counts(data, scaling_factors=scaling_factors)Methods:
'total_count': Scale each cell totarget_sumtotal counts'median': Scale each cell to the median total count across all cells- Custom: Provide scaling factors directly
Apply log transformation to normalized counts:
from metainformant.singlecell.preprocessing import log_transform
# Log2 transformation with pseudocount
data_log = log_transform(data, base=2, pseudocount=1)
# Natural log transformation
data_log = log_transform(data, base='e')
# Log10 transformation
data_log = log_transform(data, base=10)Parameters:
base: Logarithm base (2, 10, 'e')pseudocount: Value added before log transformation
Z-score standardization of expression data:
from metainformant.singlecell.preprocessing import scale_data
# Standard z-score scaling (mean=0, std=1)
data_scaled = scale_data(data, center=True, scale=True)
# Center only (mean=0)
data_centered = scale_data(data, center=True, scale=False)
# Scale to specific genes only (e.g., highly variable genes)
hvg_indices = data.var['highly_variable'].values
data_scaled = scale_data(data, gene_subset=hvg_indices)Parameters:
center: Whether to center data (subtract mean)scale: Whether to scale data (divide by standard deviation)gene_subset: Boolean array to select specific genesmax_value: Clip scaled values to this maximum (default: 10)
Here's a typical preprocessing workflow:
from metainformant.singlecell.preprocessing import (
load_count_matrix, calculate_qc_metrics, filter_cells,
filter_genes, normalize_counts, log_transform, scale_data
)
# 1. Load data
data = load_count_matrix("raw_counts.csv", transpose=True)
print(f"Loaded: {data.n_obs} cells × {data.n_vars} genes")
# 2. Calculate QC metrics
data = calculate_qc_metrics(data)
# 3. Filter cells and genes
print("Before filtering:")
print(f" Cells: {data.n_obs}")
print(f" Genes: {data.n_vars}")
data = filter_cells(data, min_genes=200, max_genes=5000, max_pct_mt=20)
data = filter_genes(data, min_cells=3)
print("After filtering:")
print(f" Cells: {data.n_obs}")
print(f" Genes: {data.n_vars}")
# 4. Normalize and transform
data = normalize_counts(data, target_sum=1e4, method='total_count')
data = log_transform(data, base=2)
# 5. Scale data (typically done after HVG selection)
data = scale_data(data)
print("Preprocessing complete!")- CSV/TSV files: Standard text files with headers
- Matrix Market:
.mtxfiles (sparse format) - HDF5: Binary format for large datasets
- Gzip compressed: Automatic detection and decompression
- Genes as rows, cells as columns (default)
- Cells as rows, genes as columns (use
transpose=True) - Gene names: Should be in first column or as row indices
- Cell names: Should be in first row or as column headers
The module automatically handles sparse matrices to reduce memory usage:
import scipy.sparse as sparse
# Create sparse matrix
X_sparse = sparse.csr_matrix(X_dense)
data = SingleCellData(X=X_sparse, obs=obs, var=var)
# Check if matrix is sparse
print(f"Matrix is sparse: {sparse.issparse(data.X)}")- Filtering operations preserve sparsity
- Normalization can work in-place to save memory
- Large datasets automatically use sparse representations
Common issues and solutions:
try:
data = load_count_matrix("counts.csv")
except FileNotFoundError:
print("Count matrix file not found")
except ValueError as e:
print(f"Invalid file format: {e}")# Check for empty results after filtering
data_filtered = filter_cells(data, min_genes=200)
if data_filtered.n_obs == 0:
print("Warning: No cells passed filtering criteria")The preprocessing module integrates with METAINFORMANT's core utilities:
from metainformant.core.io import ensure_directory
# Use core I/O utilities for output paths
output_dir = ensure_directory("output/singlecell_analysis")- Use sparse matrices for large, sparse datasets
- Filter early to reduce computational burden
- Batch processing for very large datasets
- Memory monitoring for datasets > 50k cells
Preprocessing tests are available in tests/test_singlecell_preprocessing.py:
# Run preprocessing tests
uv run pytest tests/test_singlecell_preprocessing.py -v- Dimensionality Reduction: For HVG selection and PCA
- Quality Control: For raw sequencing QC
- Core I/O: For file handling utilities