Nextflow pipeline for automated single-cell cell type annotation using scVI embeddings and random forest classification. Designed to annotate cell types from single-cell data loaded into the Gemma database. Cell types are assigned using a random forest classifier trained on scVI embeddings from the CellxGene data corpus [1, 2, 3].
- Pipeline Summary
- Pipeline Structure
- Requirements
- Installation
- Quick Start
- Usage
- Parameters
- Output
- QC and Outlier Detection
- References
- Input validation - Validates study names or paths and downloads data if needed
- Reference preparation - Downloads scVI model and pulls reference embeddings from CellxGene Census
- Query processing - Generates scVI embeddings for query datasets
- Cell type classification - Classifies cells using random forest on scVI embeddings
- QC reporting - Generates QC metrics and MultiQC reports with outlier detection
- Gemma upload (optional) - Uploads annotations to Gemma database
This pipeline follows nf-core best practices with a modular architecture.
sc-annotation-pipeline-rachel-dev/
├── main.nf # Entry point
├── nextflow.config # Main configuration
├── nextflow_schema.json # Nextflow schema
├── workflows/
│ └── scannotate.nf # Main workflow orchestration
├── subworkflows/
│ └── local/
│ ├── input_check/
│ ├── prepare_reference/
│ ├── process_queries/
│ ├── classify_celltypes/
│ ├── qc_reporting/
│ └── gemma_upload/
├── modules/
│ └── local/
├── conf/
│ ├── base.config # Resource defaults (CPU/memory/time)
│ ├── modules.config # Module-specific settings
│ ├── test_mmus.config # Mouse test profile
│ └── test_hsap.config # Human test profile
├── assets/ # Reference files and configs
│ ├── samplesheet.csv
│ ├── cell_type_markers.tsv
│ └── ...
├── bin/ # Python scripts
├── params.mm.json # Mouse parameters
├── params.hs.json # Human parameters
├── README.md
└── ...
| Subworkflow | Description |
|---|---|
| INPUT_CHECK | Validates inputs and downloads/prepares studies from Gemma |
| PREPARE_REFERENCE | Downloads scVI model and Census reference data |
| PROCESS_QUERIES | Processes query data through scVI model to generate embeddings |
| CLASSIFY_CELLTYPES | Random forest classification using reference embeddings |
| QC_REPORTING | QC analysis, outlier detection, and MultiQC report generation |
| GEMMA_UPLOAD | Uploads cell type annotations and QC masks to Gemma |

Two test profiles are provided:
- test_mouse: Runs a minimal pipeline test with mouse (Mus musculus) parameters.
- test_human: Runs a minimal pipeline test with human (Homo sapiens) parameters.
Use with -profile test_mouse,conda or -profile test_human,conda.
- Nextflow >= 23.04.0
- Conda/Mamba or Singularity/Docker for environment management
Stable release is installed in:
/space/grp/Pipelines/sc-annotation-pipeline
For development:
git clone https://github.com/PavlidisLab/sc-annotation-pipeline.git
cd sc-annotation-pipeline
# Using a samplesheet (recommended)
nextflow run main.nf -profile conda -params-file params.mm.json \
--input samplesheet.csv
# Mouse studies (from study names)
nextflow run main.nf -profile conda -params-file params.mm.json \
--study_names "GSE154208 GSE123456"
# Human studies (from paths)
nextflow run main.nf -profile conda -params-file params.hs.json \
--study_paths "/path/to/study1 /path/to/study2"
# Run with human test profile
nextflow run main.nf -profile test_human,conda
# Run with mouse test profile
nextflow run main.nf -profile test_mouse,conda
The pipeline supports three input methods:
- Samplesheet (--input) - Recommended, most flexible
- Study names (--study_names) - Download from Gemma by name
- Study paths (--study_paths) - Use pre-downloaded local data
A samplesheet is a CSV file that allows you to mix studies to download and local paths in a single run.
Format:
sample,study_name,study_path
GSE154208,GSE154208,
GSE123456,GSE123456,
local_study,,/path/to/local/data

| Column | Required | Description |
|---|---|---|
| sample | Yes | Unique sample identifier |
| study_name | No* | Study name to download from Gemma |
| study_path | No* | Path to pre-downloaded MEX data |

*At least one of study_name or study_path must be provided per row.
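The row rule above can be checked before launching a run. The following is a minimal sketch, not the pipeline's own validation code: `validate_samplesheet` is a hypothetical helper, and it assumes the header is exactly the three columns shown in the format example.

```python
import csv
import io


def validate_samplesheet(text):
    """Return a list of error strings for a samplesheet given as CSV text.

    Hypothetical helper: checks the documented header, unique sample IDs,
    and that each row provides study_name or study_path.
    """
    errors = []
    reader = csv.DictReader(io.StringIO(text))
    expected = ["sample", "study_name", "study_path"]
    if reader.fieldnames != expected:
        return [f"header must be {','.join(expected)}"]
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        # Missing trailing fields come back as None, so normalise to "".
        sample = (row["sample"] or "").strip()
        name = (row["study_name"] or "").strip()
        path = (row["study_path"] or "").strip()
        if not sample:
            errors.append(f"line {line_no}: sample is empty")
        elif sample in seen:
            errors.append(f"line {line_no}: duplicate sample '{sample}'")
        seen.add(sample)
        if not name and not path:
            errors.append(f"line {line_no}: need study_name or study_path")
    return errors


example = """sample,study_name,study_path
GSE154208,GSE154208,
local_study,,/path/to/local/data
"""
print(validate_samplesheet(example))  # []
```

An empty list means every row has a unique sample identifier and at least one source for its data.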
Example usage:
nextflow run main.nf -profile conda -params-file params.mm.json --input assets/samplesheet.csv
An example samplesheet is provided at assets/samplesheet.csv.
Download studies directly from Gemma by name:
# Space-separated list
nextflow run main.nf -profile conda -params-file params.mm.json \
--study_names "experiment1 experiment2"
# Text file (one study per line)
nextflow run main.nf -profile conda -params-file params.mm.json \
--study_names studies.txt
Use pre-downloaded MEX data:
# Space-separated list
nextflow run main.nf -profile conda -params-file params.mm.json \
--study_paths "/data/gemma/experiment1 /data/gemma/experiment2"
# Text file (one path per line)
nextflow run main.nf -profile conda -params-file params.mm.json \
--study_paths paths.txt

| Profile | Description |
|---|---|
| conda | Use Conda for environment management |
| test_human | Test configuration for human data |
| test_mouse | Test configuration for mouse data |

SLURM execution is enabled by default and does not require a separate profile.
Resume a previous run with -resume:
nextflow run main.nf -profile conda -resume -params-file params.mm.json
Specify a custom work directory to keep intermediate files organized:
nextflow run main.nf -profile conda -work-dir /scratch/my_workdir ...

| Parameter | Description | Default |
|---|---|---|
| --input | Samplesheet CSV file (recommended) | null |
| --study_names | Study names (space-separated or file, legacy) | null |
| --study_paths | Study paths (space-separated or file, legacy) | null |
| --outdir | Output directory | Auto-generated with timestamp |
| --publish_dir_mode | Method for publishing files | copy |

| Parameter | Description | Default |
|---|---|---|
| --organism | Species (mus_musculus or homo_sapiens) | mus_musculus |
| --census_version | CellxGene Census version | 2024-07-01 |
| --organ | Organ to sample from Census | brain |
| --subsample_ref | Cells per cell type to subsample | 500 |
| --ref_collections | Reference collection names (list) | See params.json |
| --ref_keys | Reference annotation keys | ["subclass_cell_type", "class_cell_type", "family_cell_type"] |

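To illustrate what --subsample_ref controls, here is a minimal sketch of capping the number of reference cells per cell type. It is an assumption about the general technique, not the pipeline's implementation: `subsample_per_cell_type` is a hypothetical name, and it assumes that types with fewer cells than the cap are kept whole and that a fixed seed is used for reproducibility.

```python
import random
from collections import defaultdict


def subsample_per_cell_type(labels, n_per_type=500, seed=42):
    """Return sorted indices keeping at most n_per_type cells per label."""
    rng = random.Random(seed)  # local RNG so the result is reproducible
    by_type = defaultdict(list)
    for idx, label in enumerate(labels):
        by_type[label].append(idx)
    keep = []
    for label in sorted(by_type):  # fixed iteration order for determinism
        indices = by_type[label]
        if len(indices) > n_per_type:
            indices = rng.sample(indices, n_per_type)
        keep.extend(indices)
    return sorted(keep)


labels = ["astrocyte"] * 5 + ["microglia"] * 2
kept = subsample_per_cell_type(labels, n_per_type=3)
print(len(kept))  # 5: three astrocytes plus both microglia
```

Capping per type rather than sampling uniformly keeps rare cell types represented in the reference.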
| Parameter | Description | Default |
|---|---|---|
| --cutoff | Min probability to assign label | 0 |
| --seed | Random seed | 42 |
| --process_samples | Process samples individually | false |
| --preferredCtaLevel | Preferred annotation level | class |

| Parameter | Description | Default |
|---|---|---|
| --nmads | MADs for outlier detection (map with per-metric thresholds) | [mito: 20, umi: 5, genes: 5, counts: 5] |

The nmads parameter accepts a map with separate thresholds for each QC metric:
- mito: Mitochondrial content outliers
- umi: UMI count outliers
- genes: Gene count outliers
- counts: Counts relationship outliers (genes vs UMI linear model)
Override specific metrics on the command line:
nextflow run main.nf --nmads.mito 3 --nmads.counts 4

| Parameter | Description |
|---|---|
| --rename_file | Cell type renaming/selection file |
| --markers_file | Marker genes for QC plotting |
| --gene_mapping | NCBI to ENSEMBL/HGNC mapping |
| --multiqc_config | MultiQC configuration |
| --original_celltype_columns | Original cell type column mapping |
| --author_annotations_path | Author-provided annotations directory |

| Parameter | Description | Default |
|---|---|---|
| --use_staging | Use Gemma staging server | true |
| --upload_cta | Upload cell type annotations | true |
| --upload_clc | Upload cell-level characteristics | true |
| --upload_mask | Upload outlier mask | true |
| --upload_multiqc | Upload MultiQC report | true |

The specified ctaProtocol must exist in Gemma prior to upload. The ctaProtocol is determined by the pipeline version, e.g. sc-pipeline-2.0.0dev. The corresponding ctaName is determined by a combination of pipeline version and level of granularity, e.g. sc-pipeline-2.0.0dev-class. To create a ctaProtocol in Gemma, run:
gemma-cli addProtocol --name "sc-pipeline-<new version>"
Note: Set GEMMA_USERNAME and GEMMA_PASSWORD environment variables for Gemma uploads.
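The naming convention and credential requirement described above can be sketched as a small pre-upload check. The helper names here are hypothetical and this is not how the pipeline itself builds the names; the sketch only mirrors the two examples given (sc-pipeline-2.0.0dev and sc-pipeline-2.0.0dev-class).

```python
import os


def cta_protocol(version):
    """Protocol name for a pipeline version, e.g. sc-pipeline-2.0.0dev."""
    return f"sc-pipeline-{version}"


def cta_name(version, level):
    """Annotation name: protocol name plus granularity level."""
    return f"{cta_protocol(version)}-{level}"


def missing_gemma_credentials(env=None):
    """List the credential variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in ("GEMMA_USERNAME", "GEMMA_PASSWORD") if not env.get(key)]


print(cta_protocol("2.0.0dev"))       # sc-pipeline-2.0.0dev
print(cta_name("2.0.0dev", "class"))  # sc-pipeline-2.0.0dev-class
print(missing_gemma_credentials({"GEMMA_USERNAME": "me"}))  # ['GEMMA_PASSWORD']
```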
Pre-configured parameter files are provided:
- params.mm.json - Mouse (mus_musculus)
- params.hs.json - Human (homo_sapiens)
Parameter priority: CLI arguments > params.json > nextflow.config
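The priority order can be pictured as layered dictionary updates, with later layers overriding earlier ones. This is a simplified sketch of the idea, not Nextflow's actual resolution logic; `resolve_params` and the example values are hypothetical, and nested maps such as nmads are not merged key by key here.

```python
def resolve_params(config_defaults, params_file, cli_args):
    """Merge parameter layers; later layers override earlier ones."""
    resolved = dict(config_defaults)  # lowest priority: nextflow.config
    resolved.update(params_file)      # middle: -params-file JSON
    resolved.update(cli_args)         # highest: command-line arguments
    return resolved


config_defaults = {"organism": "mus_musculus", "cutoff": 0, "seed": 42}
params_file = {"organism": "homo_sapiens"}
cli_args = {"cutoff": 0.5}
print(resolve_params(config_defaults, params_file, cli_args))
# {'organism': 'homo_sapiens', 'cutoff': 0.5, 'seed': 42}
```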
results/
└── mus_musculus_subsample_ref_500_2025-01-15_17-51-37/
├── celltypes/
│ └── *_predicted_celltype.tsv # Cell type predictions
├── masks/
│ └── *_outlier_mask.tsv # QC outlier masks
├── multiqc/
│ └── multiqc_report.html # QC summary report
├── pipeline_info/
│ ├── execution_report.html # Nextflow execution report
│ ├── execution_timeline.html # Timeline visualization
│ ├── execution_trace.txt # Resource usage trace
│ └── pipeline_dag.dot # Pipeline DAG
└── versions/
└── software_versions.yml # Software versions used
A custom MultiQC report is generated by process_QC.py for each experiment. Outliers are defined per sample using the median absolute deviation (MAD), following best practices [4].
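Cells are marked as outliers if:

$$|x_i - \mathrm{median}(x)| > n \cdot \mathrm{MAD}(x)$$

where $x_i$ is the metric value for cell $i$, $n$ is the per-metric threshold (set with --nmads), and $\mathrm{MAD}(x) = \mathrm{median}(|x_i - \mathrm{median}(x)|)$.
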
| Metric | scanpy field | nmads key | Default |
|---|---|---|---|
| Mitochondrial | pct_counts_mito | mito | 20 |
| Gene content | log1p_n_genes_by_counts | genes | 5 |
| UMI content | log1p_total_counts | umi | 5 |

Note on mitochondrial threshold: The default mito threshold is set higher (20 MADs) than other metrics because certain cell types (e.g., astrocytes) exhibit higher mitochondrial gene expression. A stricter threshold can lead to disproportionate false-positive outlier calls for these cell types.
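As a rough illustration of MAD-based flagging, the sketch below marks a value as an outlier when it lies more than nmads MADs from the sample median. It uses only the Python standard library and is not taken from process_QC.py; `mad_outliers` is a hypothetical name and the input values are made up.

```python
from statistics import median


def mad_outliers(values, nmads):
    """Flag values more than nmads MADs away from the median.

    If the MAD is zero (more than half the values identical), any value
    that differs from the median is flagged.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    return [abs(v - center) > nmads * mad for v in values]


# Made-up log1p_total_counts values for six cells in one sample.
log_umi = [5.0, 5.2, 4.9, 5.1, 5.0, 12.0]
print(mad_outliers(log_umi, nmads=5))
# [False, False, False, False, False, True]
```

Because both the centre and the spread are medians, a single extreme cell does not inflate the threshold the way it would with a mean and standard deviation.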
"Counts" outliers are cells whose gene counts deviate from the expected log-linear relationship with UMI counts:
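
$$r_i = \log(1 + n_{\text{genes},i}) - \hat{y}_i$$

where fitted values come from:

$$\hat{y}_i = \beta_0 + \beta_1 \log(1 + n_{\text{UMI},i})$$

Outliers:

$$|r_i - \mathrm{median}(r)| > n \cdot \mathrm{MAD}(r)$$

The threshold $n$ is set with --nmads.counts (default: 5).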
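A minimal sketch of the counts check: fit a line of log gene counts on log UMI counts, then apply a MAD rule to the residuals. The least-squares fit and the choice to centre residuals on their median are assumptions made for this example; process_QC.py may fit or threshold differently, and the function names and data are hypothetical.

```python
from statistics import median


def fit_line(x, y):
    """Ordinary least squares fit of y ~ intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def counts_outliers(log_umi, log_genes, nmads=5):
    """Flag cells whose residual from the fitted line exceeds nmads MADs."""
    intercept, slope = fit_line(log_umi, log_genes)
    residuals = [g - (intercept + slope * u) for u, g in zip(log_umi, log_genes)]
    center = median(residuals)
    mad = median(abs(r - center) for r in residuals)
    return [abs(r - center) > nmads * mad for r in residuals]


# Made-up values: five cells near one line, one cell with far too few genes.
log_umi = [6.0, 7.0, 8.0, 9.0, 10.0, 8.0]
log_genes = [5.0, 6.01, 6.99, 8.0, 9.0, 4.0]
print(counts_outliers(log_umi, log_genes))
# [False, False, False, False, False, True]
```

The last cell has a typical UMI count but far fewer detected genes than the fitted line predicts, which is the pattern this metric is meant to catch.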
Doublets are predicted using the Scanpy implementation of Scrublet (Wolock et al., 2019).
1. Lim N., et al., Curation of over 10,000 transcriptomic studies to enable data reuse. Database, 2021.
2. CZI Single-Cell Biology Program, et al. "CZ CELL×GENE Discover: A Single-Cell Data Platform for Scalable Exploration, Analysis and Modeling of Aggregated Data," November 2, 2023. https://doi.org/10.1101/2023.10.30.563174.
3. Lopez, Romain, et al. "Deep Generative Modeling for Single-Cell Transcriptomics." Nature Methods 15, no. 12 (December 2018): 1053–58. https://doi.org/10.1038/s41592-018-0229-2.
4. Heumos, L., Schaar, A.C., Lance, C. et al. Best practices for single-cell analysis across modalities. Nat Rev Genet (2023). https://doi.org/10.1038/s41576-023-00586-w