SurGen is a publicly available colorectal cancer dataset comprising 1,020 whole slide images (WSIs) from 843 cases. Each WSI is digitised at 40× magnification (0.1112 µm/pixel) and is accompanied by key genetic markers (KRAS, NRAS, BRAF), mismatch repair (MMR) status, and five-year survival data (available for 426 cases). This repository provides standard train-validation-test splits and example scripts to facilitate machine learning experiments in computational pathology, biomarker discovery, and prognostic modelling. SurGen aims to advance cancer diagnostics by offering a consistent, high-quality dataset with rich annotations, supporting both targeted research on primary colorectal tumours and broader studies of metastatic sites.
SurGen is split into two sub-cohorts:
- SR386 – Primary colorectal cancer (427 WSIs) with five-year survival data.
- SR1482 – Colorectal cancer cases (593 WSIs) including metastatic lesions (liver, lung, peritoneum), with full biomarker data.
Each WSI is stored in Zeiss .CZI format. For convenience, precomputed patch embeddings (extracted using the UNI foundation model) are also available.
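As a quick sense of scale for the stated pixel size, the following sketch converts between pixels and micrometres. The 0.1112 µm/pixel value comes from the dataset description above; the 224-pixel patch size and the 0.5 µm/pixel target are assumptions chosen only for illustration, not settings prescribed by this repository.

```python
# Pixel size stated for SurGen WSIs at 40x magnification.
MICRONS_PER_PIXEL = 0.1112


def patch_extent_um(patch_px: int, mpp: float = MICRONS_PER_PIXEL) -> float:
    """Physical side length (micrometres) covered by a square patch."""
    return patch_px * mpp


def downsample_factor(target_mpp: float, mpp: float = MICRONS_PER_PIXEL) -> float:
    """Factor by which the native image must shrink to reach target_mpp."""
    return target_mpp / mpp


# Assumed illustration values: 224-pixel patches, 0.5 um/pixel target.
print(f"{patch_extent_um(224):.2f} um per patch side")
print(f"{downsample_factor(0.5):.2f}x downsampling")
```

At native resolution a 224-pixel patch spans roughly 25 µm, so patches intended to capture wider tissue context would typically be read at a lower resolution or with a larger pixel footprint.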
The SurGen dataset is hosted on the EBI FTP server. You can download the Whole Slide Images (WSIs) for both sub-cohorts (SR386 and SR1482) using wget, an FTP client, or you can download directly from the EBI website.
For most users, the easiest way to download the WSIs is via wget:
wget -r -np -nH --cut-dirs=6 ftp://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR386_WSIs/
wget -r -np -nH --cut-dirs=6 ftp://ftp.ebi.ac.uk/biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR1482_WSIs/

This will download the respective data into SR386_WSIs/ and SR1482_WSIs/ folders in your current directory.
- -np: no parent (prevents downloading higher-level directories).
- -nH: no host (ignores 'ftp.ebi.ac.uk' in the local directory structure).
- --cut-dirs=6: ensures you get a clean directory structure without extra nested folders.
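To make the effect of -nH and --cut-dirs=6 concrete, here is a small Python sketch that models how the local path is derived from the remote one. It is a simplified illustration of the documented behaviour, not wget's actual implementation; the function name `wget_local_path` and the file name `example.czi` are hypothetical.

```python
from pathlib import PurePosixPath


def wget_local_path(host, remote_path, cut_dirs=0, no_host=False):
    """Approximate the local path wget -r would create for a remote file."""
    parts = [p for p in remote_path.split("/") if p]
    *dirs, filename = parts
    kept = dirs[cut_dirs:]                 # --cut-dirs drops leading directories
    prefix = [] if no_host else [host]     # -nH drops the host-name directory
    return str(PurePosixPath(*prefix, *kept, filename))


# Hypothetical file name, used only to show the path mapping.
remote = "/biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR386_WSIs/example.czi"
print(wget_local_path("ftp.ebi.ac.uk", remote))
print(wget_local_path("ftp.ebi.ac.uk", remote, cut_dirs=6, no_host=True))
```

The six directories removed are biostudies, fire, S-BIAD, 285, S-BIAD1285, and Files, which leaves only the cohort folder in the local path.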
If you prefer to use FTP, follow these steps:
- Open a terminal and connect to the FTP server:
  ftp ftp.ebi.ac.uk
  - When prompted, enter anonymous as the username.
  - Press Enter for the password.
- Navigate to the SR386 directory:
  cd /biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR386_WSIs
  Or for the SR1482 directory:
  cd /biostudies/fire/S-BIAD/285/S-BIAD1285/Files/SR1482_WSIs
- Enable binary mode to correctly transfer .CZI files:
  binary
- Download all .CZI files:
  prompt
  mget *.czi
- Close the FTP connection:
  exit
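The same FTP session can be scripted. The sketch below uses Python's standard-library ftplib and mirrors the manual steps: anonymous login, change directory, binary transfer of every .czi file. The helper names `select_czi` and `download_cohort` are hypothetical, and the code has not been validated against the live server, so treat it as a starting point rather than a supported download tool.

```python
from ftplib import FTP
from pathlib import Path

HOST = "ftp.ebi.ac.uk"
BASE = "/biostudies/fire/S-BIAD/285/S-BIAD1285/Files"


def select_czi(names):
    """Keep only .czi files (case-insensitive), in sorted order."""
    return sorted(n for n in names if n.lower().endswith(".czi"))


def download_cohort(cohort, dest="."):
    """Download every .czi file of one cohort folder, e.g. 'SR386_WSIs'."""
    out_dir = Path(dest) / cohort
    out_dir.mkdir(parents=True, exist_ok=True)
    with FTP(HOST) as ftp:
        ftp.login()                       # no arguments means anonymous login
        ftp.cwd(f"{BASE}/{cohort}")
        for name in select_czi(ftp.nlst()):
            with open(out_dir / name, "wb") as fh:
                # retrbinary transfers in binary mode, like the `binary` command
                ftp.retrbinary(f"RETR {name}", fh.write)


# Example call (starts a large download, so it is left commented out):
# download_cohort("SR386_WSIs")
```

Whole slide images are large, so check available disk space before running the download for either cohort.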
You can also use an FTP GUI client such as FileZilla or Cyberduck:
- Host: ftp.ebi.ac.uk
- Username: anonymous
- Port: 21
- Path: /biostudies/fire/S-BIAD/285/S-BIAD1285/Files/
The reproducibility directory contains step-by-step instructions to replicate the results shown in our DataNote paper. These include:
- Details on environment setup and required dependencies.
- Scripts for processing the WSIs and generating patch-level features.
- Guidelines for reproducing slide-level prediction results.
This ensures that all experiments can be reliably reproduced by other researchers using the provided dataset and embeddings.
The notebooks directory provides interactive examples for exploring the SurGen dataset and pre-extracted features:
- simple_load_wsi_tile.ipynb – Demonstrates how to interact with .CZI files in Python, including reading and viewing tiles from WSIs.
- patch_feature_extraction.ipynb – Shows how to extract patch-level features using Hugging Face models; this example uses the UNI foundation model.
- zarr_examined.ipynb – Explains the layout and usage of pre-extracted SurGen features stored in Zarr format, making it easier to integrate with downstream analysis pipelines.
These notebooks provide a practical starting point for using the dataset and applying it to various computational pathology tasks.
Disclaimer: The CSV files in the "reproducibility/dataset_csv" folder are not covered by the same license as the source code. These data files are dedicated to the public domain under CC0.
If you find this dataset or repository useful, please consider citing the following:
@article{myles2025surgen,
title={SurGen: 1020 H\&E-stained whole-slide images with survival and genetic markers},
author={Myles, Craig and Um, In Hwa and Marshall, Craig and Harris-Birtill, David and Harrison, David J},
journal={GigaScience},
volume={14},
pages={giaf086},
year={2025},
publisher={Oxford University Press}
}

@inproceedings{myles2024leveraging,
title={Leveraging foundation models for enhanced detection of colorectal cancer biomarkers in small datasets},
author={Myles, Craig and Um, In Hwa and Harrison, David J and Harris-Birtill, David},
booktitle={Annual Conference on Medical Image Understanding and Analysis},
pages={329--343},
year={2024},
organization={Springer}
}