Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis
Implementation of the decentralized quality control framework published in npj Digital Medicine: "[Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis](https://doi.org/10.1038/s41746-025-01499-0)".
The code investigates Parkinson's disease classification with non-identically distributed data across 83 centers, including simulated harmful data samples to study two scenarios: entire centers that provide only harmful data, and a single harmful data sample added to an otherwise good dataset.
If you find our framework, code, or paper useful to your research, please cite us!
@article{Souza2025,
author = {Raissa Souza and Emma A.M. Stanley and Anthony J. Winder and Chris Kang and Kimberly Amador and Erik Y. Ohara and Gabrielle Dagasso and Richard Camicioli and Oury Monchi and Zahinoor Ismail and Matthias Wilms and Nils D. Forkert},
doi = {10.1038/s41746-025-01499-0},
issn = {2398-6352},
issue = {1},
journal = {npj Digital Medicine},
month = {2},
pages = {1-14},
publisher = {Nature Publishing Group},
title = {Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis},
volume = {8},
url = {https://www.nature.com/articles/s41746-025-01499-0},
year = {2025}
}
Souza, R., Stanley, E. A. M., Winder, A. J., Kang, C., Amador, K., Ohara, E. Y., Dagasso, G., Camicioli, R., Monchi, O., Ismail, Z., Wilms, M., & Forkert, N. D. (2025). Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis. npj Digital Medicine, 8(1), 1–14. https://doi.org/10.1038/s41746-025-01499-0
Distributed learning enables collaborative machine learning model training without requiring cross-institutional data sharing, thereby addressing privacy concerns. However, local quality control variability can negatively impact model performance while systematic human visual inspection is time-consuming and may violate the goal of keeping data inaccessible outside acquisition centers. This work proposes a novel self-supervised method to identify and eliminate harmful data during distributed learning model training fully-automatically. Harmful data is defined as samples that, when included in training, increase misdiagnosis rates. The method was tested using neuroimaging data from 83 centers for Parkinson’s disease classification with simulated inclusion of a few harmful data samples. The proposed method reliably identified harmful images, with centers providing only harmful datasets being easier to identify than single harmful images within otherwise good datasets. While only evaluated using neuroimaging data, the presented method is application-agnostic and presents a step towards automated quality control in distributed learning.
- dirty_baselines: loads a pre-trained model and continues training it on all datasets (good and harmful) to evaluate how the model performs when harmful data samples are included in training.
- generate_harmful_samples: contains a Python script that loads an MRI dataset and inverts it to produce the inverted-MRI sample, and adds noise and clips the brain tissue to produce the pure-noise sample.
- inference: contains a script that generates per-cycle metrics for the models (accuracy, sensitivity, specificity, F1 score, and AUROC, for the overall dataset and for subgroups based on sex and age).
- standard_travelling_model: trains a classifier without quality control. Use this one to obtain the clean baseline and the pre-trained model.
- travelling_quality_control: trains a classifier with the quality control steps: verification, revisit, and elimination. This script also generates a file reporting the false positive rate, the false negative rate, and a binary flag indicating whether an image was flagged and eliminated.
All scripts take parameters that are documented in their argument parsers. Examples of how to call each of them:
python enc_PD_train_dirt_baseline.py -fn_train ./training_set.csv -en ./path_pretrained_encoder -pd ./path_pretrained_classifier -cycles 30 -epochs 1 -batch_size 5
To generate the harmful data samples, replace the name of the NIfTI file in the script.
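The two harmful sample types (inverted MRI and pure noise) can be sketched with plain NumPy. This is an illustrative reconstruction, not the repository script itself (which reads the NIfTI file, e.g. with SimpleITK); the function names and the clipping band are assumptions:

```python
import numpy as np

def make_inverted_sample(mri: np.ndarray) -> np.ndarray:
    """Invert voxel intensities so bright tissue becomes dark and vice versa."""
    return mri.max() - mri

def make_pure_noise_sample(mri: np.ndarray, seed=None) -> np.ndarray:
    """Add Gaussian noise and clip away the brain tissue signal,
    leaving a sample that is effectively pure noise."""
    rng = np.random.default_rng(seed)
    noisy = mri + rng.normal(0.0, mri.std(), size=mri.shape)
    # Clip intensities to a narrow background band so no anatomy survives
    # (the 5th-percentile cutoff is an assumption for illustration).
    return np.clip(noisy, 0.0, np.percentile(mri, 5))
```

In practice the resulting arrays would be written back to NIfTI files so they can be mixed into a center's dataset.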
python inference_pd_distributed.py -fn ./test_set.csv -en ./path_encoder -pd ./path_classifier -o filename_to_save
For inference, change the loop indices to set the range of models you want to evaluate. For example, for the clean baseline, check the performance from cycle 0 to 30 (the maximum cycle trained); for the dirty baselines, check from cycle 10 to 30, because they start from a pre-trained model.
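A minimal version of that per-cycle evaluation might look like the sketch below, using the scikit-learn version listed in the requirements. The model-loading line and file name pattern are placeholders, not the script's actual names; the metric set matches the one listed for the inference script:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

def cycle_metrics(y_true, y_prob, threshold=0.5):
    """Compute the metrics reported per training cycle."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_prob),
    }

# Loop indices set the evaluated range, e.g. cycles 10..30 for dirty baselines:
for cycle in range(10, 31):
    # model = load_model(f"model_cycle_{cycle}")  # placeholder name
    ...
```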
python enc_PD_train_distributed.py -fn_train ./training_set.csv -cycles 30 -epochs 1 -batch_size 5
You may change the fixed name used to save the models and the folder they are saved to.
python enc_PD_quality_control.py -fn_train ./training_set.csv -fn_test ./test_set.csv -cycles 30 -en ./path_encoder -pd ./path_classifier -revisit 2 -error 0.02 -fn_save filename
You may change the number of local epochs (currently fixed to 1) and the batch size (currently fixed to 5 when the dataset has at least 5 samples).
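The verification, revisit, and elimination steps controlled by the `-revisit` and `-error` parameters can be sketched as follows. This is an illustrative reconstruction under assumptions, not the exact logic of `enc_PD_quality_control.py`: a center whose update degrades validation error beyond the tolerance is scheduled for revisits, and eliminated once it fails more than the allowed number of re-checks:

```python
def quality_control_step(center, val_error_before, val_error_after,
                         error_tol=0.02, max_revisits=2, state=None):
    """Decide whether to keep, revisit, or eliminate a center's data
    based on how its update changed the validation error."""
    state = state if state is not None else {}
    if val_error_after - val_error_before > error_tol:   # verification failed
        state[center] = state.get(center, 0) + 1         # count failed checks
        if state[center] > max_revisits:
            return "eliminate", state                    # too many failures
        return "revisit", state                          # re-check next cycle
    state.pop(center, None)                              # passed: reset counter
    return "keep", state
```

With `-revisit 2 -error 0.02`, a center would thus be eliminated after its third failed verification; a center whose updates stay within tolerance is kept and its counter reset.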
Our Keras model pipeline used:
- Python 3.10.6
- pandas 1.5.0
- numpy 1.23.3
- scikit-learn 1.1.2
- simpleitk 2.1.1.1
- tensorflow-gpu 2.10.0
- cudnn 8.4.1.50
- cudatoolkit 11.7.0
GPU: NVIDIA GeForce RTX 3090
Full environment in requirements.txt.
- Questions? Open an issue or send an email.

