Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis
Implementation of the decentralized quality control framework published in npj Digital Medicine: "[Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis](https://doi.org/10.1038/s41746-025-01499-0)".
The code investigates Parkinson's disease classification with non-identically distributed data across 83 centers, including simulated harmful data samples to study two scenarios: entire centers that provide only harmful data, and a single harmful data sample added to an otherwise good dataset.
If you find our framework, code, or paper useful to your research, please cite us!
@article{Souza2025,
author = {Raissa Souza and Emma A.M. Stanley and Anthony J. Winder and Chris Kang and Kimberly Amador and Erik Y. Ohara and Gabrielle Dagasso and Richard Camicioli and Oury Monchi and Zahinoor Ismail and Matthias Wilms and Nils D. Forkert},
doi = {10.1038/s41746-025-01499-0},
issn = {2398-6352},
issue = {1},
journal = {npj Digital Medicine},
month = {2},
pages = {1-14},
publisher = {Nature Publishing Group},
title = {Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis},
volume = {8},
url = {https://www.nature.com/articles/s41746-025-01499-0},
year = {2025}
}
Souza, R., Stanley, E. A. M., Winder, A. J., Kang, C., Amador, K., Ohara, E. Y., Dagasso, G., Camicioli, R., Monchi, O., Ismail, Z., Wilms, M., & Forkert, N. D. (2025). Self-supervised identification and elimination of harmful datasets in distributed machine learning for medical image analysis. npj Digital Medicine, 8(1), 1–14. https://doi.org/10.1038/s41746-025-01499-0
Distributed learning enables collaborative machine learning model training without requiring cross-institutional data sharing, thereby addressing privacy concerns. However, local quality control variability can negatively impact model performance while systematic human visual inspection is time-consuming and may violate the goal of keeping data inaccessible outside acquisition centers. This work proposes a novel self-supervised method to identify and eliminate harmful data during distributed learning model training fully-automatically. Harmful data is defined as samples that, when included in training, increase misdiagnosis rates. The method was tested using neuroimaging data from 83 centers for Parkinson’s disease classification with simulated inclusion of a few harmful data samples. The proposed method reliably identified harmful images, with centers providing only harmful datasets being easier to identify than single harmful images within otherwise good datasets. While only evaluated using neuroimaging data, the presented method is application-agnostic and presents a step towards automated quality control in distributed learning.
- dirty_baselines: loads a pre-trained model and continues training it on all datasets (good and harmful) to evaluate how the model performs when harmful data samples are included in training.
- generate_harmful_samples: contains a Python script that loads an MRI dataset and inverts it to produce the inverted-MRI sample, and adds noise and clips the brain tissue to produce the pure-noise sample.
- inference: contains a script that generates per-cycle metrics for the models (accuracy, sensitivity, specificity, F1 score, and AUROC, for the overall dataset and for subgroups based on sex and age).
- standard_travelling_model: trains a classifier without quality control. Use this one to obtain the clean baseline and the pre-trained model.
- travelling_quality_control: trains a classifier with the quality control steps: verification, revisit, and elimination. This script also generates a file reporting the false positive rate, the false negative rate, and a binary flag indicating whether an image was flagged and eliminated.
All scripts take parameters that are documented in their argument parsers. Examples of how to call each of them:
python enc_PD_train_dirt_baseline.py -fn_train ./training_set.csv -en ./path_pretrained_encoder -pd ./path_pretrained_classifier -cycles 30 -epochs 1 -batch_size 5
To generate the harmful data samples, replace the name of the NIfTI file in the script.
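The two harmful sample types (inverted MRI and pure noise) can be sketched with plain NumPy. This is an illustrative reconstruction, not the repository script itself (which reads the NIfTI file, e.g. with SimpleITK); the function names and the clipping band are assumptions:

```python
import numpy as np

def make_inverted_sample(mri: np.ndarray) -> np.ndarray:
    """Invert voxel intensities so bright tissue becomes dark and vice versa."""
    return mri.max() - mri

def make_pure_noise_sample(mri: np.ndarray, seed=None) -> np.ndarray:
    """Add Gaussian noise and clip away the brain tissue signal,
    leaving a sample that is effectively pure noise."""
    rng = np.random.default_rng(seed)
    noisy = mri + rng.normal(0.0, mri.std(), size=mri.shape)
    # Clip intensities to a narrow background band so no anatomy survives
    # (the 5th-percentile cutoff is an assumption for illustration).
    return np.clip(noisy, 0.0, np.percentile(mri, 5))
```

In practice the resulting arrays would be written back to NIfTI files so they can be mixed into a center's dataset.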
python inference_pd_distributed.py -fn ./test_set.csv -en ./path_encoder -pd ./path_classifier -o filename_to_save
For inference, change the loop indices to set the range of models you want to evaluate. For example, for the clean baseline, check the performance from cycle 0 to 30 (the maximum cycle trained); for the dirty baselines, check from cycle 10 to 30, because they start from a pre-trained model.
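A minimal version of that per-cycle evaluation might look like the sketch below, using the scikit-learn version listed in the requirements. The model-loading line and file name pattern are placeholders, not the script's actual names; the metric set matches the one listed for the inference script:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

def cycle_metrics(y_true, y_prob, threshold=0.5):
    """Compute the metrics reported per training cycle."""
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": f1_score(y_true, y_pred),
        "auroc": roc_auc_score(y_true, y_prob),
    }

# Loop indices set the evaluated range, e.g. cycles 10..30 for dirty baselines:
for cycle in range(10, 31):
    # model = load_model(f"model_cycle_{cycle}")  # placeholder name
    ...
```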
python enc_PD_train_distributed.py -fn_train ./training_set.csv -cycles 30 -epochs 1 -batch_size 5
You may change the fixed name used to save the models and the folder they are saved to.
python enc_PD_quality_control.py -fn_train ./training_set.csv -fn_test ./test_set.csv -cycles 30 -en ./path_encoder -pd ./path_classifier -revisit 2 -error 0.02 -fn_save filename
You may change the number of local epochs (currently fixed to 1) and the batch size (currently fixed to 5 when the dataset has at least 5 samples).
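The verification, revisit, and elimination steps controlled by the `-revisit` and `-error` parameters can be sketched as follows. This is an illustrative reconstruction under assumptions, not the exact logic of `enc_PD_quality_control.py`: a center whose update degrades validation error beyond the tolerance is scheduled for revisits, and eliminated once it fails more than the allowed number of re-checks:

```python
def quality_control_step(center, val_error_before, val_error_after,
                         error_tol=0.02, max_revisits=2, state=None):
    """Decide whether to keep, revisit, or eliminate a center's data
    based on how its update changed the validation error."""
    state = state if state is not None else {}
    if val_error_after - val_error_before > error_tol:   # verification failed
        state[center] = state.get(center, 0) + 1         # count failed checks
        if state[center] > max_revisits:
            return "eliminate", state                    # too many failures
        return "revisit", state                          # re-check next cycle
    state.pop(center, None)                              # passed: reset counter
    return "keep", state
```

With `-revisit 2 -error 0.02`, a center would thus be eliminated after its third failed verification; a center whose updates stay within tolerance is kept and its counter reset.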
Our Keras model pipeline used:
- Python 3.10.6
- pandas 1.5.0
- numpy 1.23.3
- scikit-learn 1.1.2
- simpleitk 2.1.1.1
- tensorflow-gpu 2.10.0
- cudnn 8.4.1.50
- cudatoolkit 11.7.0
GPU: NVIDIA GeForce RTX 3090
Full environment in requirements.txt.
- Questions? Open an issue or send an email.

