Transfer Learning vs Training from Scratch for Medical Image Segmentation
An empirical study comparing frozen DINOv3 encoders against baseline U-Net for skin lesion segmentation on the ISIC2018 dataset across different data regimes (25%, 50%, 100%).
Key Finding: Transfer learning dominates with limited data (+4.8% at 25%), but baseline U-Net surpasses frozen encoders with full data (+1.3% at 100%).
- Overview
- Key Results
- Installation
- Quick Start
- Project Structure
- Experiments
- Visualizations
- Citation
- License
This repository contains the complete implementation and analysis of an independent research project comparing transfer learning (frozen DINOv3 encoders) against training from scratch (baseline U-Net) for medical image segmentation.
Research Questions:
- Does transfer learning beat baseline in low-data scenarios?
- Does baseline catch up with full data?
- Is larger encoder always better (Small < Base < Large)?
Dataset: ISIC2018 Skin Lesion Analysis Challenge (2,594 training images)
Models Compared:
- Baseline U-Net (7.76M params, trained from scratch)
- DINOv3-Small + Custom Decoder (25M total, 4M trainable)
- DINOv3-Base + Custom Decoder (90M total, 4M trainable)
- DINOv3-Large + Custom Decoder (156M total, 4M trainable)
Motivation: As a recent graduate in Signal and Image Processing, I wanted to test some intuitions about foundation models and the self-supervised learning paradigm on a concrete use case. This project explores when pre-trained models (like DINOv3) actually provide value versus simpler approaches trained from scratch - particularly in data-constrained medical imaging scenarios.
Context: This work was conducted independently during my job search period to deepen my understanding of transfer learning trade-offs and provide practitioners with evidence-based guidance on model selection based on dataset size.
| Model | 25% Data | 50% Data | 100% Data |
|---|---|---|---|
| Baseline U-Net | 0.828 | 0.867 | 0.898 ⭐ |
| DINOv3-Small | 0.867 | 0.887 | 0.893 |
| DINOv3-Base | 0.876 | 0.897 | 0.885 |
| DINOv3-Large | 0.878 | 0.877 | 0.894 |
H1: Transfer Learning Dominates Low-Data Scenarios ✅
- +4.8% advantage at 25% data (650 images)
- 7× more data-efficient than training from scratch
- $12-14K saved in annotation costs
H2: Baseline Surpasses at Scale ✅
- Baseline wins at 100% data (+1.3% over DINOv3-Base)
- Complete reversal from low-data regime
- Win rate: 43.5% → 62.5%
H3: Size Hierarchy Doesn't Hold ❌
- No consistent Small < Base < Large hierarchy
- DINOv3-Base peaks at 50%, then declines
- Optimal model depends on data regime
< 1000 images: Use DINOv3 (frozen)
- ROI: 10-20× on annotation costs
- Better robustness on hard cases
> 2000 images: Use Baseline U-Net
- Simpler, faster, better performance
- Lower computational requirements
- Python 3.11+
- PyTorch 2.0+
# Clone the repository
git clone https://github.com/getrichthroughcode/dinov3-isic2018-segmentation.git
cd dinov3-isic2018-segmentation
# Create virtual environment
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Install package in development mode
pip install -e .-
Download ISIC2018 dataset from official source
-
The dataset is managed automatically using the any-gold library:
- Images are cached locally for efficient access
- No manual organization required
- Data splits handled by the library
Train Baseline U-Net (100% data):
python scripts/train.py \
--model baseline \
--data-fraction 1.0 \
--epochs 50 \
--batch-size 8 \
--lr 3e-4 \
--output-dir runs/baseline_100_percentTrain DINOv3-Base (50% data):
python scripts/train.py \
--model dinov3_base \
--data-fraction 0.5 \
--epochs 50 \
--batch-size 8 \
--lr 3e-4 \
--output-dir runs/dinov3b_unet_50_percentpython scripts/evaluate.py \
--model-path runs/baseline_100_percent/best.pt \
--data-split test \
--output-dir results/python scripts/visualize_samples.py \
--model-path runs/baseline_100_percent/best.pt \
--num-samples 10 \
--output-dir visualizations/dinov3-isic2018-segmentation/
│
├── src/dinoseg/ # Main package
│ ├── models/
│ │ ├── baseline_unet.py # U-Net implementation
│ │ └── dino_v3_unet.py # DINOv3-UNet architecture
│ ├── training/
│ │ └── trainer.py # Training loop
│ ├── data/
│ │ └── loader.py # Data loading with any-gold
│ └── utils/
│ ├── metrics.py # Dice, HD95 metrics
│ ├── viz.py # Visualization utilities
│ └── seed.py # Reproducibility
│
├── scripts/ # Executable scripts
│ ├── train.py # Training script
│ ├── evaluate.py # Evaluation script
│ └── visualize_samples.py # Visualization
│
├── assets/ # Result visualizations
│ ├── full_data/ # 100% data results
│ ├── moderate_data/ # 50% data results
│ └── low_data/ # 25% data results
│
├── tests/ # Unit tests
│ ├── test_forward.py # Model forward pass tests
│ └── test_metrics.py # Metric calculation tests
│
│
├── requirements.txt # Python dependencies
├── pyproject.toml # Package configuration
├── Makefile # Common commands
├── README.md # This file
└── LICENSE # MIT License
**Note**: Dataset is managed by `any-gold` library and cached locally (default: `~/.cache/isic2018/`)
- 25%: ~650 images (low-data scenario)
- 50%: ~1,300 images (medium-data scenario)
- 100%: ~2,594 images (full dataset)
All models trained with:
- Optimizer: AdamW (lr=3e-4, weight_decay=1e-4)
- Scheduler: CosineAnnealingLR
- Loss: Binary Cross-Entropy with Logits
- Batch size: 8
- Epochs: 100 (with early stopping)
- Preprocessing: Images resized to 256×256
- Dice Coefficient:
- Intersection over Union (Jaccard):
Sample visualizations from the experiments:
| 25% Data | 50% Data | 100% Data |
|---|---|---|
![]() |
![]() |
![]() |
Examples showing model calibration:
Visualization of inter-model consensus:
More visualizations available in assets/ directory.
# 1. Install dependencies
make install
make train-all
# 3. Evaluate all models
make evaluate-all
# 4. Generate visualizations
make visualize-allNote: The dataset is automatically downloaded and cached by any-gold during first training run.
# Train specific model at specific data fraction
make train MODEL=baseline FRACTION=1.0
make train MODEL=dinov3_base FRACTION=0.5
make train MODEL=dinov3_small FRACTION=0.25
make train MODEL=dinov3_large FRACTION=1.0Standard encoder-decoder architecture:
- Parameters: 7.76M (all trainable)
- Encoder: 4 levels with MaxPool downsampling
- Decoder: Transposed convolution upsampling
- Skip connections: Concatenation
- Trained from scratch on ISIC2018
Hybrid architecture with frozen encoder:
- Encoder: Frozen DINOv3 Vision Transformer (pre-trained on LVD-142M)
- DINO Adapter: Fuses frozen features with spatial details
- Shared Context Aggregator: Extracts global scene understanding
- FAPM: Preserves fine-grained details during feature compression
- Decoder: Standard U-Net decoder
Trainable parameters:
- Small: 4M / 25M (16%)
- Base: 4M / 90M (4%)
- Large: 4M / 152M (2%)
Architecture inspired by Dino U-Net (Gao et al., 2025), re-implemented from scratch.
Detailed analysis of results available in:
- Published Blog: [Link to Medium/Blog] (coming soon - curating it)
If you use this code or findings in your research, please cite:
@misc{diallo2026transfer,
title={Transfer Learning vs Training from Scratch for Medical Image Segmentation:
An Empirical Study on ISIC2018},
author={Diallo, Abdoulaye},
year={2026},
howpublished={GitHub repository},
url={https://github.com/getrichthroughcode/dinov3-isic2018-segmentation}
}This project is licensed under the MIT License - see LICENSE file for details.
- Dataset: ISIC2018 Skin Lesion Analysis Challenge
- Foundation Model: DINOv3 by Meta AI (Oquab et al., 2023)
- Architecture Inspiration: Dino U-Net (Gao et al., 2025)
- Data Management: any-gold library for efficient dataset handling
Abdoulaye Diallo Signal and Image processing Engineer LinkedIn: [https://www.linkedin.com/in/abdiallo-ai]
- Initial release
- Complete implementation of baseline U-Net and DINOv3-UNet variants
- Experiments on 3 data regimes (25%, 50%, 100%)
- Comprehensive analysis with 96 visualizations




