Ladder/dataset_zoo.md at main · batmanlab/Ladder

📚 Table of Contents (Datasets Section)

- 📁 Dataset Directory Structure
- 📥 Dataset Download
  - ✅ Automated Download (Waterbirds & MetaShift)
  - 📎 Manual Download Required
- 🧪 Preprocessing Mammograms
- 🗂️ Metadata Files

📁 Dataset Directory Structure

We follow the directory structure below for the datasets used in this project. The datasets are organized into subdirectories, each containing the necessary files for training and evaluation.

data/
├── celeba/
│   ├── img_align_celeba/
│   ├── list_attr_celeba.csv
│   ├── list_bbox_celeba.csv
│   ├── list_eval_partition.csv
│   ├── list_landmarks_align_celeba.csv
│   ├── metadata_celeba.csv
│   ├── va_metadata_celeba_captioning_blip.csv
│   └── va_metadata_celeba_captioning_GPT.csv
├── metashift/
│   ├── metadata_metashift.csv
│   ├── metadata_metashift_captioning.csv
│   ├── te_metadata_metashift_captioning.csv
│   ├── va_metadata_metashift_captioning_blip.csv
│   ├── va_metadata_metashift_captioning_gpt.csv
│   ├── va_metadata_metashift_captioning_GPT.csv
│   └── MetaShift-Cat-Dog-indoor-outdoor/
│       ├── train/
│       └── test/
├── nih/
│   ├── mimic-cxr-chexpert.csv
│   └── nih_processed_v2.csv
├── RSNA_Cancer_Detection/
│   └── rsna_w_upmc_concepts_breast_clip.csv
├── Vindr/
│   └── vindr-mammo-a-large-scale-benchmark-dataset-for-computer-aided-detection-and-diagnosis-in-full-field-digital-mammography-1.0.0/
│       ├── breast-level_annotations.csv
│       ├── finding_annotations.csv
│       └── vindr_detection_v1_folds_abnormal.csv
└── waterbirds/
    ├── metadata_waterbirds.csv
    ├── va_metadata_waterbirds_captioning_blip.csv
    ├── va_metadata_waterbirds_captioning_GPT.csv
    └── waterbird_complete95_forest2water2/
        ├── 001.Black_footed_Albatross/
        ├── 002.Laysan_Albatross/
        ├── 003.Sooty_Albatross/
        ├── ...
        ├── 200.Common_Yellowthroat/
        └── metadata.csv
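
Before training, it can help to confirm that the local layout matches the tree above. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` and the `EXPECTED` list are hypothetical, and the list covers only a subset of the entries shown in the tree.

```python
from pathlib import Path

# A subset of the entries from the directory tree above (not the full listing).
EXPECTED = [
    "celeba/img_align_celeba",
    "celeba/metadata_celeba.csv",
    "metashift/metadata_metashift.csv",
    "metashift/MetaShift-Cat-Dog-indoor-outdoor/train",
    "metashift/MetaShift-Cat-Dog-indoor-outdoor/test",
    "nih/nih_processed_v2.csv",
    "RSNA_Cancer_Detection/rsna_w_upmc_concepts_breast_clip.csv",
    "waterbirds/metadata_waterbirds.csv",
    "waterbirds/waterbird_complete95_forest2water2/metadata.csv",
]


def missing_entries(data_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "data" is the top-level folder name used in the tree above.
    for rel in missing_entries("data"):
        print(f"missing: data/{rel}")
```

Run it from the directory that contains `data/`; it prints one line per missing entry and nothing if every listed entry is present.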

📥 Dataset Download

We rely heavily on the Subpopulation Shift Benchmark (SubpopBench) codebase to download and process the datasets. The modifications needed for compatibility with our code are included in this repo under src/codebase/SubpopBench-main:

✅ Automated Download (Waterbirds & MetaShift)

Use the following command to download the Waterbirds and MetaShift datasets:

python ./src/codebase/SubpopBench-main/subpopbench/scripts/download.py \
  --datasets "waterbirds" "metashift" \
  --data_path "Ladder/data/new" \
  --download True
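
If you prefer to trigger the download from Python (for example, inside a setup script), the command above can be wrapped with `subprocess`. This is a sketch under assumptions: the function names `build_download_command` and `run_download` are hypothetical, and the script path and flags are copied verbatim from the command above rather than taken from the script's source.

```python
import subprocess
import sys

# Script path exactly as it appears in the command above.
DOWNLOAD_SCRIPT = "./src/codebase/SubpopBench-main/subpopbench/scripts/download.py"


def build_download_command(datasets, data_path):
    """Build the argument list equivalent to the shell command above."""
    return [
        sys.executable,
        DOWNLOAD_SCRIPT,
        "--datasets",
        *datasets,
        "--data_path",
        data_path,
        "--download",
        "True",
    ]


def run_download(datasets=("waterbirds", "metashift"), data_path="Ladder/data/new"):
    """Run the download script; raises CalledProcessError on a non-zero exit."""
    subprocess.run(build_download_command(datasets, data_path), check=True)


# Example (run from the repository root):
# run_download()
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` makes a failed download stop the calling script instead of passing silently.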

📎 Manual Download Required

The following datasets must be downloaded manually from their official sources:

Preprocessing Mammograms

Follow the steps in the Mammo-CLIP codebase to preprocess the mammograms for the RSNA and VinDr datasets. This step converts the DICOM images into the PNG format used in our paper. We have also uploaded the VinDr PNG images here. If you download the VinDr PNG images from that link, do not preprocess the VinDr dataset again.
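
To avoid preprocessing images that already have a PNG, you can first list which DICOM files still lack a converted counterpart. The sketch below is illustrative only: the helper `pending_conversions`, the assumed file extensions (`.dcm`, `.dicom`), and the idea that PNGs mirror the DICOM folder layout are all assumptions, not the Mammo-CLIP convention. The actual pixel conversion should still be done with the Mammo-CLIP preprocessing steps.

```python
from pathlib import Path

# Assumed DICOM file extensions; adjust to match your download.
DICOM_SUFFIXES = {".dcm", ".dicom"}


def pending_conversions(dicom_dir, png_dir):
    """Return (source, target) pairs for DICOM files with no PNG yet.

    Assumes each PNG keeps the DICOM's relative path with a .png suffix.
    """
    dicom_dir, png_dir = Path(dicom_dir), Path(png_dir)
    pending = []
    for src in sorted(dicom_dir.rglob("*")):
        if src.is_file() and src.suffix.lower() in DICOM_SUFFIXES:
            dst = png_dir / src.relative_to(dicom_dir).with_suffix(".png")
            if not dst.exists():
                pending.append((src, dst))
    return pending
```

An empty result for the VinDr folders would indicate that the downloaded PNGs already cover every DICOM file, so no further preprocessing is needed for that dataset.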

Metadata Files (including the annotations and attributes)

The metadata files to train the classifier are provided here:
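
Once a metadata file is in place, a quick inspection of its columns and row counts can catch a wrong or truncated download. The following is a minimal sketch using only the standard library; the helper name `summarize_metadata` is hypothetical, and the column name in the usage comment (`split`) is an assumption that may not match the actual files.

```python
import csv
from collections import Counter


def summarize_metadata(csv_path, column=None):
    """Return (column names, row count, value counts for `column`)."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        if column is not None and column not in columns:
            raise KeyError(f"column {column!r} not found in {csv_path}")
        counts = Counter()
        n_rows = 0
        for row in reader:
            n_rows += 1
            if column is not None:
                counts[row[column]] += 1
    return columns, n_rows, dict(counts)


# Example ("split" is an assumed column name):
# cols, n, per_split = summarize_metadata(
#     "data/waterbirds/metadata_waterbirds.csv", column="split"
# )
```

Calling it without `column` returns only the header and the row count, which is enough to confirm the file was read correctly.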
