📚 Table of Contents (Datasets Section)

📁 Dataset Directory Structure

The datasets used in this project follow the directory structure below. Each dataset has its own subdirectory containing the files needed for training and evaluation.

data/
├── celeba/
│   ├── img_align_celeba/
│   ├── list_attr_celeba.csv
│   ├── list_bbox_celeba.csv
│   ├── list_eval_partition.csv
│   ├── list_landmarks_align_celeba.csv
│   ├── metadata_celeba.csv
│   ├── va_metadata_celeba_captioning_blip.csv
│   └── va_metadata_celeba_captioning_GPT.csv
├── metashift/
│   ├── metadata_metashift.csv
│   ├── metadata_metashift_captioning.csv
│   ├── te_metadata_metashift_captioning.csv
│   ├── va_metadata_metashift_captioning_blip.csv
│   ├── va_metadata_metashift_captioning_gpt.csv
│   ├── va_metadata_metashift_captioning_GPT.csv
│   └── MetaShift-Cat-Dog-indoor-outdoor/
│       ├── train/
│       └── test/
├── nih/
│   ├── mimic-cxr-chexpert.csv
│   └── nih_processed_v2.csv
├── RSNA_Cancer_Detection/
│   └── rsna_w_upmc_concepts_breast_clip.csv
├── Vindr/
│   └── vindr-mammo-a-large-scale-benchmark-dataset-for-computer-aided-detection-and-diagnosis-in-full-field-digital-mammography-1.0.0/
│       ├── breast-level_annotations.csv
│       ├── finding_annotations.csv
│       └── vindr_detection_v1_folds_abnormal.csv
└── waterbirds/
    ├── metadata_waterbirds.csv
    ├── va_metadata_waterbirds_captioning_blip.csv
    ├── va_metadata_waterbirds_captioning_GPT.csv
    └── waterbird_complete95_forest2water2/
        ├── 001.Black_footed_Albatross/
        ├── 002.Laysan_Albatross/
        ├── 003.Sooty_Albatross/
        ├── ...
        ├── 200.Common_Yellowthroat/
        └── metadata.csv
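
Before training, it can help to confirm that a local copy matches this layout. The snippet below is a minimal sketch, not part of the repository: `EXPECTED_PATHS` and `find_missing` are hypothetical names, the list covers only a subset of the tree above, and the data root is assumed to be `data/`.

```python
from pathlib import Path

# Subset of the layout listed above; paths are relative to the data root.
# This list is illustrative and can be extended with the remaining entries.
EXPECTED_PATHS = [
    "celeba/img_align_celeba",
    "celeba/list_attr_celeba.csv",
    "celeba/metadata_celeba.csv",
    "metashift/metadata_metashift.csv",
    "metashift/MetaShift-Cat-Dog-indoor-outdoor/train",
    "metashift/MetaShift-Cat-Dog-indoor-outdoor/test",
    "nih/nih_processed_v2.csv",
    "RSNA_Cancer_Detection/rsna_w_upmc_concepts_breast_clip.csv",
    "waterbirds/metadata_waterbirds.csv",
    "waterbirds/waterbird_complete95_forest2water2/metadata.csv",
]


def find_missing(data_root, expected=EXPECTED_PATHS):
    """Return the expected relative paths that do not exist under data_root."""
    root = Path(data_root)
    return [rel for rel in expected if not (root / rel).exists()]


if __name__ == "__main__":
    # "data" is an assumed root; pass the actual location of your datasets.
    missing = find_missing("data")
    if missing:
        print("Missing entries:")
        for rel in missing:
            print(f"  {rel}")
    else:
        print("All checked entries are present.")
```

The check only tests for existence, so it will not catch partially downloaded image folders.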

📥 Dataset Download

We rely heavily on the Subpopulation Shift Benchmark (SubpopBench) codebase for downloading and processing datasets. The compatibility modifications it needs are included in our repo under src/codebase/SubpopBench-main.

✅ Automated Download (Waterbirds & MetaShift)

Use the following command to download the Waterbirds and MetaShift datasets:

python ./src/codebase/SubpopBench-main/subpopbench/scripts/download.py \
  --datasets "waterbirds" "metashift" \
  --data_path "Ladder/data/new" \
  --download True

📎 Manual Download Required

The following datasets must be downloaded manually from their official sources:

Preprocessing Mammograms

Follow the steps in the Mammo-CLIP codebase to preprocess the mammograms for the RSNA and VinDr datasets. This step converts the DICOM images into the PNG format used in our paper. We have also uploaded the VinDr PNG images here. If you download the VinDr PNG images from the above link, do not run the preprocessing for VinDr again.
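
The rule above (preprocess only when PNG images are not already available) can be sketched as a small check. This is an illustration under assumptions, not part of the Mammo-CLIP or Ladder code: `needs_preprocessing` is a hypothetical helper, and the two `images_png` directories are placeholder locations that do not appear in the directory structure above.

```python
from pathlib import Path


def needs_preprocessing(png_dir):
    """Return True if png_dir holds no PNG files, i.e. DICOM conversion is still needed."""
    png_dir = Path(png_dir)
    if not png_dir.is_dir():
        return True
    # rglob also searches subdirectories, since images may be nested per patient or study.
    return next(png_dir.rglob("*.png"), None) is None


if __name__ == "__main__":
    # Hypothetical locations; adjust to wherever the PNG images are stored.
    candidates = {
        "RSNA": "data/RSNA_Cancer_Detection/images_png",
        "VinDr": "data/Vindr/images_png",
    }
    for name, png_dir in candidates.items():
        if needs_preprocessing(png_dir):
            print(f"{name}: no PNG images found, run the Mammo-CLIP preprocessing")
        else:
            print(f"{name}: PNG images found, skip preprocessing")
```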

Metadata Files (including the annotations and attributes)

The metadata files to train the classifier are provided here:
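
Once a metadata file is in place, a quick look at its header and row count can confirm that the download is complete. The sketch below makes no assumption about column names; `summarize_csv` is a hypothetical helper, and the example path is taken from the directory structure above.

```python
import csv
from pathlib import Path


def summarize_csv(csv_path):
    """Return (column_names, row_count) for a CSV file with a header row."""
    with open(csv_path, newline="") as f:
        reader = csv.DictReader(f)
        # fieldnames is None for an empty file, so fall back to an empty list.
        columns = list(reader.fieldnames or [])
        row_count = sum(1 for _ in reader)
    return columns, row_count


if __name__ == "__main__":
    path = Path("data/waterbirds/metadata_waterbirds.csv")
    if path.exists():
        columns, row_count = summarize_csv(path)
        print(f"{path}: {row_count} rows, columns = {columns}")
    else:
        print(f"{path} not found")
```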