Accept both ptb/ptb_xl and mimic/mimic_iv as splitter keys #3

Open
TonyChen06 wants to merge 1 commit into ELM-Research:main from TonyChen06:fix-splitter-path-aliases

@TonyChen06

TL;DR: I think the HuggingFace datasets (at least the ones I tested, listed below) are not actually split patient-wise. Please double-check on your end, but I could not get a patient-wise split out of them. If that holds, the following is probably the root cause.

Splitter._PATIENT_EXTRACTORS is keyed on "ptb" and "mimic", so it only matches inputs whose ecg_path segments are data/ptb/... or data/mimic/.... For inputs that use the longer form (data/ptb_xl/..., data/mimic_iv/...), _PATIENT_EXTRACTORS.get(...) returns None, so _patient_id returns None for every row. In split_dataset, every row is then appended to loose as a singleton group, and the train/test split silently degenerates to row-level random shuffling.

The post-split assertion doesn't catch this. The comprehensions that build train_pids / test_pids filter out None with if splitter._patient_id(...), which leaves both sets empty, so their intersection is trivially empty and the check passes.

Empirically, willxxy's HF parquets, which use the long form, show the following test-ECG-in-train overlap (the leakage rate scales with questions-per-ECG):

- ecg-qa-ptbxl-250-2500: 99.9%
- ecg-qa-mimic-iv-ecg-250-2500: 91%
- ecg-instruct-pulse-250-2500: 40%
- ecg-instruct-45k-250-2500: 57%

The fix is to add aliases so that both path conventions resolve to the same extractor.
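To make the failure mode concrete, here is a minimal, self-contained sketch. It does not reproduce the repository's Splitter; the extractor functions, the `patient_id` and `pid_overlap` helpers, and the example paths (including the assumption that the patient directory follows the dataset directory) are all hypothetical stand-ins. It only illustrates how a missed key lookup makes the overlap check vacuous, and how alias keys restore it.

```python
from typing import Callable, Dict, List, Optional, Set

Extractor = Callable[[str], str]


# Hypothetical extractors: assume the path segment right after the dataset
# directory identifies the patient. The real extractors may differ.
def _ptb_patient(path: str) -> str:
    return "ptb:" + path.split("/")[2]


def _mimic_patient(path: str) -> str:
    return "mimic:" + path.split("/")[2]


# Short keys only: mirrors the behaviour described in the PR text.
SHORT_ONLY: Dict[str, Extractor] = {"ptb": _ptb_patient, "mimic": _mimic_patient}

# Alias keys added so both path conventions resolve to the same extractor.
WITH_ALIASES: Dict[str, Extractor] = {
    **SHORT_ONLY,
    "ptb_xl": _ptb_patient,
    "mimic_iv": _mimic_patient,
}


def patient_id(ecg_path: str, extractors: Dict[str, Extractor]) -> Optional[str]:
    """Return a patient id, or None when the dataset segment has no extractor."""
    parts = ecg_path.split("/")
    if len(parts) < 3:
        return None
    extractor = extractors.get(parts[1])  # None for an unknown dataset key
    return extractor(ecg_path) if extractor else None


def pid_overlap(
    train: List[str], test: List[str], extractors: Dict[str, Extractor]
) -> Set[str]:
    """Patient ids present in both splits, dropping None as the assertion does."""
    train_pids = {patient_id(p, extractors) for p in train if patient_id(p, extractors)}
    test_pids = {patient_id(p, extractors) for p in test if patient_id(p, extractors)}
    return train_pids & test_pids


# Made-up long-form paths; patient p001 appears in both splits.
train_paths = ["data/ptb_xl/p001/rec_a", "data/ptb_xl/p002/rec_b"]
test_paths = ["data/ptb_xl/p001/rec_c"]

# Short keys only: every id is None, both sets are empty, the check is vacuous.
print(pid_overlap(train_paths, test_paths, SHORT_ONLY))

# With aliases: the shared patient is detected.
print(pid_overlap(train_paths, test_paths, WITH_ALIASES))
```

With the short keys only, the overlap comes back empty even though patient p001 is in both splits, which is why an assertion built this way cannot fail. A stricter check might also assert that the patient-id sets are non-empty, though whether that suits this codebase is a judgment call for the maintainers.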
