Accept both ptb/ptb_xl and mimic/mimic_iv as splitter keys#3
Open
TonyChen06 wants to merge 1 commit intoELM-Research:mainfrom
Open
Accept both ptb/ptb_xl and mimic/mimic_iv as splitter keys#3TonyChen06 wants to merge 1 commit intoELM-Research:mainfrom
TonyChen06 wants to merge 1 commit intoELM-Research:mainfrom
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TLDR: I think the HuggingFace datasets (tested for the relevant datasets) don't actually split patient-wise. Please double check yourself but for me it doesn't work. If so, this is probably the root cause. vv
Splitter._PATIENT_EXTRACTORS is keyed on "ptb" and "mimic", which only matches inputs whose ecg_path segments are data/ptb/... or data/mimic/.... Inputs that use the longer form (data/ptb_xl/..., data/mimic_iv/...) hit _PATIENT_EXTRACTORS.get(...) → None, so _patient_id returns None for every row, every row gets appended to loose as a singleton group in split_dataset, and the train/test split silently degenerates to row-level random shuffling. The post-split assertion doesn't catch this because the comprehensions building train_pids / test_pids filter out None with if splitter._patient_id(...), leaving both sets empty so the intersection is trivially empty. Empirically, willxxy's HF parquets — which use the long form — show 99.9% / 91% / 40% / 57% test-ECG-in-train overlap on ecg-qa-ptbxl-250-2500, ecg-qa-mimic-iv-ecg-250-2500, ecg-instruct-pulse-250-2500, ecg-instruct-45k-250-2500 respectively (leakage rate scales with questions-per-ECG). Fix is to add aliases so both path conventions resolve to the same extractor.