@prateekdesai04 (Collaborator) commented Oct 23, 2023

Description of changes:
This PR handles the case where multiple cleaned CSVs that were run with different numbers of folds are evaluated together.
Previously, evaluation was only possible if all of them used the same number of folds.
This change sets the number of folds to the smallest number used by any of the cleaned CSVs being evaluated.

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Comment on lines +160 to +168
dataframes = []
for path in paths:
    path = path if is_s3_url(path) else os.path.join(self.results_dir_input, path)
    dataframe = pd.read_csv(path)
    dataframes.append(dataframe)
# Discarding extra folds
min_num_rows = min(len(df) for df in dataframes)
trimmed_dataframes = [df[:min_num_rows] for df in dataframes]
return pd.concat(trimmed_dataframes, ignore_index=True, sort=True)
@Innixma (Contributor) Oct 23, 2023

This will not discard extra folds properly. Please add a unit test and separate out the filtering logic so it is not hard-coded into the load_results_raw method.

  1. Not all loaded DataFrames will have the same number of methods or datasets, so trimming by row count will not work.
  2. We don't want to always filter extra folds. This should be a post-load operation that is optional.
  3. You are assuming the input file is sorted by fold. This is not a valid assumption.
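
One way the three points above could be addressed is sketched below. This is only an illustration, not the repository's actual implementation: the function names `concat_results` and `filter_to_common_folds`, and the column names `dataset`, `framework`, and `fold`, are assumptions. Loading stays a plain concatenation, and the fold filter is a separate, optional step that compares fold values per dataset rather than counting rows, so it does not depend on row order.

```python
import pandas as pd


def concat_results(dataframes):
    """Load step only: combine per-file results without dropping anything."""
    return pd.concat(dataframes, ignore_index=True, sort=True)


def filter_to_common_folds(results, dataset_col="dataset",
                           framework_col="framework", fold_col="fold"):
    """Optional post-load step (hypothetical helper).

    Keeps a (dataset, fold) pair only if every framework that has results
    for that dataset also has a result for that fold.
    """
    # Distinct frameworks that ran each dataset, broadcast to every row.
    n_frameworks = results.groupby(dataset_col)[framework_col].transform("nunique")
    # Distinct frameworks that have this specific (dataset, fold) pair.
    n_present = results.groupby([dataset_col, fold_col])[framework_col].transform("nunique")
    return results[n_present == n_frameworks].reset_index(drop=True)


def test_filter_to_common_folds():
    # File A: two datasets, three folds each, deliberately not sorted by fold.
    a = pd.DataFrame({
        "dataset": ["d1", "d1", "d1", "d2", "d2", "d2"],
        "fold": [2, 0, 1, 1, 2, 0],
        "framework": "A",
    })
    # File B: only one dataset and only fold 0.
    b = pd.DataFrame({"dataset": ["d1"], "fold": [0], "framework": "B"})

    kept = filter_to_common_folds(concat_results([a, b]))
    rows = set(kept[["dataset", "framework", "fold"]].itertuples(index=False, name=None))

    # d1 is shared, so only the common fold 0 survives for both frameworks.
    # d2 was only run by A, so all of its folds are kept.
    assert rows == {
        ("d1", "A", 0), ("d1", "B", 0),
        ("d2", "A", 0), ("d2", "A", 1), ("d2", "A", 2),
    }


test_filter_to_common_folds()
```

Because the filter is a standalone function, callers who want to keep every fold can simply skip it, and the unit test does not need to touch the file-loading code.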

@Innixma (Contributor) left a comment

Refer to the comment above.

    dataframe = pd.read_csv(path)
    dataframes.append(dataframe)
# Discarding extra folds
min_num_rows = min(len(df) for df in dataframes)
Collaborator

What if there are multiple datasets in the results file? min() will not do what is intended, right?
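
A small reproduction of this concern, using made-up data: the column names `dataset`, `fold`, and `framework` are assumptions about the cleaned CSV layout, and the two frames stand in for files read with `pd.read_csv`. The trimming lines are the ones from the diff.

```python
import pandas as pd

# Stand-in for file A: two datasets, three folds each, not sorted by fold.
a = pd.DataFrame({
    "dataset": ["d1", "d1", "d1", "d2", "d2", "d2"],
    "fold": [2, 0, 1, 0, 1, 2],
    "framework": "A",
})
# Stand-in for file B: one dataset, two folds.
b = pd.DataFrame({"dataset": ["d1", "d1"], "fold": [0, 1], "framework": "B"})

dataframes = [a, b]
# Trimming logic as written in the diff.
min_num_rows = min(len(df) for df in dataframes)
trimmed_dataframes = [df[:min_num_rows] for df in dataframes]
combined = pd.concat(trimmed_dataframes, ignore_index=True, sort=True)

# Dataset d2 is dropped entirely, and framework A keeps folds 2 and 0
# of d1 instead of the folds 0 and 1 that it shares with framework B.
print(combined)
```

Here `min_num_rows` is 2, which is a row count for file B and has no relation to the number of folds per dataset in file A.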

4 participants