Avoid using iterrows, use vectorization wherever possible #120

Draft: wants to merge 3 commits into base: dev

Conversation

aditya0by0
Member

@aditya0by0 aditya0by0 requested a review from sfluegel05 August 9, 2025 15:46
@aditya0by0 aditya0by0 self-assigned this Aug 9, 2025
@aditya0by0 aditya0by0 added the "priority: low", "bug:fix", and "enhancement" labels and removed the "bug:fix" label Aug 9, 2025
@aditya0by0 aditya0by0 marked this pull request as ready for review August 9, 2025 17:05
Comment on lines -249 to +253
- for index, row in data_frame.iterrows():
+ for row in data_frame.itertuples(index=False):
      train_data.append(
          [
-             data_frame.iloc[index].values[1],
-             data_frame.iloc[index].values[2:502].tolist(),
+             row.SMILES,
+             row.LABELS,
Member Author

Not sure about these lines of code, or whether this was the intended change.
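
One way to check whether a replacement like this preserves behavior is to compare both loops on a small frame. The sketch below is only an illustration: the column layout (an id column, a `SMILES` column, then one column per label) and the names `id`, `label_0`, and `label_1` are assumptions, not the project's actual schema. It uses positional access on the tuples so that the column semantics of the original `values[1]` and `values[2:502]` lookups stay the same; `row.SMILES` and `row.LABELS` would be equivalent only if the frame really has columns with those names and a single `LABELS` column. Note also that `data_frame.iloc[index]` matches the current row only when the index is a default `RangeIndex`.

```python
import pandas as pd

# Hypothetical layout (an assumption): id, SMILES, then one column per label.
data_frame = pd.DataFrame(
    {
        "id": [10, 11, 12],
        "SMILES": ["C", "CC", "CCO"],
        "label_0": [True, False, True],
        "label_1": [False, False, True],
    }
)

# Original pattern: iterrows plus a positional row lookup on every iteration.
old_rows = []
for index, row in data_frame.iterrows():
    old_rows.append(
        [
            data_frame.iloc[index].values[1],
            data_frame.iloc[index].values[2:502].tolist(),
        ]
    )

# itertuples yields one namedtuple per row; positional access keeps the
# same column semantics as the original values[1] / values[2:502] lookups.
new_rows = []
for row in data_frame.itertuples(index=False):
    new_rows.append([row[1], list(row[2:502])])

assert old_rows == new_rows
```

If the two lists compare equal on a representative sample of the real data, the loop change is behavior-preserving for that layout.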

@aditya0by0 aditya0by0 changed the title Avoid using iterrows Avoid using iterrows + vectorization for performance Aug 9, 2025
@aditya0by0 aditya0by0 changed the title Avoid using iterrows + vectorization for performance Avoid using iterrows, use vectorization wherever possible Aug 9, 2025
@aditya0by0 aditya0by0 marked this pull request as draft August 10, 2025 10:28
Collaborator

@sfluegel05 sfluegel05 left a comment

Thanks for implementing this. I created a new dataset with these changes and it worked, although there was no major performance boost because the most time-intensive part is the split generation.
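
For the row-building step on its own, a column-wise variant avoids the per-row Python loop entirely. This is a minimal sketch under the same assumed layout as above (hypothetical `id`, `SMILES`, and per-label columns), not the code in this PR; as noted, it would not change the overall runtime much if split generation dominates.

```python
import pandas as pd

# Hypothetical layout (an assumption): id, SMILES, then one column per label.
data_frame = pd.DataFrame(
    {
        "id": [10, 11, 12],
        "SMILES": ["C", "CC", "CCO"],
        "label_0": [True, False, True],
        "label_1": [False, False, True],
    }
)

# Pull each piece out column-wise instead of visiting rows one by one.
smiles = data_frame["SMILES"].tolist()
labels = data_frame.iloc[:, 2:502].to_numpy().tolist()

# Pair each SMILES string with its list of label values.
train_data = [[s, l] for s, l in zip(smiles, labels)]
```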

@aditya0by0
Member Author

OK, I have a few minor changes that I will commit later. I will mark the PR ready for review once done.

Labels: enhancement, priority: low
Development

Successfully merging this pull request may close these issues.

Replace df.iterrows() with df.itertuples() for Significant Performance Gains
2 participants