-
Notifications
You must be signed in to change notification settings - Fork 727
Performance Improvement for hydrogenbond analysis for the between
keyword
#5029
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## develop #5029 +/- ##
===========================================
- Coverage 93.62% 93.61% -0.02%
===========================================
Files 177 177
Lines 21995 22037 +42
Branches 3112 3124 +12
===========================================
+ Hits 20593 20629 +36
- Misses 947 949 +2
- Partials 455 459 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
A more standard approach to benchmarking would be to use |
Yep, ASV benchmarks for hbond analysis isn't there, so I just used the |
Maybe wait for someone else to comment on that. I know that upstream we usually don't block PRs because |
Thanks for clarifying! That makes sense, adding ASV benchmarks does make the performance results easier to compare. Adding benchmarks for the Hbond analysis module is actually something I included in my GSoC proposal(pending results), so while I'm not starting that yet, I would be happy to add them if others think it’s useful for this PR. |
@tanishy7777 are you still interested in continuing the PR? Did I read your preliminary benchmark correctly in that your changes improve execution from ~12.5s to ~11s ? That's not an enormous improvement but it's better. As long as the code is still correct and not harder to read/maintain, I'd be supportive of such a improvement. Having and ASV benchmark would be neat but I agree with @tylerjereddy that this would not be a blocker. |
@p-j-smith would you be able to look after this PR, if @tanishy7777 were to continue working on it? |
@p-j-smith If you don't have time, please un-assign yourself. If you take on PR-shepherding, feel free to close the PR once you consider it stale. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @tanishy7777 for tackling this!
Adding benchmakrs with asv
would be nice but it's not a blocker. But it would be good to know how you did the benchmarking (size of the system, number of frames, atom selections, arguments passed to HydrogenBondAnalysis
etc.).
"No hydrogen bonds were found given d-a cutoff of " | ||
f"{self.d_a_cutoff} between Donor, {self.donors_sel}, and " | ||
f"Acceptor, {self.acceptors_sel}." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this warning is no longer accurate - it should warn that no hydrogen bonds were found between groups 1 and 2 for the given cutoff and atom selections
tmp_donors = donors_for_dist[d_a_indices_12.T[0]] | ||
tmp_acceptors = acceptors_for_dist[d_a_indices_12.T[1]] | ||
distance = d_a_distances_12 | ||
|
||
# Fetch hydrogens bonded to the tmp_donorss | ||
# TODO: figure out vectorization | ||
for i in range(len(d_a_indices_12)): | ||
if tmp_donors[i].index in donor_idx_to_H_indices: | ||
possible_H_indices = donor_idx_to_H_indices[tmp_donors[i].index] | ||
for h_idx in possible_H_indices: | ||
tmp_hbond_candidates.append( | ||
(tmp_donors[i].index, h_idx, tmp_acceptors[i].index, distance[i]) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this work?
hydrogens_for_dist = self._hydrogens[tmp_donors_g1_mask]
tmp_hydrogens = self._hydrogens[d_a_indices.T[0]]
self._donors
and self._hydrogens
have the same size, and are paired (i.e. if we want to keep donors tmp_donors_g1_mask
we also want to keep hydrogens tmp_donors_g1_mask
.
box=box, | ||
return_distances=True, | ||
) | ||
if self.between_ags is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please could you move everything in this if
block into a separate function to make _single_frame
a little easier to read?
tmp_donors_g1_mask = np.isin(self._donors.indices, group1.indices) | ||
donors_for_dist = self._donors[tmp_donors_g1_mask] | ||
tmp_acceptors_g2_mask = np.isin(self._acceptors.indices, group2.indices) | ||
acceptors_for_dist = self._acceptors[tmp_acceptors_g2_mask] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tmp_donors_mask = np.logical_or(
np.isin(self._donors.indices, group1.indices),
np.isin(self._donors.indices, group2.indices),
)
donors_for_dist = self._donors[tmp_donors_mask]
tmp_acceptors_mask = np.logical_or(
np.isin(self._acceptors.indices, group1.indices),
np.isin(self._acceptors.indices, group2.indices),
)
acceptors_for_dist = self._acceptors[tmp_acceptors_mask]
I think if you do this you only need to make one call to capped_distance
per pair of groups, rather than calculating distances between g1 donors to g2 acceptors and then g2 donors to g1 donors
unique_hbond_dict = {} | ||
for d_idx, h_idx, a_idx, dist in tmp_hbond_candidates: | ||
key = (d_idx, h_idx, a_idx) | ||
if key not in unique_hbond_dict: | ||
unique_hbond_dict[key] = dist |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think if you do tmp_hydrogens = self._hydrogens[d_a_indices.T[0]]
as suggested above, then all the hydrogen bonds will be unique anyway and you won't need this dictionary
Fixes #4130
Changes made in this Pull Request:
between
keyword. Then get the donor and acceptors based on the distance cutoff (capped distance). Then combine the results at the end.PR Checklist
package/CHANGELOG
file updated?package/AUTHORS
? (If it is not, add it!)Developers Certificate of Origin
I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.
📚 Documentation preview 📚: https://mdanalysis--5029.org.readthedocs.build/en/5029/