Skip to content

Conversation

tanishy7777
Copy link
Contributor

@tanishy7777 tanishy7777 commented Apr 16, 2025

Fixes #4130

Changes made in this Pull Request:

  • Moved the capped distance calculation after the filtering using the between keyword.
  • Now we loop through all pairs of atom groups which we get from the between keyword. Then get the donor and acceptors based on the distance cutoff (capped distance). Then combine the results at the end.

PR Checklist

  • Issue raised/referenced?
  • Tests updated/added?
  • Documentation updated/added?
  • package/CHANGELOG file updated?
  • Is your name in package/AUTHORS? (If it is not, add it!)

Developers Certificate of Origin

I certify that I can submit this code contribution as described in the Developer Certificate of Origin, under the MDAnalysis LICENSE.


📚 Documentation preview 📚: https://mdanalysis--5029.org.readthedocs.build/en/5029/

@tanishy7777
Copy link
Contributor Author

tanishy7777 commented Apr 16, 2025

Timing Benchmark comparing the original and this implementation

Improved implementation:
image
Original Implementation:
image

Copy link

codecov bot commented Apr 16, 2025

Codecov Report

Attention: Patch coverage is 90.16393% with 6 lines in your changes missing coverage. Please review.

Project coverage is 93.61%. Comparing base (af9848b) to head (8ce4165).
Report is 1 commits behind head on develop.

Files with missing lines Patch % Lines
...DAnalysis/analysis/hydrogenbonds/hbond_analysis.py 90.16% 2 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #5029      +/-   ##
===========================================
- Coverage    93.62%   93.61%   -0.02%     
===========================================
  Files          177      177              
  Lines        21995    22037      +42     
  Branches      3112     3124      +12     
===========================================
+ Hits         20593    20629      +36     
- Misses         947      949       +2     
- Partials       455      459       +4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@tylerjereddy
Copy link
Member

A more standard approach to benchmarking would be to use asv--I suppose we don't have such benchmarks written for the H-bond analysis yet based on git grep of our benchmarks folder contents, though other examples are there.

@tanishy7777
Copy link
Contributor Author

tanishy7777 commented Apr 24, 2025

A more standard approach to benchmarking would be to use asv--I suppose we don't have such benchmarks written for the H-bond analysis yet based on git grep of our benchmarks folder contents, though other examples are there.

Yep, ASV benchmarks for hbond analysis isn't there, so I just used the timeit library.
Should I write a benchmark for the between keyword to assess the efficiency before and after the changes?

@tylerjereddy
Copy link
Member

Maybe wait for someone else to comment on that. I know that upstream we usually don't block PRs because asv benchmarks don't exist yet. I just tend to find it way easier to see the results in a common approach/format, and if I'm feeling skeptical I can then run them locally and see for myself.

@tanishy7777
Copy link
Contributor Author

Maybe wait for someone else to comment on that. I know that upstream we usually don't block PRs because asv benchmarks don't exist yet. I just tend to find it way easier to see the results in a common approach/format, and if I'm feeling skeptical I can then run them locally and see for myself.

Thanks for clarifying! That makes sense, adding ASV benchmarks does make the performance results easier to compare.

Adding benchmarks for the Hbond analysis module is actually something I included in my GSoC proposal(pending results), so while I'm not starting that yet, I would be happy to add them if others think it’s useful for this PR.

@orbeckst
Copy link
Member

@tanishy7777 are you still interested in continuing the PR?

Did I read your preliminary benchmark correctly in that your changes improve execution from ~12.5s to ~11s ? That's not an enormous improvement but it's better. As long as the code is still correct and not harder to read/maintain, I'd be supportive of such a improvement.

Having and ASV benchmark would be neat but I agree with @tylerjereddy that this would not be a blocker.

@orbeckst
Copy link
Member

@p-j-smith would you be able to look after this PR, if @tanishy7777 were to continue working on it?

@orbeckst
Copy link
Member

@p-j-smith If you don't have time, please un-assign yourself.

If you take on PR-shepherding, feel free to close the PR once you consider it stale.

Copy link
Member

@p-j-smith p-j-smith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @tanishy7777 for tackling this!

Adding benchmakrs with asv would be nice but it's not a blocker. But it would be good to know how you did the benchmarking (size of the system, number of frames, atom selections, arguments passed to HydrogenBondAnalysis etc.).

Comment on lines +718 to +720
"No hydrogen bonds were found given d-a cutoff of "
f"{self.d_a_cutoff} between Donor, {self.donors_sel}, and "
f"Acceptor, {self.acceptors_sel}."
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this warning is no longer accurate - it should warn that no hydrogen bonds were found between groups 1 and 2 for the given cutoff and atom selections

Comment on lines +724 to +736
tmp_donors = donors_for_dist[d_a_indices_12.T[0]]
tmp_acceptors = acceptors_for_dist[d_a_indices_12.T[1]]
distance = d_a_distances_12

# Fetch hydrogens bonded to the tmp_donorss
# TODO: figure out vectorization
for i in range(len(d_a_indices_12)):
if tmp_donors[i].index in donor_idx_to_H_indices:
possible_H_indices = donor_idx_to_H_indices[tmp_donors[i].index]
for h_idx in possible_H_indices:
tmp_hbond_candidates.append(
(tmp_donors[i].index, h_idx, tmp_acceptors[i].index, distance[i])
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would this work?

hydrogens_for_dist = self._hydrogens[tmp_donors_g1_mask]
tmp_hydrogens = self._hydrogens[d_a_indices.T[0]]

self._donors and self._hydrogens have the same size, and are paired (i.e. if we want to keep donors tmp_donors_g1_mask we also want to keep hydrogens tmp_donors_g1_mask.

box=box,
return_distances=True,
)
if self.between_ags is not None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please could you move everything in this if block into a separate function to make _single_frame a little easier to read?

Comment on lines +703 to +706
tmp_donors_g1_mask = np.isin(self._donors.indices, group1.indices)
donors_for_dist = self._donors[tmp_donors_g1_mask]
tmp_acceptors_g2_mask = np.isin(self._acceptors.indices, group2.indices)
acceptors_for_dist = self._acceptors[tmp_acceptors_g2_mask]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

                tmp_donors_mask = np.logical_or(
                    np.isin(self._donors.indices, group1.indices),
                    np.isin(self._donors.indices, group2.indices),
                 )
                donors_for_dist = self._donors[tmp_donors_mask] 
                tmp_acceptors_mask = np.logical_or(
                    np.isin(self._acceptors.indices, group1.indices),
                    np.isin(self._acceptors.indices, group2.indices),
                ) 
                acceptors_for_dist = self._acceptors[tmp_acceptors_mask] 

I think if you do this you only need to make one call to capped_distance per pair of groups, rather than calculating distances between g1 donors to g2 acceptors and then g2 donors to g1 donors

Comment on lines +774 to +778
unique_hbond_dict = {}
for d_idx, h_idx, a_idx, dist in tmp_hbond_candidates:
key = (d_idx, h_idx, a_idx)
if key not in unique_hbond_dict:
unique_hbond_dict[key] = dist
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if you do tmp_hydrogens = self._hydrogens[d_a_indices.T[0]] as suggested above, then all the hydrogen bonds will be unique anyway and you won't need this dictionary

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

potential performance improvements for H-bond analysis 'between'
4 participants