Skip to content

Conversation

@marinegor
Copy link
Contributor

@marinegor marinegor commented Sep 20, 2024

Fixes #2367 and also extends #4303 and solves #5089

Changes made in this Pull Request:

  • uses gemmi library (link) to parse mmcif files
  • adds a class MMCIFReader(base.SingleFrameReaderBase) and class MMCIFParser(TopologyReaderBase) classes for that

As a bonus, this implementation would potentially allow to read any of the gemmi-supported formats (source):

  • mmCIF (PDBx/mmCIF),
  • PDB (with popular extensions),
  • mmJSON

Also, this (with slight modifications) also would allow reading mmcif with multiple models sharing the same topology, as well as more feature-rich parsing of PDBs (the same code without changes can be used for parsing altlocs, charges, etc, from all of these formats).

However, I'm slightly lost on what's to be done next for this PR to be merged, so I'm asking if someone could help me navigate here (tagging @richardjgowers here as author of original PDBx implementation 4303).

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

Developers certificate of origin


📚 Documentation preview 📚: https://mdanalysis--4712.org.readthedocs.build/en/4712/

@pep8speaks
Copy link

pep8speaks commented Sep 20, 2024

Hello @marinegor! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

Line 28:80: E501 line too long (84 > 79 characters)
Line 41:80: E501 line too long (85 > 79 characters)
Line 42:80: E501 line too long (93 > 79 characters)
Line 61:80: E501 line too long (104 > 79 characters)
Line 65:80: E501 line too long (87 > 79 characters)
Line 67:80: E501 line too long (107 > 79 characters)

Line 2:24: W291 trailing whitespace
Line 60:80: E501 line too long (111 > 79 characters)
Line 72:80: E501 line too long (123 > 79 characters)
Line 82:80: E501 line too long (122 > 79 characters)
Line 106:80: E501 line too long (108 > 79 characters)
Line 113:80: E501 line too long (80 > 79 characters)
Line 128:80: E501 line too long (91 > 79 characters)
Line 175:80: E501 line too long (126 > 79 characters)
Line 185:80: E501 line too long (125 > 79 characters)
Line 224:80: E501 line too long (126 > 79 characters)
Line 242:80: E501 line too long (140 > 79 characters)
Line 281:80: E501 line too long (87 > 79 characters)
Line 292:80: E501 line too long (90 > 79 characters)

Line 56:80: E501 line too long (80 > 79 characters)
Line 57:80: E501 line too long (84 > 79 characters)

Line 335:26: W292 no newline at end of file

Line 48:80: E501 line too long (103 > 79 characters)
Line 81:80: E501 line too long (80 > 79 characters)
Line 97:80: E501 line too long (86 > 79 characters)
Line 271:80: E501 line too long (90 > 79 characters)
Line 340:80: E501 line too long (104 > 79 characters)
Line 387:80: E501 line too long (83 > 79 characters)
Line 436:80: E501 line too long (80 > 79 characters)
Line 463:80: E501 line too long (80 > 79 characters)
Line 481:80: E501 line too long (80 > 79 characters)
Line 493:80: E501 line too long (80 > 79 characters)
Line 494:80: E501 line too long (80 > 79 characters)
Line 497:80: E501 line too long (83 > 79 characters)
Line 498:80: E501 line too long (86 > 79 characters)
Line 546:80: E501 line too long (82 > 79 characters)
Line 547:80: E501 line too long (82 > 79 characters)
Line 549:80: E501 line too long (88 > 79 characters)
Line 551:80: E501 line too long (88 > 79 characters)
Line 552:80: E501 line too long (81 > 79 characters)
Line 777:80: E501 line too long (81 > 79 characters)
Line 778:80: E501 line too long (87 > 79 characters)
Line 779:80: E501 line too long (84 > 79 characters)
Line 780:80: E501 line too long (85 > 79 characters)
Line 781:80: E501 line too long (83 > 79 characters)

Comment last updated at 2024-10-25 11:17:29 UTC

@github-actions
Copy link

github-actions bot commented Sep 20, 2024

Linter Bot Results:

Hi @marinegor! Thanks for making this PR. We linted your code and found the following:

Some issues were found with the formatting of your code.

Code Location Outcome
main package ⚠️ Possible failure
testsuite ⚠️ Possible failure

Please have a look at the darker-main-code and darker-test-code steps here for more details: https://github.com/MDAnalysis/mdanalysis/actions/runs/11148966346/job/30986736623


Please note: The black linter is purely informational, you can safely ignore these outcomes if there are no flake8 failures!

Copy link
Member

@richardjgowers richardjgowers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far, will require a small test file to check reader/parser halves.

np.array,
list(
zip(
*[
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm struggling to follow the logic here, a comment breaking down what this double nested loop iteration into a zip is doing would be nice

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a little comment explaining that

"pytng>=0.2.3",
"gsd>3.0.0",
"rdkit>=2020.03.1",
"gemmi", # for mmcif format
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably be optional, so other imports will have to respect that too

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what to do with that -- it's already in the [project.optional-dependencies] table, is there an example of making it more optional?

@marinegor
Copy link
Contributor Author

thanks to amazing contribution by @PardhavMaradani it seems that the current version is working now! Seems that some extra attributes and/or my own implementation of change_squash were messing things up, and I'm very relieved to see it go.

I've re-requested reviews from @orbeckst @yuxuanzhuang @ljwoods2.

One open question that I have: technically, this topology/coordinates parser can also parse PDB without any changes to it, like that:

import MDAnalysis as mda
mda.Universe('testsuite/MDAnalysisTests/data/mmcif/1BD2.pdb.gz', format='mmcif')
# <Universe with 6378 atoms>

do we want to add this option to the existing PDBParser (e.g. introduce there a backend keyword), or document it here somehow? I feel like it's something very powerful (and potentially more compliant with reading from RCSB's pdb files directly), and would like to have in mdanalysis somehow. But I can also open a separate issue with this discussion.

@orbeckst
Copy link
Member

orbeckst commented Nov 5, 2025

I'd open a separate issue for the PDB discussion – it will only delay this one. If it can be treated separately then do it separately. (Orthogonality is great!)

I don't think I have time to review so please don't wait for me.

-------
MDAnalysis Topology object
"""
structure = gemmi.read_structure(self.filename)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bumping so this doesn't get lost, still think this should accept a path obj (unless I'm missing something)

@marinegor
Copy link
Contributor Author

marinegor commented Nov 6, 2025 via email

@ljwoods2
Copy link
Contributor

@marinegor pushed directly here and then reverted, oops. Let me know what you think of the PR, this will allow passing streams, pathlib.Paths, etc, as the existing PDBParser can already handle (also via openany)

@marinegor marinegor requested a review from ljwoods2 November 17, 2025 17:54
@marinegor
Copy link
Contributor Author

not sure who to ping for review -- I feel like @yuxuanzhuang @richardjgowers and @ljwoods2 you've been participating in the process, so you might be able to do it after some months as well? :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

PDBx/mmCIF Reader/Topology Reader

10 participants