Conversation

jaclark5
Collaborator

New Submission Checklist

  • Created a new folder in the submissions directory containing the dataset
  • Added README.md describing the dataset (see here for examples)
  • All files used to produce the dataset are included with a description
  • [NA] Dataset follows the QCSubmit schema defined for Datasets, OptimizationDatasets and TorsionDriveDatasets
  • Dataset filename matches pattern dataset*.json; may feature a compression extension, such as .bz2
  • [NA] A PDF depicting the molecules is attached, in the case of torsiondrives this should include the highlighting of the central bond, this can be done automatically using qcsubmit.
  • QCSubmit validation passed
  • Made a new dataset entry in the mapping table in repository README.md
  • Ready to submit!

@jaclark5 jaclark5 requested a review from lilyminium August 22, 2025 12:08
@jaclark5
Collaborator Author

@chrisiacovella before we merge this I'll need links to the source HDF5 for full posterity. I think there's enough here for @lilyminium to take a pass at reviewing, with the knowledge that links to the files will be provided.

@openff-dangerbot
Contributor

QCSubmit Validation Report

submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-Pd-Zn-Fe-Cu-low-mw-v0.0/scaffold.json.bz2
Dataset Name tmQM xtb Dataset T=100K Pd Zn Fe Cu low mw v0.0
Dataset Type singlepoint
Elements F, N, Cl, Cu, Pd, S, P, H, Fe, Zn, Br, C, O
Valid Cmiles
Connected Dihedrals 🔥
No Linear Torsions 🔥
No Molecular Complexes
Valid Constraints
Total Charge 🔥
Valid Coordinates 🔥
Complete Metadata

QC Specification Report

submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-Pd-Zn-Fe-Cu-low-mw-v0.0/scaffold.json.bz2/BP86/def2-TZVP
Specification Name BP86/def2-TZVP
Method bp86
Basis def2-tzvp
Wavefunction Protocol
Implicit Solvent
Keywords {"maxiter": 500, "scf_properties": ["dipole", "quadrupole", "wiberg_lowdin_indices", "mayer_indices", "lowdin_charges", "mulliken_charges"], "function_kwargs": {"properties": ["dipole_polarizabilities"]}}
Validated 🔥
Valid SCF Properties 🔥
Full Basis Coverage 🔥
QCSubmit version information
version
openff.qcsubmit 0.56.0
openff.toolkit 0.17.0
basis_set_exchange 0.11
qcelemental 0.28.0
rdkit 2025.03.5

@openff-dangerbot
Contributor

QCSubmit Validation Report

submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-Pd-Zn-Fe-Cu-low-mw-v0.0/scaffold.json.bz2
Dataset Name tmQM xtb Dataset T=100K Pd Zn Fe Cu low mw v0.0
Dataset Type singlepoint
Elements F, N, Cl, Cu, Pd, S, P, H, Fe, Zn, Br, C, O
Valid Cmiles
Connected Dihedrals 🔥
No Linear Torsions 🔥
No Molecular Complexes
Valid Constraints
Total Charge 🔥
Valid Coordinates 🔥
Complete Metadata 🔥

QC Specification Report

submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-Pd-Zn-Fe-Cu-low-mw-v0.0/scaffold.json.bz2/BP86/def2-TZVP
Specification Name BP86/def2-TZVP
Method bp86
Basis def2-tzvp
Wavefunction Protocol
Implicit Solvent
Keywords {"maxiter": 500, "scf_properties": ["dipole", "quadrupole", "wiberg_lowdin_indices", "mayer_indices", "lowdin_charges", "mulliken_charges"], "function_kwargs": {"properties": ["dipole_polarizabilities"]}}
Validated 🔥
Valid SCF Properties 🔥
Full Basis Coverage 🔥
QCSubmit version information
version
openff.qcsubmit 0.56.0
openff.toolkit 0.17.0
basis_set_exchange 0.11
qcelemental 0.28.0
rdkit 2025.03.5

Contributor

@lilyminium lilyminium left a comment

Mostly LGTM, thanks Jen! My only question is why there are 471097 conformers but 480339 entry names in the dataset? Are there duplicates?

Approving so as to not block if I've missed something obvious.

In the future could you please add a semicolon to suppress output when adding entries -- 400k integers adds a lot of scrolling to the notebook!
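For reference, the semicolon suggestion can be sketched as follows. This is a minimal illustration, not the submission notebook's code: `ToyDataset` and `add_entry` are hypothetical stand-ins for whatever call returned an integer per entry. In a Jupyter/IPython cell, the value of the last expression is echoed unless the line ends with a semicolon.

```python
class ToyDataset:
    """Hypothetical stand-in for a dataset whose add call returns an integer."""

    def __init__(self):
        self.entries = {}

    def add_entry(self, name, payload):
        self.entries[name] = payload
        return len(self.entries)  # an integer return value, echoed by a notebook


dataset = ToyDataset()
records = {f"mol-{i}": i for i in range(5)}  # toy data, assumed for illustration

# As the last expression in a notebook cell, this comprehension would display
# the full list of returned integers. The trailing semicolon suppresses that.
[dataset.add_entry(name, payload) for name, payload in records.items()];

# A plain loop is an alternative: statements produce no cell output at all.
for name, payload in records.items():
    dataset.add_entry(name, payload)
```

Outside a notebook the semicolon is simply an empty statement separator, so the same cell runs unchanged as a script.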

Edit: also, why is CI failing on cmiles?

@jaclark5
Collaborator Author

jaclark5 commented Aug 25, 2025

> Mostly LGTM, thanks Jen! My only question is why there are 471097 conformers but 480339 entry names in the dataset? Are there duplicates?

@lilyminium There are duplicate initial coordinates, but at different multiplicities. I suppose those will converge to different geometries, so I could report it as 480,339 conformers.
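The gap between the two counts can be illustrated with a small sketch. The records, names, and the `geometry_key` helper below are invented for illustration and are not taken from the dataset: counting unique starting geometries merges entries that share coordinates, while counting unique (geometry, multiplicity) pairs keeps them separate.

```python
# Toy records: (entry name, spin multiplicity, Cartesian coordinates).
# All values are made up for illustration.
records = [
    ("mol-a-mult1", 1, ((0.0, 0.0, 0.0), (0.0, 0.0, 1.1))),
    ("mol-a-mult3", 3, ((0.0, 0.0, 0.0), (0.0, 0.0, 1.1))),  # same geometry, new multiplicity
    ("mol-b-mult1", 1, ((0.0, 0.0, 0.0), (0.0, 0.0, 1.5))),
]


def geometry_key(coords, decimals=6):
    """Flatten and round coordinates so identical geometries compare equal."""
    return tuple(round(x, decimals) for atom in coords for x in atom)


n_entries = len({name for name, _, _ in records})
n_unique_geometries = len({geometry_key(c) for _, _, c in records})
n_geometry_multiplicity_pairs = len({(geometry_key(c), m) for _, m, c in records})

print(n_entries, n_unique_geometries, n_geometry_multiplicity_pairs)
```

In this toy case the entry count matches the number of (geometry, multiplicity) pairs, and only the count of unique geometries is smaller.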

> In the future could you please add a semicolon to suppress output when adding entries -- 400k integers adds a lot of scrolling to the notebook!

Thanks for pointing this out! I didn't see that locally or in the "Files changed" view on GitHub. I'll be sure to look at the raw file in the future.

> Edit: also, why is CI failing on cmiles?

It's because TMOS is not always able to make a valid SMILES string, so those entries have no SMILES, which causes the validation error.
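A quick way to see how many entries are affected is to list those with no SMILES attached. The sketch below uses a toy dictionary with a hypothetical `"smiles"` key; the real dataset's attribute names and structure are not shown in this thread and may differ.

```python
# Toy entries; the "smiles" key and all values are assumptions for illustration.
entries = {
    "complex-001": {"smiles": "[Zn+2]"},
    "complex-002": {"smiles": None},  # no valid SMILES could be generated
    "complex-003": {},                # attribute missing entirely
}

# Collect entry names whose SMILES is absent or empty.
missing_smiles = sorted(
    name for name, attributes in entries.items() if not attributes.get("smiles")
)

print(f"{len(missing_smiles)} of {len(entries)} entries lack a SMILES string")
```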

@lilyminium
Contributor

I see -- thanks! Yeah I think 480339 is clearer and lets users know what dataset size to expect.

3 participants