diff --git a/README.md b/README.md
index ea555ea1..eb16c025 100644
--- a/README.md
+++ b/README.md
@@ -261,8 +261,9 @@ These are currently used to compute properties of a minimum energy conformation
 | `Curated tmQM-xtb Dataset: T=100K Dataset Restricted to Pd, Zn, Fe, Cu v0.0` | [2025-03-17-Curated-tmQM-xtb-Dataset-T=100K-Dataset-Restricted-to-Pd-Zn-Fe-Cu-v0.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-03-17-Curated-tmQM-xtb-Dataset-T=100K-Dataset-Restricted-to-Pd-Zn-Fe-Cu-v0.0) | BP86/def2-TZVP Conformers for single metal complexes with Pd, Fe, Zn, Cu, Mg, Li and change of {-1,0,+1} |  Br, C, Cl, Cu, F, Fe, H, N, O, P, Pd, S, Zn ||
 | `OpenFF Cresset Additional Coverage Hessian v4.0` | [2025-03-31-OpenFF-Cresset-Additional-Coverage-Hessian-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-03-31-OpenFF-Cresset-Additional-Coverage-Hessian-v4.0) | Hessian single points for the final molecules in the [OpenFF Cresset Additional Coverage Optimizations v4.0 dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-03-06-OpenFF-Cresset-Additional-Coverage-Optimizations-v4.0) |  O, C, F, S, H, N, Br, Cl ||
 | `OpenFF Optimization Hessians 2019-07 to 2025-03 v4.0` | [2025-04-14-OpenFF-Optimization-Hessians-2019-07-to-2025-03-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-04-14-OpenFF-Optimization-Hessians-2019-07-to-2025-03-v4.0) | Hessian single points for the final molecules in OpenFF optimization datasets from 2019-07 to 2025-03 |  S, H, O, Br, F, N, P, Cl, I, C ||
-| `OpenFF CX3-CX4 singlepoints v4.0"` | [2025-05-21-OpenFF-CX3-CX4-singlepoints-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-05-21-OpenFF-CX3-CX4-singlepoints-v4.0) | Single-points of molecules where Sage 2.2.1 torsions t17 and t18 have been driven |  Br, C, Cl, F, H, I, N, O, S ||
+| `OpenFF CX3-CX4 singlepoints v4.0` | [2025-05-21-OpenFF-CX3-CX4-singlepoints-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-05-21-OpenFF-CX3-CX4-singlepoints-v4.0) | Single-points of molecules where Sage 2.2.1 torsions t17 and t18 have been driven |  Br, C, Cl, F, H, I, N, O, S ||
 |`MLPepper RECAP Optimized Fragments v1.1`| [2025-07-01-MLPepper-RECAP-Optimized-Fragments-v1.1](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-07-01-MLPepper-RECAP-Optimized-Fragments-v1.1) | Single point property calculations for charge models, expanded to include iodine | P ,B ,Cl ,Br ,C ,H ,I ,F ,O ,N ,Si ,S | |
+| `tmQM xtb Dataset T=100K low-mw high-coordinate mult=1 v0.0` | [2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0) | BP86/def2-TZVP Conformers for single metal complexes with Pd, Fe, Zn, Cu, and charge of {-1,0,+1} and multiplicity of 1. MW <= 600 Da, generally high coordinate and a max of 31 geometry samples |  Br, C, Cl, Cu, F, Fe, H, N, O, P, Pd, S, Zn ||
 
 
 
diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/README.md b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/README.md
new file mode 100644
index 00000000..8005942c
--- /dev/null
+++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/README.md
@@ -0,0 +1,61 @@
+# tmQM xtb Dataset T=100K low-mw high-coordinate mult=1 v0.0
+
+### Description
+
+This dataset was generated starting from an adaptation of the tmQM dataset (https://zenodo.org/records/17042449).
+This dataset contains 10,235 unique systems with 306,993 total configurations / spin states below 600 Da.  The molecules are
+limited to those containing the transition metals Pd, Zn, Fe, or Cu, and otherwise only contain the elements Br, C, H, P, S, O, N, F,
+or Cl, with charges: {-1,0,+1}. The metal is restricted to greater than three coordination sites for Pd, four for Fe,
+and one for Cu and Zn. Each molecule was preprocessed using gfn2-xtb, and then a short MD simulation was
+performed to provide a maximum of 30 off-optimum configurations in addition to the minimized geometry per molecule at
+a multiplicity of 1. This singlepoint dataset was then run with BP86/def2-TZVP on those geometries from molecular
+dynamics using gfn-xtb. Each configuration is reported with the following properties: 'energy', 'gradient', 'dipole', 'quadrupole',
+'wiberg_lowdin_indices', 'mayer_indices', 'lowdin_charges', 'dipole_polarizabilities', 'mulliken_charges'. SMILES
+strings were generated from tmos (https://github.com/openforcefield/tmos) when possible. These SMILES strings can be
+imported into RDKit for initial visualization, but will not reflect the coordinate geometries presented from tmQM.
+
+### General Information
+
+- Date: 2025-08-14
+- Purpose: BP86/def2-TZVP Conformers for single metal complexes with Pd, Fe, Zn, Cu, and charge of {-1,0,+1} and multiplicity of 1. MW <= 600 Da, generally high coordinate, and a max of 31 geometry samples
+- Dataset Type: singlepoint
+- Name: tmQM xtb Dataset T=100K low-mw high-coordinate mult=1 v0.0
+- Number of unique molecules: 10,235
+- Number of filtered molecules: 0
+- Number of Conformers: 306,993
+- Number of conformers (min mean max): 3, 30, 31
+- Molecular Weight (min mean max): 95 462 600
+- Set of charges: -1.0, 0.0, 1.0
+- Dataset Submitter: Jennifer A. Clark
+- Dataset Curator: Christopher R. Iacovella
+
+### QCSubmit generation pipeline
+
+- `generate_dataset.ipynb`: A Python notebook that shows how the dataset was prepared from the input files.
+
+### QCSubmit Manifest
+
+- `generate_dataset.ipynb`
+- `environment.yml`: Conda environment file to perform this workflow
+- `environment_full.yaml`: All installed packages with versions for successful completion of this workflow
+- `scaffold.json.bz2`: A compressed json file of the target dataset
+ 
+### Metadata
+
+* Elements: {'Br', 'C', 'Cl', 'Cu', 'F', 'Fe', 'H', 'N', 'O', 'P', 'Pd', 'S', 'Zn'}
+* QC Specifications: BP86/def2-TZVP
+  * program: psi4
+  * method: BP86
+  * basis: def2-TZVP
+  * driver: gradient
+  * implicit_solvent: None
+  * keywords: {}
+  * maxiter: 500
+  * SCF Properties:
+    * dipole
+    * quadrupole
+    * wiberg_lowdin_indices
+    * mayer_indices
+    * lowdin_charges
+    * dipole_polarizabilities
+    * mulliken_charges
\ No newline at end of file
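Review note (outside the patch hunks): the coordination restriction stated in the README description matches the thresholds applied in `generate_dataset.ipynb`. The following is a minimal standalone sketch of that rule, assuming the metal symbol and bond count are already known. The names `MIN_BONDS`, `DEFAULT_MIN_BONDS`, and `passes_coordination_filter` are hypothetical; the notebook itself obtains the bond count from tmos.

```python
# Hypothetical helper mirroring the coordination filter described in the README.
# Minimum number of metal-ligand bonds required to keep a complex.
MIN_BONDS = {"Fe": 5, "Pd": 4}  # "greater than four" for Fe, "greater than three" for Pd
DEFAULT_MIN_BONDS = 2           # "greater than one" for Cu and Zn


def passes_coordination_filter(metal: str, n_bonds: int) -> bool:
    """Return True if a complex with this metal and bond count is kept."""
    return n_bonds >= MIN_BONDS.get(metal, DEFAULT_MIN_BONDS)


if __name__ == "__main__":
    for metal, n_bonds in [("Fe", 4), ("Fe", 6), ("Pd", 4), ("Cu", 1), ("Zn", 2)]:
        print(metal, n_bonds, passes_coordination_filter(metal, n_bonds))
```

Keeping the thresholds in a lookup table makes the per-metal limits easy to compare against the prose description.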
diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment.yml b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment.yml
new file mode 100644
index 00000000..68b40750
--- /dev/null
+++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment.yml
@@ -0,0 +1,22 @@
+name: qca-clean
+channels:
+  - conda-forge
+dependencies:
+  - python=3.11
+  - numpy
+  - jupyter
+  - pandas
+  - h5py
+  - periodictable
+  - qcportal>=0.61
+  - qcfractal>=0.61
+  - qcfractalcompute>=0.59
+  - rdkit>=2025.3.3
+  - openbabel
+  - deepdiff
+  - py3Dmol
+  - scipy
+  - networkx
+  - pip:
+      - git+https://github.com/openforcefield/tmos.git
+      - git+https://github.com/MDAnalysis/mdanalysis.git@develop#subdirectory=package # delete after 2.10 is released
\ No newline at end of file
diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment_full.yaml b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment_full.yaml
new file mode 100644
index 00000000..a31ee57d
--- /dev/null
+++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment_full.yaml
@@ -0,0 +1,242 @@
+name: qca-clean
+channels:
+  - conda-forge
+dependencies:
+  - anyio=4.10.0=pyhe01879c_0
+  - appnope=0.1.4=pyhd8ed1ab_1
+  - argon2-cffi=25.1.0=pyhd8ed1ab_0
+  - argon2-cffi-bindings=25.1.0=py311h3696347_0
+  - arrow=1.3.0=pyhd8ed1ab_1
+  - asttokens=3.0.0=pyhd8ed1ab_1
+  - async-lru=2.0.5=pyh29332c3_0
+  - attrs=25.3.0=pyh71513ae_0
+  - babel=2.17.0=pyhd8ed1ab_0
+  - beautifulsoup4=4.13.4=pyha770c72_0
+  - bleach=6.2.0=pyh29332c3_4
+  - bleach-with-css=6.2.0=h82add2a_4
+  - brotli=1.1.0=h5505292_3
+  - brotli-bin=1.1.0=h5505292_3
+  - brotli-python=1.1.0=py311h155a34a_3
+  - bzip2=1.0.8=h99b78c6_7
+  - c-ares=1.34.5=h5505292_0
+  - ca-certificates=2025.8.3=hbd8a1cb_0
+  - cached-property=1.5.2=hd8ed1ab_1
+  - cached_property=1.5.2=pyha770c72_1
+  - cairo=1.18.4=h6a3b0d2_0
+  - certifi=2025.8.3=pyhd8ed1ab_0
+  - cffi=1.17.1=py311h3a79f62_0
+  - chardet=5.2.0=pyhd8ed1ab_3
+  - charset-normalizer=3.4.3=pyhd8ed1ab_0
+  - comm=0.2.3=pyhe01879c_0
+  - contourpy=1.3.3=py311h57a9ea7_1
+  - cycler=0.12.1=pyhd8ed1ab_1
+  - cyrus-sasl=2.1.28=ha1cbb27_0
+  - debugpy=1.8.16=py311ha59bd64_0
+  - decorator=5.2.1=pyhd8ed1ab_0
+  - deepdiff=8.6.0=pyhe01879c_0
+  - defusedxml=0.7.1=pyhd8ed1ab_0
+  - exceptiongroup=1.3.0=pyhd8ed1ab_0
+  - executing=2.2.0=pyhd8ed1ab_0
+  - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
+  - font-ttf-inconsolata=3.000=h77eed37_0
+  - font-ttf-source-code-pro=2.038=h77eed37_0
+  - font-ttf-ubuntu=0.83=h77eed37_3
+  - fontconfig=2.15.0=h1383a14_1
+  - fonts-conda-ecosystem=1=0
+  - fonts-conda-forge=1=0
+  - fonttools=4.59.1=py311h2fe624c_0
+  - fqdn=1.5.1=pyhd8ed1ab_1
+  - freetype=2.13.3=hce30654_1
+  - freetype-py=2.3.0=pyhd8ed1ab_0
+  - greenlet=3.2.4=py311hf719da1_0
+  - h11=0.16.0=pyhd8ed1ab_0
+  - h2=4.2.0=pyhd8ed1ab_0
+  - h5py=3.14.0=nompi_py311h8470beb_100
+  - hdf5=1.14.6=nompi_he65715a_103
+  - hpack=4.1.0=pyhd8ed1ab_0
+  - httpcore=1.0.9=pyh29332c3_0
+  - httpx=0.28.1=pyhd8ed1ab_0
+  - hyperframe=6.1.0=pyhd8ed1ab_0
+  - icu=75.1=hfee45f7_0
+  - idna=3.10=pyhd8ed1ab_1
+  - importlib-metadata=8.7.0=pyhe01879c_1
+  - ipykernel=6.30.1=pyh92f572d_0
+  - ipython=9.4.0=pyhfa0c392_0
+  - ipython_pygments_lexers=1.1.1=pyhd8ed1ab_0
+  - ipywidgets=8.1.7=pyhd8ed1ab_0
+  - isoduration=20.11.0=pyhd8ed1ab_1
+  - jedi=0.19.2=pyhd8ed1ab_1
+  - jinja2=3.1.6=pyhd8ed1ab_0
+  - json5=0.12.1=pyhd8ed1ab_0
+  - jsonpointer=3.0.0=py311h267d04e_1
+  - jsonschema=4.25.0=pyhe01879c_0
+  - jsonschema-specifications=2025.4.1=pyh29332c3_0
+  - jsonschema-with-format-nongpl=4.25.0=he01879c_0
+  - jupyter=1.1.1=pyhd8ed1ab_1
+  - jupyter-lsp=2.2.6=pyhe01879c_0
+  - jupyter_client=8.6.3=pyhd8ed1ab_1
+  - jupyter_console=6.6.3=pyhd8ed1ab_1
+  - jupyter_core=5.8.1=pyh31011fe_0
+  - jupyter_events=0.12.0=pyh29332c3_0
+  - jupyter_server=2.16.0=pyhe01879c_0
+  - jupyter_server_terminals=0.5.3=pyhd8ed1ab_1
+  - jupyterlab=4.4.6=pyhd8ed1ab_0
+  - jupyterlab_pygments=0.3.0=pyhd8ed1ab_2
+  - jupyterlab_server=2.27.3=pyhd8ed1ab_1
+  - jupyterlab_widgets=3.0.15=pyhd8ed1ab_0
+  - kiwisolver=1.4.9=py311h63e5c0c_0
+  - krb5=1.21.3=h237132a_0
+  - lark=1.2.2=pyhd8ed1ab_1
+  - lcms2=2.17=h7eeda09_0
+  - lerc=4.0.0=hd64df32_1
+  - libaec=1.1.4=h51d1e36_0
+  - libblas=3.9.0=34_h10e41b3_openblas
+  - libboost=1.86.0=hc9fb7c5_3
+  - libboost-python=1.86.0=py311h8fc16d6_3
+  - libbrotlicommon=1.1.0=h5505292_3
+  - libbrotlidec=1.1.0=h5505292_3
+  - libbrotlienc=1.1.0=h5505292_3
+  - libcblas=3.9.0=34_hb3479ef_openblas
+  - libcurl=8.14.1=h73640d1_0
+  - libcxx=20.1.8=hf598326_1
+  - libdeflate=1.24=h5773f1b_0
+  - libedit=3.1.20250104=pl5321hafb1f1b_0
+  - libev=4.33=h93a5062_2
+  - libexpat=2.7.1=hec049ff_0
+  - libffi=3.4.6=h1da3d7d_1
+  - libfreetype=2.13.3=hce30654_1
+  - libfreetype6=2.13.3=h1d14073_1
+  - libgfortran=15.1.0=hfdf1602_0
+  - libgfortran5=15.1.0=hb74de2c_0
+  - libglib=2.84.3=h587fa63_0
+  - libiconv=1.18=h23cfdf5_2
+  - libintl=0.25.1=h493aca8_0
+  - libjpeg-turbo=3.1.0=h5505292_0
+  - liblapack=3.9.0=34_hc9a63f6_openblas
+  - liblzma=5.8.1=h39f12f2_2
+  - libnghttp2=1.64.0=h6d7220d_0
+  - libntlm=1.8=h5505292_0
+  - libopenblas=0.3.30=openmp_h60d53f8_1
+  - libpng=1.6.50=h280e0eb_1
+  - libpq=17.6=h6846fd6_0
+  - librdkit=2025.03.5=hafd8b29_0
+  - libsodium=1.0.20=h99b78c6_0
+  - libsqlite=3.50.3=hf8de324_1
+  - libssh2=1.11.1=h1590b86_0
+  - libtiff=4.7.0=h025e3ab_6
+  - libwebp-base=1.6.0=h07db88b_0
+  - libxcb=1.17.0=hdb1d25a_0
+  - libxml2=2.13.8=h4a9ca0c_1
+  - libzlib=1.3.1=h8359307_2
+  - llvm-openmp=20.1.8=hbb9b287_1
+  - markupsafe=3.0.2=py311h4921393_1
+  - matplotlib-base=3.10.5=py311h66dac5a_0
+  - matplotlib-inline=0.1.7=pyhd8ed1ab_1
+  - mistune=3.1.3=pyh29332c3_0
+  - munkres=1.1.4=pyhd8ed1ab_1
+  - nbclient=0.10.2=pyhd8ed1ab_0
+  - nbconvert-core=7.16.6=pyh29332c3_0
+  - nbformat=5.10.4=pyhd8ed1ab_1
+  - ncurses=6.5=h5e97a16_3
+  - nest-asyncio=1.6.0=pyhd8ed1ab_1
+  - networkx=3.5=pyhe01879c_0
+  - notebook=7.4.5=pyhd8ed1ab_0
+  - notebook-shim=0.2.4=pyhd8ed1ab_1
+  - numpy=2.3.2=py311h0856f98_0
+  - openbabel=3.1.1=py311h292ccdb_9
+  - openjpeg=2.5.3=h889cd5d_1
+  - openldap=2.6.10=hbe55e7a_0
+  - openssl=3.5.2=he92f556_0
+  - orderly-set=5.5.0=pyhe01879c_0
+  - overrides=7.7.0=pyhd8ed1ab_1
+  - packaging=25.0=pyh29332c3_1
+  - pandas=2.3.1=py311hff7e5bb_0
+  - pandocfilters=1.5.0=pyhd8ed1ab_0
+  - parso=0.8.4=pyhd8ed1ab_1
+  - pcre2=10.45=ha881caa_0
+  - periodictable=1.7.1=pyhd8ed1ab_0
+  - pexpect=4.9.0=pyhd8ed1ab_1
+  - pickleshare=0.7.5=pyhd8ed1ab_1004
+  - pillow=11.3.0=py311hb9ba9e9_0
+  - pip=25.2=pyh8b19718_0
+  - pixman=0.46.4=h81086ad_1
+  - platformdirs=4.3.8=pyhe01879c_0
+  - prometheus_client=0.22.1=pyhd8ed1ab_0
+  - prompt-toolkit=3.0.51=pyha770c72_0
+  - prompt_toolkit=3.0.51=hd8ed1ab_0
+  - psutil=7.0.0=py311h917b07b_0
+  - pthread-stubs=0.4=hd74edd7_1002
+  - ptyprocess=0.7.0=pyhd8ed1ab_1
+  - pure_eval=0.2.3=pyhd8ed1ab_1
+  - py3dmol=2.5.2=pyhd8ed1ab_0
+  - pycairo=1.28.0=py311h8a0deb1_0
+  - pycparser=2.22=pyh29332c3_1
+  - pygments=2.19.2=pyhd8ed1ab_0
+  - pyobjc-core=11.1=py311hf0763de_0
+  - pyobjc-framework-cocoa=11.1=py311hab620ed_0
+  - pyparsing=3.2.3=pyhe01879c_2
+  - pysocks=1.7.1=pyha55dd90_7
+  - python=3.11.13=hc22306f_0_cpython
+  - python-dateutil=2.9.0.post0=pyhe01879c_2
+  - python-fastjsonschema=2.21.1=pyhd8ed1ab_0
+  - python-json-logger=2.0.7=pyhd8ed1ab_0
+  - python-tzdata=2025.2=pyhd8ed1ab_0
+  - python_abi=3.11=8_cp311
+  - pytz=2025.2=pyhd8ed1ab_0
+  - pyyaml=6.0.2=py311h4921393_2
+  - pyzmq=27.0.1=py311h2637eca_0
+  - qhull=2020.2=h420ef59_5
+  - rdkit=2025.03.5=py311h1da7121_0
+  - readline=8.2=h1d1bf99_2
+  - referencing=0.36.2=pyh29332c3_0
+  - reportlab=4.4.1=py311h917b07b_0
+  - requests=2.32.4=pyhd8ed1ab_0
+  - rfc3339-validator=0.1.4=pyhd8ed1ab_1
+  - rfc3986-validator=0.1.1=pyh9f0ad1d_0
+  - rfc3987-syntax=1.1.0=pyhe01879c_1
+  - rlpycairo=0.2.0=pyhd8ed1ab_0
+  - rpds-py=0.27.0=py311h1c3fc1a_0
+  - scipy=1.16.1=py311hffedffa_0
+  - send2trash=1.8.3=pyh31c8845_1
+  - setuptools=80.9.0=pyhff2d567_0
+  - six=1.17.0=pyhe01879c_1
+  - sniffio=1.3.1=pyhd8ed1ab_1
+  - soupsieve=2.7=pyhd8ed1ab_0
+  - sqlalchemy=2.0.43=py311h3696347_0
+  - stack_data=0.6.3=pyhd8ed1ab_1
+  - terminado=0.18.1=pyh31c8845_0
+  - tinycss2=1.4.0=pyhd8ed1ab_0
+  - tk=8.6.13=h892fb3f_2
+  - tomli=2.2.1=pyhe01879c_2
+  - tornado=6.5.2=py311h3696347_0
+  - traitlets=5.14.3=pyhd8ed1ab_1
+  - types-python-dateutil=2.9.0.20250809=pyhd8ed1ab_0
+  - typing-extensions=4.14.1=h4440ef1_0
+  - typing_extensions=4.14.1=pyhe01879c_0
+  - typing_utils=0.1.0=pyhd8ed1ab_1
+  - tzdata=2025b=h78e105d_0
+  - unicodedata2=16.0.0=py311h917b07b_0
+  - uri-template=1.3.0=pyhd8ed1ab_1
+  - urllib3=2.5.0=pyhd8ed1ab_0
+  - wcwidth=0.2.13=pyhd8ed1ab_1
+  - webcolors=24.11.1=pyhd8ed1ab_0
+  - webencodings=0.5.1=pyhd8ed1ab_3
+  - websocket-client=1.8.0=pyhd8ed1ab_1
+  - wheel=0.45.1=pyhd8ed1ab_1
+  - widgetsnbextension=4.0.14=pyhd8ed1ab_0
+  - xorg-libxau=1.0.12=h5505292_0
+  - xorg-libxdmcp=1.1.5=hd74edd7_0
+  - yaml=0.2.5=h925e9cb_3
+  - zeromq=4.3.5=hc1bb282_7
+  - zipp=3.23.0=pyhd8ed1ab_0
+  - zstandard=0.23.0=py311h917b07b_2
+  - zstd=1.5.7=h6491c7d_2
+  - pip:
+    - MDAnalysis==2.10.0.dev0
+    - qcarchivetesting==0.62.post11+g5735f6503
+    - qcfractal==0.62.post11+g5735f6503
+    - qcfractalcompute==0.62.post11+g5735f6503
+    - qcportal==0.62.post11+g5735f6503
+    - tmos==1.0.0+33.g2b4f7f8
+
+prefix: "/Users/jenniferclark/mamba/envs/qca-clean"
diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/generate_dataset.ipynb b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/generate_dataset.ipynb
new file mode 100644
index 00000000..ecba9eb2
--- /dev/null
+++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/generate_dataset.ipynb
@@ -0,0 +1,470 @@
+{
+ "cells": [
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "application/vnd.jupyter.widget-view+json": {
+       "model_id": "f294053bff1a44e29f8c08e8f3b1004b",
+       "version_major": 2,
+       "version_minor": 0
+      },
+      "text/plain": []
+     },
+     "metadata": {},
+     "output_type": "display_data"
+    }
+   ],
+   "source": [
+    "import contextlib\n",
+    "from datetime import date\n",
+    "from collections import Counter, defaultdict\n",
+    "import warnings\n",
+    "\n",
+    "import periodictable\n",
+    "import h5py\n",
+    "import numpy as np\n",
+    "\n",
+    "import qcportal\n",
+    "from qcportal.external import scaffold\n",
+    "from qcportal.molecules import Molecule\n",
+    "from qcportal.singlepoint import SinglepointDriver, QCSpecification\n",
+    "from qcelemental.physical_constants import constants\n",
+    "\n",
+    "import tmos\n",
+    "warnings.filterwarnings(\"ignore\", module=\"tmos\")\n",
+    "\n",
+    "ADDRESS = \"https://api.qcarchive.molssi.org:443\"\n",
+    "#qc_client = qcportal.PortalClient(ADDRESS, cache_dir=\".\")\n",
+    "from qcfractal.snowflake import FractalSnowflake\n",
+    "# Spin up a temporary local QCFractal server (snowflake) to assemble the dataset before submission\n",
+    "snowflake = FractalSnowflake()\n",
+    "client = snowflake.client()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#!aria2c \"https://zenodo.org/records/17042449/files/tmqm_dataset_xtb_T100_raw_ext.hdf5.gz?download=1\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Helper Functions"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def remove_extraneous_dimension(array):\n",
+    "    shape = list(np.shape(array))\n",
+    "    if 1 in shape:\n",
+    "        shape.remove(1)\n",
+    "    return np.array(array).reshape(shape)\n",
+    "\n",
+    "def get_symbols(atomic_numbers):\n",
+    "    return [str(periodictable.elements[x]) for x in remove_extraneous_dimension(atomic_numbers)]\n",
+    "\n",
+    "def get_molecular_formula(atomic_numbers):\n",
+    "    return \"\".join([str(y) for x1, x2 in Counter(get_symbols(atomic_numbers)).items() for y in [x1, x2] if y != 1])\n",
+    "\n",
+    "def get_molecular_weight(atomic_numbers):\n",
+    "    return sum(periodictable.elements[x].mass for x in remove_extraneous_dimension(atomic_numbers))\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def apply_mapping(mapping, input_dict, index=0, lx=None):\n",
+    "\n",
+    "    output = defaultdict(dict)\n",
+    "    for key, value in mapping.items():\n",
+    "        if isinstance(value, str):\n",
+    "            data = input_dict[value]\n",
+    "#            print(value, np.shape(data), lx)\n",
+    "            if not isinstance(data, str):\n",
+    "                #if key == \"molecular_multiplicity\" and np.size(data) > 1: # sometimes multiplicity is [3,3] for some reason\n",
+    "                #    data = data[-1]\n",
+    "                if key == \"geometry\" and np.shape(data)[0] != lx: # update number of frames\n",
+    "                    lx = np.shape(data)[0]\n",
+    "                \n",
+    "                if key != \"geometry\":\n",
+    "                    data = remove_extraneous_dimension(data)\n",
+    "\n",
+    "#                print(value, np.shape(data), lx)\n",
+    "                if lx is not None: # and len(np.shape(data)) > 1:\n",
+    "                    if len(data) == lx:\n",
+    "                        output[key] = data[index]\n",
+    "                        continue\n",
+    "                    else:\n",
+    "                        raise ValueError(f\"Expected {lx} configurations, but {len(data)} are present\")\n",
+    "            output[key] = data\n",
+    "        elif isinstance(value, tuple): # function, input pairs\n",
+    "            output[key] = value[0](*(input_dict[k2] for k2 in value[1:]))\n",
+    "        elif isinstance(value, list):\n",
+    "            output[key].update({k2: input_dict[k2] for k2 in value})\n",
+    "        elif isinstance(value, dict):\n",
+    "            output[key].update(apply_mapping(value, input_dict, index=0, lx=None))\n",
+    "            \n",
+    "    return output\n",
+    "            \n",
+    "def convert_hdf5_group(hdf5_group):\n",
+    "    output = {}\n",
+    "    for key, value in hdf5_group.items():\n",
+    "        if isinstance(value, h5py.Group):\n",
+    "            output[key] = convert_hdf5_group(value)\n",
+    "        elif isinstance(value, h5py.Dataset):\n",
+    "            data = value[()]\n",
+    "            if isinstance(data, np.ndarray):\n",
+    "                output[key] = data\n",
+    "            elif isinstance(data, np.bytes_):\n",
+    "                output[key] = data.decode('utf-8')  # Convert to string\n",
+    "            else:\n",
+    "                output[key] = data.item() if isinstance(data, np.generic) else data  # Convert NumPy scalars\n",
+    "        else:\n",
+    "            output[key] = value\n",
+    "\n",
+    "    return output"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Assembled Dataset"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset_name = \"tmQM xtb Dataset T=100K low-mw high-coordinate mult=1 v0.0\"\n",
+    "tagline = \"BP86/def2-TZVP Conformers for single metal complexes with Pd, Fe, Zn, Cu, and charge of {-1,0,+1} and multiplicity of 1. MW <= 600 Da, generally high coordinate, and a max of 31 geometry samples\"\n",
+    "description = (\"\"\"\n",
+    "This dataset was generated starting from an adaptation of the tmQM dataset (https://zenodo.org/records/17042449).\n",
+    "This dataset contains 10,235 unique systems with 306,993 total configurations / spin states below 600 Da.  The molecules are\n",
+    "limited to those containing the transition metals Pd, Zn, Fe, or Cu, and otherwise only contain the elements Br, C, H, P, S, O, N, F,\n",
+    "or Cl, with charges: {-1,0,+1}. The metal is restricted to greater than three coordination sites for Pd, four for Fe,\n",
+    "and one for Cu and Zn. Each molecule was preprocessed using gfn2-xtb, and then a short MD simulation was\n",
+    "performed to provide a maximum of 30 off-optimum configurations in addition to the minimized geometry per molecule at\n",
+    "a multiplicity of 1. This singlepoint dataset was then run with BP86/def2-TZVP on those geometries from molecular\n",
+    "dynamics using gfn-xtb. Each configuration is reported with the following properties: 'energy', 'gradient', 'dipole', 'quadrupole',\n",
+    "'wiberg_lowdin_indices', 'mayer_indices', 'lowdin_charges', 'dipole_polarizabilities', 'mulliken_charges'. SMILES\n",
+    "strings were generated from tmos (https://github.com/openforcefield/tmos) when possible. These SMILES strings can be\n",
+    "imported into RDKit for initial visualization, but will not reflect the coordinate geometries presented from tmQM.\n",
+    "\"\"\")\n",
+    "\n",
+    "dataset = client.add_dataset( # https://docs.qcarchive.molssi.org/user_guide/qcportal_reference.html\n",
+    "    \"singlepoint\", # collection type\n",
+    "    dataset_name, # Dataset name\n",
+    "    tagline=tagline,\n",
+    "    description=description,\n",
+    "    tags=[\"openff\"],\n",
+    "    provenance={\n",
+    "        \"qcportal\": qcportal.__version__,\n",
+    "    },\n",
+    "    default_tag=\"openff\",\n",
+    "    extras={\n",
+    "        \"submitter\": \"jaclark5\",\n",
+    "        \"creation_date\": date.today(),\n",
+    "        'collection_type': 'SinglepointDataset',\n",
+    "        \"long_description\": description,\n",
+    "        'long_description_url': f'https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-08-14-{dataset_name.replace(\" \", \"-\")}',\n",
+    "        \"short_description\": tagline,\n",
+    "        \"dataset_name\": dataset_name,\n",
+    "    },\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "/Users/jenniferclark/bin/mdanalysis/package/MDAnalysis/converters/RDKitInferring.py:109: UserWarning: '_MDAnalysis_index' not available on the input mol atoms, skipping reordering of atoms.\n",
+      "  warnings.warn(\n",
+      "/Users/jenniferclark/bin/mdanalysis/package/MDAnalysis/converters/RDKitInferring.py:633: UserWarning: The standardization could not be completed within a reasonable number of iterations\n",
+      "  warnings.warn(\n"
+     ]
+    }
+   ],
+   "source": [
+    "METALS_SYMBOLS = [periodictable.elements[x].symbol for x in tmos.reference_values.METALS_NUM]\n",
+    "\n",
+    "hdf5_mapping = {\n",
+    "    \"symbols\": (get_symbols, \"atomic_numbers\"), \n",
+    "    \"geometry\": \"geometry\",\n",
+    "    \"molecular_charge\": \"total_charge\",\n",
+    "    \"molecular_multiplicity\": \"spin_multiplicity\",\n",
+    "    \"identifiers\": {\"molecular_formula\": (get_molecular_formula, \"atomic_numbers\"),},\n",
+    "    \"extras\": {'molecular_weight': (get_molecular_weight, \"atomic_numbers\")},\n",
+    "}\n",
+    "\n",
+    "elements, molecular_weights, charges = [], [], []\n",
+    "conformers = Counter()\n",
+    "count_molecules = 0\n",
+    "\n",
+    "errors = defaultdict(list)\n",
+    "errors_mult = []\n",
+    "errors_misc = defaultdict(lambda: defaultdict(list))\n",
+    "failed_metals = defaultdict(lambda: 0)\n",
+    "count_no = 0\n",
+    "\n",
+    "mult = 1\n",
+    "hdf5 = h5py.File(\"tmqm_dataset_xtb_T100_raw_ext.hdf5\", 'r')\n",
+    "for ii, (label, mol_hdf5) in enumerate(hdf5.items()):\n",
+    "    if f\"sm{mult}\" not in label:\n",
+    "        continue\n",
+    "\n",
+    "    mol_dict = convert_hdf5_group(mol_hdf5)\n",
+    "    lx = mol_dict[\"n_configs\"]\n",
+    "    \n",
+    "    ## Decide to filter\n",
+    "    try:\n",
+    "        input = apply_mapping(hdf5_mapping, mol_dict, index=0, lx=lx-1)\n",
+    "    except Exception as e:\n",
+    "        continue\n",
+    "    input[\"geometry\"] *= 10 # Convert from nm to Angstroms\n",
+    "    \n",
+    "    rdmol_draft = tmos.build_rdmol.xyz_to_rdkit(input[\"symbols\"], input[\"geometry\"], ignore_scale=True)\n",
+    "    tm_idx = tmos.find_metal_index(rdmol_draft)\n",
+    "    _, n_bonds = tmos.geometry.get_geometry_from_mol(rdmol_draft, tm_idx)\n",
+    "    if (\n",
+    "        (n_bonds < 5 and rdmol_draft.GetAtoms()[tm_idx].GetSymbol() == \"Fe\") or\n",
+    "        (n_bonds < 4 and rdmol_draft.GetAtoms()[tm_idx].GetSymbol() == \"Pd\") or \n",
+    "        n_bonds < 2\n",
+    "    ):\n",
+    "        count_no += 1\n",
+    "        continue\n",
+    "    try:\n",
+    "        result  = tmos.sanitize_complex(rdmol_draft, value_missing_coord=np.nan)\n",
+    "    except Exception as e:\n",
+    "        metal = [x for x in input['symbols'] if x in METALS_SYMBOLS][0]\n",
+    "        errors[str(e)[:40].strip()].append([label, metal, e, tmos.utils.first_traceback()])\n",
+    "        result = None\n",
+    "    ## Import conformers\n",
+    "    for i in range(lx-1):\n",
+    "        # Get values from HDF5\n",
+    "        qc_input = apply_mapping(hdf5_mapping, mol_dict, index=i, lx=lx-1)\n",
+    "        qc_input[\"geometry\"] *= 10 / constants.bohr2angstroms # Convert from nm to Bohr (a0)\n",
+    "        if result is not None:\n",
+    "            for key in result.keys():\n",
+    "                if result[key][\"complex_info\"][\"total_charge\"] == qc_input[\"molecular_charge\"]:\n",
+    "                    qc_input[\"identifiers\"][\"smiles\"] = result[key][\"complex_info\"][\"smiles\"]\n",
+    "    \n",
+    "        try:\n",
+    "            molecule = Molecule(\n",
+    "                name=label,\n",
+    "                fix_com=True,\n",
+    "                fix_orientation=True,\n",
+    "                fix_symmetry=\"c1\",\n",
+    "                comment=\"Molecule coordinates taken from tmQM and SMILES from tmos\",\n",
+    "                **qc_input\n",
+    "            )\n",
+    "            dataset.add_entry(name=label+f\"_{i}\", molecule=molecule)\n",
+    "            count_molecules += 1\n",
+    "            conformers[label[:-4]] += 1\n",
+    "        except Exception as e:\n",
+    "            if \"Inconsistent or unspecified chg/mult\" in str(e):\n",
+    "                errors_mult.append(label)\n",
+    "            else:\n",
+    "                errors_misc[str(e)[:30]][label].append([i, str(e)])\n",
+    "            continue\n",
+    "\n",
+    "        elements.extend(list(set(qc_input['symbols'])))\n",
+    "        molecular_weights.append(qc_input['extras'][\"molecular_weight\"])\n",
+    "        charges.append(qc_input[\"molecular_charge\"])\n",
+    "\n",
+    "dataset.extras[\"elements\"] = sorted(list(set(elements)))\n",
+    "        "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Number of molecules removed by the coordination-number filter: 3134\n",
+      "Number of molecules removed for unspecified chg/mult: 0\n",
+      "Number of conformers accepted: 306993\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"Number of molecules removed by the coordination-number filter: {count_no}\")\n",
+    "print(f\"Number of molecules removed for unspecified chg/mult: {len(errors_mult)}\")\n",
+    "print(f\"Number of conformers accepted: {len(dataset.entry_names)}\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "306993 conformers were imported.\n",
+      "\n",
+      "The following errors DO remove molecules from the dataset:\n",
+      "\n",
+      "There were 269 molecules of 306993 that failed to create SMILES.\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"{len(dataset.entry_names)} conformers were imported.\")\n",
+    "\n",
+    "print(\"\\nThe following errors DO remove molecules from the dataset:\")\n",
+    "for err, values in errors_misc.items():\n",
+    "    print(f\"    {len(values)}: '{err}'\")\n",
+    "\n",
+    "print(f\"\\nThere were {sum([len(x) for x in errors.values()])} molecules of {len(dataset.entry_names)} that failed to create SMILES.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[]"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "errors_misc[\"SinglepointDataset.add_entry()\"]['ABACAL_sm1']"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "InsertMetadata(error_description=None, errors=[], inserted_idx=[0], existing_idx=[])"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "spec = QCSpecification(\n",
+    "        program='psi4',\n",
+    "        driver=SinglepointDriver.gradient,\n",
+    "        method='BP86',\n",
+    "        basis='def2-TZVP',\n",
+    "        keywords={\n",
+    "            'maxiter': 500, \n",
+    "            'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices', 'lowdin_charges', 'mulliken_charges'],\n",
+    "            'function_kwargs': {'properties': ['dipole_polarizabilities']},\n",
+    "        },\n",
+    "        protocols={'wavefunction': 'none'}\n",
+    "    )\n",
+    "dataset.add_specification(name=\"BP86/def2-TZVP\", specification=spec)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "scaffold.to_json(dataset, compress=True)\n",
+    "#dataset.submit()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Make Outputs"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Elements: ['Br', 'C', 'Cl', 'Cu', 'F', 'Fe', 'H', 'N', 'O', 'P', 'Pd', 'S', 'Zn']\n",
+      "Charges: [-1.0, 0.0, 1.0]\n",
+      "Molecular Weight (min mean max): 95 462 600\n",
+      "Number of Molecules: 10235\n",
+      "Number of Conformers: 306993\n",
+      "Number of conformers (min mean max): 3 30 31\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"Elements:\", dataset.extras[\"elements\"])\n",
+    "print(\"Charges:\", sorted(set(charges)))\n",
+    "print(\"Molecular Weight (min mean max):\", int(np.min(molecular_weights)), int(np.mean(molecular_weights)), int(np.max(molecular_weights)))\n",
+    "            \n",
+    "print(\"Number of Molecules:\", len(conformers))\n",
+    "print(\"Number of Conformers:\", sum(conformers.values()))\n",
+    "n_conformers = np.array(list(conformers.values())) + 1\n",
+    "print(\"Number of conformers (min mean max):\", int(np.min(n_conformers)), int(np.mean(n_conformers)), int(np.max(n_conformers)))"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "qca",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.12"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
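Review note (outside the patch hunks): the notebook converts tmQM geometries from nanometers to Bohr with `10 / constants.bohr2angstroms` before building each `Molecule`. A minimal sketch of that conversion follows. It hard-codes the CODATA 2018 Bohr radius as an assumption, whereas the notebook reads the constant from qcelemental, so the last digits may differ slightly. The names `BOHR_TO_ANGSTROM`, `NM_TO_ANGSTROM`, and `nm_to_bohr` are hypothetical.

```python
# Assumed constant: CODATA 2018 Bohr radius in Angstrom. The notebook instead
# reads this value from qcelemental.physical_constants.constants.bohr2angstroms.
BOHR_TO_ANGSTROM = 0.529177210903
NM_TO_ANGSTROM = 10.0


def nm_to_bohr(coords_nm):
    """Convert a list of (x, y, z) coordinates from nanometers to Bohr."""
    factor = NM_TO_ANGSTROM / BOHR_TO_ANGSTROM
    return [tuple(c * factor for c in xyz) for xyz in coords_nm]


if __name__ == "__main__":
    # 0.1 nm is 1 Angstrom, which is roughly 1.89 Bohr.
    print(nm_to_bohr([(0.1, 0.0, 0.0)]))
```

Doing the conversion in one step keeps the filtering stage (which works in Angstrom) separate from the stored geometry (which is in Bohr).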
diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/scaffold.json.bz2 b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/scaffold.json.bz2
new file mode 100644
index 00000000..68f8b4bd
--- /dev/null
+++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/scaffold.json.bz2
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:11a906e153914624a99169856c325bfcb28209c9ccb4fbd25a023c6efdc4c400
+size 229164257
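Review note (outside the patch hunks): the file above is a Git LFS pointer, so the actual `scaffold.json.bz2` has to be fetched (for example with `git lfs pull`) before it can be read. A minimal sketch for inspecting the decompressed JSON follows. It makes no assumption about the internal layout of the scaffold and only reports the top-level keys. The function names `load_scaffold` and `summarize` are hypothetical.

```python
import bz2
import json


def load_scaffold(path):
    """Decompress a bz2-compressed JSON file and return the parsed object."""
    with bz2.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


def summarize(obj):
    """Return the sorted top-level keys for a dict, or the type name otherwise."""
    if isinstance(obj, dict):
        return sorted(obj.keys())
    return type(obj).__name__


# Example usage, assuming the real file has been pulled from LFS:
#   print(summarize(load_scaffold("scaffold.json.bz2")))
```

Note that the pointer reports a compressed size of about 229 MB, so loading the full file into memory this way may be slow.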