diff --git a/README.md b/README.md index ea555ea1..eb16c025 100644 --- a/README.md +++ b/README.md @@ -261,8 +261,9 @@ These are currently used to compute properties of a minimum energy conformation | `Curated tmQM-xtb Dataset: T=100K Dataset Restricted to Pd, Zn, Fe, Cu v0.0` | [2025-03-17-Curated-tmQM-xtb-Dataset-T=100K-Dataset-Restricted-to-Pd-Zn-Fe-Cu-v0.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-03-17-Curated-tmQM-xtb-Dataset-T=100K-Dataset-Restricted-to-Pd-Zn-Fe-Cu-v0.0) | BP86/def2-TZVP Conformers for single metal complexes with Pd, Fe, Zn, Cu, Mg, Li and change of {-1,0,+1} | Br, C, Cl, Cu, F, Fe, H, N, O, P, Pd, S, Zn || | `OpenFF Cresset Additional Coverage Hessian v4.0` | [2025-03-31-OpenFF-Cresset-Additional-Coverage-Hessian-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-03-31-OpenFF-Cresset-Additional-Coverage-Hessian-v4.0) | Hessian single points for the final molecules in the [OpenFF Cresset Additional Coverage Optimizations v4.0 dataset](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-03-06-OpenFF-Cresset-Additional-Coverage-Optimizations-v4.0) | O, C, F, S, H, N, Br, Cl || | `OpenFF Optimization Hessians 2019-07 to 2025-03 v4.0` | [2025-04-14-OpenFF-Optimization-Hessians-2019-07-to-2025-03-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-04-14-OpenFF-Optimization-Hessians-2019-07-to-2025-03-v4.0) | Hessian single points for the final molecules in OpenFF optimization datasets from 2019-07 to 2025-03 | S, H, O, Br, F, N, P, Cl, I, C || -| `OpenFF CX3-CX4 singlepoints v4.0"` | [2025-05-21-OpenFF-CX3-CX4-singlepoints-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-05-21-OpenFF-CX3-CX4-singlepoints-v4.0) | Single-points of molecules where Sage 2.2.1 torsions t17 and t18 have been driven | Br, C, Cl, F, H, I, N, O, S || +| `OpenFF CX3-CX4 
singlepoints v4.0` | [2025-05-21-OpenFF-CX3-CX4-singlepoints-v4.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-05-21-OpenFF-CX3-CX4-singlepoints-v4.0) | Single-points of molecules where Sage 2.2.1 torsions t17 and t18 have been driven | Br, C, Cl, F, H, I, N, O, S || |`MLPepper RECAP Optimized Fragments v1.1`| [2025-07-01-MLPepper-RECAP-Optimized-Fragments-v1.1](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-07-01-MLPepper-RECAP-Optimized-Fragments-v1.1) | Single point property calculations for charge models, expanded to include iodine | P ,B ,Cl ,Br ,C ,H ,I ,F ,O ,N ,Si ,S | | +| `tmQM xtb Dataset T=100K low-mw high-coordinate mult=1 v0.0` | [2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0](https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0) | BP86/def2-TZVP Conformers for single metal complexes with Pd, Fe, Zn, Cu, and charge of {-1,0,+1} and multiplicity of 1. MW <= 600 Da, generally high coordinate, and a max of 31 geometry samples | Br, C, Cl, Cu, F, Fe, H, N, O, P, Pd, S, Zn || diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/README.md b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/README.md new file mode 100644 index 00000000..8005942c --- /dev/null +++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/README.md @@ -0,0 +1,61 @@ +# tmQM xtb Dataset T=100K low-mw high-coordinate mult=1 v0.0 + +### Description + +This dataset was generated starting from an adaptation of the tmQM dataset (https://zenodo.org/records/17042449). +This dataset contains 10,235 unique systems with 306,993 total configurations / spin states below 600 Da. 
The molecules are +limited to the transition metals Pd, Zn, Fe, or Cu, and otherwise contain only the elements Br, C, Cl, F, H, N, O, P, +and S, with charges {-1,0,+1}. The metal is restricted to more than three coordination sites for Pd, more than four for Fe, +and more than one for Cu and Zn. Each molecule was preprocessed using gfn2-xtb, and then a short MD simulation was +performed to provide a maximum of 30 off-optimum configurations in addition to the minimized geometry per molecule at +a multiplicity of 1. This singlepoint dataset was then run with BP86/def2-TZVP on those geometries from molecular +dynamics using gfn-xtb. Each configuration is reported with the following properties: 'energy', 'gradient', 'dipole', 'quadrupole', +'wiberg_lowdin_indices', 'mayer_indices', 'lowdin_charges', 'dipole_polarizabilities', 'mulliken_charges'. SMILES +strings were generated from tmos (https://github.com/openforcefield/tmos) when possible. These SMILES strings can be +imported into RDKit for initial visualization, but will not reflect the coordinate geometries presented from tmQM. + +### General Information + +- Date: 2025-08-14 +- Purpose: BP86/def2-TZVP Conformers for single metal complexes with Pd, Fe, Zn, Cu, and charge of {-1,0,+1} and multiplicity of 1. MW <= 600 Da, generally high coordinate, and a max of 31 geometry samples +- Dataset Type: singlepoint +- Name: tmQM xtb Dataset T=100K low-mw high-coordinate mult=1 v0.0 +- Number of unique molecules: 10,235 +- Number of filtered molecules: 0 +- Number of Conformers: 306,993 +- Number of conformers (min mean max): 3, 30, 31 +- Molecular Weight (min mean max): 95 462 600 +- Set of charges: -1.0, 0.0, 1.0 +- Dataset Submitter: Jennifer A. Clark +- Dataset Curator: Christopher R. Iacovella + +### QCSubmit generation pipeline + +- `generate_dataset.ipynb`: A Python notebook which shows how the dataset was prepared from the input files. 
+ +### QCSubmit Manifest + +- `generate_dataset.ipynb` +- `environment.yml`: Conda environment file to perform this workflow +- `environment_full.yaml`: All installed packages with versions for successful completion of this workflow +- `scaffold.json.bz2`: A compressed JSON file of the target dataset + +### Metadata + +* Elements: {'Br', 'C', 'Cl', 'Cu', 'F', 'Fe', 'H', 'N', 'O', 'P', 'Pd', 'S', 'Zn'} +* QC Specifications: BP86/def2-TZVP + * program: psi4 + * method: BP86 + * basis: def2-TZVP + * driver: gradient + * implicit_solvent: None + * keywords: {} + * maxiter: 500 + * SCF Properties: + * dipole + * quadrupole + * wiberg_lowdin_indices + * mayer_indices + * lowdin_charges + * dipole_polarizabilities + * mulliken_charges \ No newline at end of file diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment.yml b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment.yml new file mode 100644 index 00000000..68b40750 --- /dev/null +++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment.yml @@ -0,0 +1,22 @@ +name: qca-clean +channels: + - conda-forge +dependencies: + - python=3.11 + - numpy + - jupyter + - pandas + - h5py + - periodictable + - qcportal>=0.61 + - qcfractal>=0.61 + - qcfractalcompute>=0.59 + - rdkit>=2025.3.3 + - openbabel + - deepdiff + - py3Dmol + - scipy + - networkx + - pip: + - git+https://github.com/openforcefield/tmos.git + - git+https://github.com/MDAnalysis/mdanalysis.git@develop#subdirectory=package # delete after 2.10 is released \ No newline at end of file diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment_full.yaml b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment_full.yaml new file mode 100644 index 00000000..a31ee57d --- /dev/null +++ 
b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/environment_full.yaml @@ -0,0 +1,242 @@ +name: qca-clean +channels: + - conda-forge +dependencies: + - anyio=4.10.0=pyhe01879c_0 + - appnope=0.1.4=pyhd8ed1ab_1 + - argon2-cffi=25.1.0=pyhd8ed1ab_0 + - argon2-cffi-bindings=25.1.0=py311h3696347_0 + - arrow=1.3.0=pyhd8ed1ab_1 + - asttokens=3.0.0=pyhd8ed1ab_1 + - async-lru=2.0.5=pyh29332c3_0 + - attrs=25.3.0=pyh71513ae_0 + - babel=2.17.0=pyhd8ed1ab_0 + - beautifulsoup4=4.13.4=pyha770c72_0 + - bleach=6.2.0=pyh29332c3_4 + - bleach-with-css=6.2.0=h82add2a_4 + - brotli=1.1.0=h5505292_3 + - brotli-bin=1.1.0=h5505292_3 + - brotli-python=1.1.0=py311h155a34a_3 + - bzip2=1.0.8=h99b78c6_7 + - c-ares=1.34.5=h5505292_0 + - ca-certificates=2025.8.3=hbd8a1cb_0 + - cached-property=1.5.2=hd8ed1ab_1 + - cached_property=1.5.2=pyha770c72_1 + - cairo=1.18.4=h6a3b0d2_0 + - certifi=2025.8.3=pyhd8ed1ab_0 + - cffi=1.17.1=py311h3a79f62_0 + - chardet=5.2.0=pyhd8ed1ab_3 + - charset-normalizer=3.4.3=pyhd8ed1ab_0 + - comm=0.2.3=pyhe01879c_0 + - contourpy=1.3.3=py311h57a9ea7_1 + - cycler=0.12.1=pyhd8ed1ab_1 + - cyrus-sasl=2.1.28=ha1cbb27_0 + - debugpy=1.8.16=py311ha59bd64_0 + - decorator=5.2.1=pyhd8ed1ab_0 + - deepdiff=8.6.0=pyhe01879c_0 + - defusedxml=0.7.1=pyhd8ed1ab_0 + - exceptiongroup=1.3.0=pyhd8ed1ab_0 + - executing=2.2.0=pyhd8ed1ab_0 + - font-ttf-dejavu-sans-mono=2.37=hab24e00_0 + - font-ttf-inconsolata=3.000=h77eed37_0 + - font-ttf-source-code-pro=2.038=h77eed37_0 + - font-ttf-ubuntu=0.83=h77eed37_3 + - fontconfig=2.15.0=h1383a14_1 + - fonts-conda-ecosystem=1=0 + - fonts-conda-forge=1=0 + - fonttools=4.59.1=py311h2fe624c_0 + - fqdn=1.5.1=pyhd8ed1ab_1 + - freetype=2.13.3=hce30654_1 + - freetype-py=2.3.0=pyhd8ed1ab_0 + - greenlet=3.2.4=py311hf719da1_0 + - h11=0.16.0=pyhd8ed1ab_0 + - h2=4.2.0=pyhd8ed1ab_0 + - h5py=3.14.0=nompi_py311h8470beb_100 + - hdf5=1.14.6=nompi_he65715a_103 + - hpack=4.1.0=pyhd8ed1ab_0 + - httpcore=1.0.9=pyh29332c3_0 + - 
httpx=0.28.1=pyhd8ed1ab_0 + - hyperframe=6.1.0=pyhd8ed1ab_0 + - icu=75.1=hfee45f7_0 + - idna=3.10=pyhd8ed1ab_1 + - importlib-metadata=8.7.0=pyhe01879c_1 + - ipykernel=6.30.1=pyh92f572d_0 + - ipython=9.4.0=pyhfa0c392_0 + - ipython_pygments_lexers=1.1.1=pyhd8ed1ab_0 + - ipywidgets=8.1.7=pyhd8ed1ab_0 + - isoduration=20.11.0=pyhd8ed1ab_1 + - jedi=0.19.2=pyhd8ed1ab_1 + - jinja2=3.1.6=pyhd8ed1ab_0 + - json5=0.12.1=pyhd8ed1ab_0 + - jsonpointer=3.0.0=py311h267d04e_1 + - jsonschema=4.25.0=pyhe01879c_0 + - jsonschema-specifications=2025.4.1=pyh29332c3_0 + - jsonschema-with-format-nongpl=4.25.0=he01879c_0 + - jupyter=1.1.1=pyhd8ed1ab_1 + - jupyter-lsp=2.2.6=pyhe01879c_0 + - jupyter_client=8.6.3=pyhd8ed1ab_1 + - jupyter_console=6.6.3=pyhd8ed1ab_1 + - jupyter_core=5.8.1=pyh31011fe_0 + - jupyter_events=0.12.0=pyh29332c3_0 + - jupyter_server=2.16.0=pyhe01879c_0 + - jupyter_server_terminals=0.5.3=pyhd8ed1ab_1 + - jupyterlab=4.4.6=pyhd8ed1ab_0 + - jupyterlab_pygments=0.3.0=pyhd8ed1ab_2 + - jupyterlab_server=2.27.3=pyhd8ed1ab_1 + - jupyterlab_widgets=3.0.15=pyhd8ed1ab_0 + - kiwisolver=1.4.9=py311h63e5c0c_0 + - krb5=1.21.3=h237132a_0 + - lark=1.2.2=pyhd8ed1ab_1 + - lcms2=2.17=h7eeda09_0 + - lerc=4.0.0=hd64df32_1 + - libaec=1.1.4=h51d1e36_0 + - libblas=3.9.0=34_h10e41b3_openblas + - libboost=1.86.0=hc9fb7c5_3 + - libboost-python=1.86.0=py311h8fc16d6_3 + - libbrotlicommon=1.1.0=h5505292_3 + - libbrotlidec=1.1.0=h5505292_3 + - libbrotlienc=1.1.0=h5505292_3 + - libcblas=3.9.0=34_hb3479ef_openblas + - libcurl=8.14.1=h73640d1_0 + - libcxx=20.1.8=hf598326_1 + - libdeflate=1.24=h5773f1b_0 + - libedit=3.1.20250104=pl5321hafb1f1b_0 + - libev=4.33=h93a5062_2 + - libexpat=2.7.1=hec049ff_0 + - libffi=3.4.6=h1da3d7d_1 + - libfreetype=2.13.3=hce30654_1 + - libfreetype6=2.13.3=h1d14073_1 + - libgfortran=15.1.0=hfdf1602_0 + - libgfortran5=15.1.0=hb74de2c_0 + - libglib=2.84.3=h587fa63_0 + - libiconv=1.18=h23cfdf5_2 + - libintl=0.25.1=h493aca8_0 + - libjpeg-turbo=3.1.0=h5505292_0 + - 
liblapack=3.9.0=34_hc9a63f6_openblas + - liblzma=5.8.1=h39f12f2_2 + - libnghttp2=1.64.0=h6d7220d_0 + - libntlm=1.8=h5505292_0 + - libopenblas=0.3.30=openmp_h60d53f8_1 + - libpng=1.6.50=h280e0eb_1 + - libpq=17.6=h6846fd6_0 + - librdkit=2025.03.5=hafd8b29_0 + - libsodium=1.0.20=h99b78c6_0 + - libsqlite=3.50.3=hf8de324_1 + - libssh2=1.11.1=h1590b86_0 + - libtiff=4.7.0=h025e3ab_6 + - libwebp-base=1.6.0=h07db88b_0 + - libxcb=1.17.0=hdb1d25a_0 + - libxml2=2.13.8=h4a9ca0c_1 + - libzlib=1.3.1=h8359307_2 + - llvm-openmp=20.1.8=hbb9b287_1 + - markupsafe=3.0.2=py311h4921393_1 + - matplotlib-base=3.10.5=py311h66dac5a_0 + - matplotlib-inline=0.1.7=pyhd8ed1ab_1 + - mistune=3.1.3=pyh29332c3_0 + - munkres=1.1.4=pyhd8ed1ab_1 + - nbclient=0.10.2=pyhd8ed1ab_0 + - nbconvert-core=7.16.6=pyh29332c3_0 + - nbformat=5.10.4=pyhd8ed1ab_1 + - ncurses=6.5=h5e97a16_3 + - nest-asyncio=1.6.0=pyhd8ed1ab_1 + - networkx=3.5=pyhe01879c_0 + - notebook=7.4.5=pyhd8ed1ab_0 + - notebook-shim=0.2.4=pyhd8ed1ab_1 + - numpy=2.3.2=py311h0856f98_0 + - openbabel=3.1.1=py311h292ccdb_9 + - openjpeg=2.5.3=h889cd5d_1 + - openldap=2.6.10=hbe55e7a_0 + - openssl=3.5.2=he92f556_0 + - orderly-set=5.5.0=pyhe01879c_0 + - overrides=7.7.0=pyhd8ed1ab_1 + - packaging=25.0=pyh29332c3_1 + - pandas=2.3.1=py311hff7e5bb_0 + - pandocfilters=1.5.0=pyhd8ed1ab_0 + - parso=0.8.4=pyhd8ed1ab_1 + - pcre2=10.45=ha881caa_0 + - periodictable=1.7.1=pyhd8ed1ab_0 + - pexpect=4.9.0=pyhd8ed1ab_1 + - pickleshare=0.7.5=pyhd8ed1ab_1004 + - pillow=11.3.0=py311hb9ba9e9_0 + - pip=25.2=pyh8b19718_0 + - pixman=0.46.4=h81086ad_1 + - platformdirs=4.3.8=pyhe01879c_0 + - prometheus_client=0.22.1=pyhd8ed1ab_0 + - prompt-toolkit=3.0.51=pyha770c72_0 + - prompt_toolkit=3.0.51=hd8ed1ab_0 + - psutil=7.0.0=py311h917b07b_0 + - pthread-stubs=0.4=hd74edd7_1002 + - ptyprocess=0.7.0=pyhd8ed1ab_1 + - pure_eval=0.2.3=pyhd8ed1ab_1 + - py3dmol=2.5.2=pyhd8ed1ab_0 + - pycairo=1.28.0=py311h8a0deb1_0 + - pycparser=2.22=pyh29332c3_1 + - pygments=2.19.2=pyhd8ed1ab_0 + - 
pyobjc-core=11.1=py311hf0763de_0 + - pyobjc-framework-cocoa=11.1=py311hab620ed_0 + - pyparsing=3.2.3=pyhe01879c_2 + - pysocks=1.7.1=pyha55dd90_7 + - python=3.11.13=hc22306f_0_cpython + - python-dateutil=2.9.0.post0=pyhe01879c_2 + - python-fastjsonschema=2.21.1=pyhd8ed1ab_0 + - python-json-logger=2.0.7=pyhd8ed1ab_0 + - python-tzdata=2025.2=pyhd8ed1ab_0 + - python_abi=3.11=8_cp311 + - pytz=2025.2=pyhd8ed1ab_0 + - pyyaml=6.0.2=py311h4921393_2 + - pyzmq=27.0.1=py311h2637eca_0 + - qhull=2020.2=h420ef59_5 + - rdkit=2025.03.5=py311h1da7121_0 + - readline=8.2=h1d1bf99_2 + - referencing=0.36.2=pyh29332c3_0 + - reportlab=4.4.1=py311h917b07b_0 + - requests=2.32.4=pyhd8ed1ab_0 + - rfc3339-validator=0.1.4=pyhd8ed1ab_1 + - rfc3986-validator=0.1.1=pyh9f0ad1d_0 + - rfc3987-syntax=1.1.0=pyhe01879c_1 + - rlpycairo=0.2.0=pyhd8ed1ab_0 + - rpds-py=0.27.0=py311h1c3fc1a_0 + - scipy=1.16.1=py311hffedffa_0 + - send2trash=1.8.3=pyh31c8845_1 + - setuptools=80.9.0=pyhff2d567_0 + - six=1.17.0=pyhe01879c_1 + - sniffio=1.3.1=pyhd8ed1ab_1 + - soupsieve=2.7=pyhd8ed1ab_0 + - sqlalchemy=2.0.43=py311h3696347_0 + - stack_data=0.6.3=pyhd8ed1ab_1 + - terminado=0.18.1=pyh31c8845_0 + - tinycss2=1.4.0=pyhd8ed1ab_0 + - tk=8.6.13=h892fb3f_2 + - tomli=2.2.1=pyhe01879c_2 + - tornado=6.5.2=py311h3696347_0 + - traitlets=5.14.3=pyhd8ed1ab_1 + - types-python-dateutil=2.9.0.20250809=pyhd8ed1ab_0 + - typing-extensions=4.14.1=h4440ef1_0 + - typing_extensions=4.14.1=pyhe01879c_0 + - typing_utils=0.1.0=pyhd8ed1ab_1 + - tzdata=2025b=h78e105d_0 + - unicodedata2=16.0.0=py311h917b07b_0 + - uri-template=1.3.0=pyhd8ed1ab_1 + - urllib3=2.5.0=pyhd8ed1ab_0 + - wcwidth=0.2.13=pyhd8ed1ab_1 + - webcolors=24.11.1=pyhd8ed1ab_0 + - webencodings=0.5.1=pyhd8ed1ab_3 + - websocket-client=1.8.0=pyhd8ed1ab_1 + - wheel=0.45.1=pyhd8ed1ab_1 + - widgetsnbextension=4.0.14=pyhd8ed1ab_0 + - xorg-libxau=1.0.12=h5505292_0 + - xorg-libxdmcp=1.1.5=hd74edd7_0 + - yaml=0.2.5=h925e9cb_3 + - zeromq=4.3.5=hc1bb282_7 + - zipp=3.23.0=pyhd8ed1ab_0 + - 
zstandard=0.23.0=py311h917b07b_2 + - zstd=1.5.7=h6491c7d_2 + - pip: + - MDAnalysis==2.10.0.dev0 + - qcarchivetesting==0.62.post11+g5735f6503 + - qcfractal==0.62.post11+g5735f6503 + - qcfractalcompute==0.62.post11+g5735f6503 + - qcportal==0.62.post11+g5735f6503 + - tmos==1.0.0+33.g2b4f7f8 + +prefix: "/Users/jenniferclark/mamba/envs/qca-clean" diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/generate_dataset.ipynb b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/generate_dataset.ipynb new file mode 100644 index 00000000..ecba9eb2 --- /dev/null +++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/generate_dataset.ipynb @@ -0,0 +1,470 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "f294053bff1a44e29f8c08e8f3b1004b", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import contextlib\n", + "from datetime import date\n", + "from collections import Counter, defaultdict\n", + "import warnings\n", + "\n", + "import periodictable\n", + "import h5py\n", + "import numpy as np\n", + "\n", + "import qcportal\n", + "from qcportal.external import scaffold\n", + "from qcportal.molecules import Molecule\n", + "from qcportal.singlepoint import SinglepointDriver, QCSpecification\n", + "from qcelemental.physical_constants import constants\n", + "\n", + "import tmos\n", + "warnings.filterwarnings(\"ignore\", module=\"tmos\")\n", + "\n", + "ADDRESS = \"https://api.qcarchive.molssi.org:443\"\n", + "#qc_client = qcportal.PortalClient(ADDRESS, cache_dir=\".\")\n", + "from qcfractal.snowflake import FractalSnowflake\n", + "import warnings\n", + "snowflake = FractalSnowflake()\n", + "client = snowflake.client()" + ] + }, + { + "cell_type": "code", + 
"execution_count": 2, + "metadata": {}, + "outputs": [], + "source": [ + "#!aria2c \"https://zenodo.org/records/17042449/files/tmqm_dataset_xtb_T100_raw_ext.hdf5.gz?download=1\"" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Helper Functions" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [], + "source": [ + "def remove_extraneous_dimension(array):\n", + " shape = list(np.shape(array))\n", + " if 1 in shape:\n", + " shape.remove(1)\n", + " return np.array(array).reshape(shape)\n", + "\n", + "def get_symbols(atomic_numbers):\n", + " return [str(periodictable.elements[x])for x in remove_extraneous_dimension(atomic_numbers)]\n", + "\n", + "def get_molecular_formula(atomic_numbers):\n", + " return \"\".join([str(y) for x1, x2 in Counter(get_symbols(atomic_numbers)).items() for y in [x1, x2] if y != 1])\n", + "\n", + "def get_molecular_weight(atomic_numbers):\n", + " return sum(periodictable.elements[x].mass for x in remove_extraneous_dimension(atomic_numbers))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": {}, + "outputs": [], + "source": [ + "def apply_mapping(mapping, input_dict, index=0, lx=None):\n", + "\n", + " output = defaultdict(dict)\n", + " for key, value in mapping.items():\n", + " if isinstance(value, str):\n", + " data = input_dict[value]\n", + "# print(value, np.shape(data), lx)\n", + " if not isinstance(data, str):\n", + " #if key == \"molecular_multiplicity\" and np.size(data) > 1: # sometimes multiplicity is [3,3] for some reason\n", + " # data = data[-1]\n", + " if key == \"geometry\" and np.shape(data)[0] != lx: # update number of frames\n", + " lx = np.shape(data)[0]\n", + " \n", + " if key != \"geometry\":\n", + " data = remove_extraneous_dimension(data)\n", + "\n", + "# print(value, np.shape(data), lx)\n", + " if lx is not None: # and len(np.shape(data)) > 1:\n", + " if len(data) == lx:\n", + " output[key] = data[index]\n", + " continue\n", + 
" else:\n", + " raise ValueError(f\"Expected {lx} configuration, but {len(data)} are present\")\n", + " output[key] = data\n", + " elif isinstance(value, tuple): # function, input pairs\n", + " output[key] = value[0](*(input_dict[k2] for k2 in value[1:]))\n", + " elif isinstance(value, list):\n", + " output[key].update({k2: input_dict[k2] for k2 in value})\n", + " elif isinstance(value, dict):\n", + " output[key].update(apply_mapping(value, input_dict, index=0, lx=None))\n", + " \n", + " return output\n", + " \n", + "def convert_hdf5_group(hdf5_group):\n", + " output = {}\n", + " for key, value in hdf5_group.items():\n", + " if isinstance(value, h5py.Group):\n", + " output[key] = convert_hdf5_group(value)\n", + " elif isinstance(value, h5py.Dataset):\n", + " data = value[()]\n", + " if isinstance(data, np.ndarray):\n", + " output[key] = data\n", + " elif isinstance(data, np.bytes_):\n", + " output[key] = data.decode('utf-8') # Convert to string\n", + " else:\n", + " output[key] = data.item() if isinstance(data, np.generic) else data # Convert NumPy scalars\n", + " else:\n", + " output[key] = value\n", + "\n", + " return output" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Assembled Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [], + "source": [ + "dataset_name = \"tmQM xtb Dataset T=100K low-mw high-coordinate mult=1 v0.0\"\n", + "tagline = \"BP86/def2-TZVP Conformers for single metal complexes with Pd, Fe, Zn, Cu, and change of {-1,0,+1} and multiplicity of 1. MW <= 600 Da, generally high coordinate, and a max of 31 geometry samples\"\n", + "description = (\"\"\"\n", + "This dataset was generated starting from an adaptation of the tmQM dataset (https://zenodo.org/records/17042449). \n", + "This dataset contains 10,235 unique systems with 306,993 total configurations / spin states below 600 Da. 
The molecules are \n", + "limited to containing transition metals Pd, Zn, Fe, or Cu, and also only contain elements Br, C, H, P, S, O, N, F, Cl, \n", + "or Br with charges: {-1,0,+1}. The metal is restricted to greater than three coordination sites for Pd, four for Fe, \n", + "and one for Cu and Zn. Each molecule was preprocessed using gfn2-xtb, and then a short MD simulation\n", + "performed to provide a maximum of 30 off-optimum configurations in addition to the minimized geometry per molecules at \n", + "a multiplicity of 1. This singlepoint dataset was then run with the BP86/def2-TZVP for with those geometries from molecular \n", + "dynamics using gfn-xtb. Each configuration is reported with the following properties: 'energy', 'gradient', 'dipole', 'quadrupole',\n", + "'wiberg_lowdin_indices', 'mayer_indices', 'lowdin_charges' 'dipole_polarizabilities', 'mulliken_charges'. SMILES\n", + "strings where generated from tmos (https://github.com/openforcefield/tmos) when possible. These SMILES strings can be\n", + "imported into RDKit for initial visualization, but will not reflect the coordinate geometries presented from tmQm.\n", + "\"\"\")\n", + "\n", + "dataset = client.add_dataset( # https://docs.qcarchive.molssi.org/user_guide/qcportal_reference.html\n", + " \"singlepoint\", # collection type\n", + " dataset_name, # Dataset name\n", + " tagline=tagline,\n", + " description=description,\n", + " tags=[\"openff\"],\n", + " provenance={\n", + " \"qcportal\": qcportal.__version__,\n", + " },\n", + " default_tag=\"openff\",\n", + " extras={\n", + " \"submitter\": \"jaclark5\",\n", + " \"creation_date\": date.today(),\n", + " 'collection_type': 'SinglepointDataset',\n", + " \"long_description\": description,\n", + " 'long_description_url': f'https://github.com/openforcefield/qca-dataset-submission/tree/master/submissions/{dataset_name.replace(\" \", \"-\")}',\n", + " \"short_description\": tagline,\n", + " \"dataset_name\": dataset_name,\n", + " },\n", + ")" + ] + }, 
+ { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/Users/jenniferclark/bin/mdanalysis/package/MDAnalysis/converters/RDKitInferring.py:109: UserWarning: '_MDAnalysis_index' not available on the input mol atoms, skipping reordering of atoms.\n", + " warnings.warn(\n", + "/Users/jenniferclark/bin/mdanalysis/package/MDAnalysis/converters/RDKitInferring.py:633: UserWarning: The standardization could not be completed within a reasonable number of iterations\n", + " warnings.warn(\n" + ] + } + ], + "source": [ + "METALS_SYMBOLS = [periodictable.elements[x].symbol for x in tmos.reference_values.METALS_NUM]\n", + "\n", + "hdf5_mapping = {\n", + " \"symbols\": (get_symbols, \"atomic_numbers\"), \n", + " \"geometry\": \"geometry\",\n", + " \"molecular_charge\": \"total_charge\",\n", + " \"molecular_multiplicity\": \"spin_multiplicity\",\n", + " \"identifiers\": {\"molecular_formula\": (get_molecular_formula, \"atomic_numbers\"),},\n", + " \"extras\": {'molecular_weight': (get_molecular_weight, \"atomic_numbers\")},\n", + "}\n", + "\n", + "elements, molecular_weights, charges = [], [], []\n", + "conformers = Counter()\n", + "count_molecules = 0\n", + "\n", + "errors = defaultdict(list)\n", + "errors_mult = []\n", + "errors_misc = defaultdict(lambda: defaultdict(list))\n", + "failed_metals = defaultdict(lambda: 0)\n", + "count_no = 0\n", + "\n", + "mult = 1\n", + "hdf5 = h5py.File(f\"tmqm_dataset_xtb_T100_raw_ext.hdf5\", 'r')\n", + "for ii, (label, mol_hdf5) in enumerate(hdf5.items()):\n", + " if f\"sm{mult}\" not in label:\n", + " continue\n", + "\n", + " mol_dict = convert_hdf5_group(mol_hdf5)\n", + " lx = mol_dict[\"n_configs\"]\n", + " \n", + " ## Decide to filter\n", + " try:\n", + " input = apply_mapping(hdf5_mapping, mol_dict, index=0, lx=lx-1)\n", + " except Exception as e:\n", + " continue\n", + " input[\"geometry\"] *= 10 # Convert from nm to Angstroms\n", + " \n", + " 
rdmol_draft = tmos.build_rdmol.xyz_to_rdkit(input[\"symbols\"], input[\"geometry\"], ignore_scale=True)\n", + " tm_idx = tmos.find_metal_index(rdmol_draft)\n", + " _, n_bonds = tmos.geometry.get_geometry_from_mol(rdmol_draft, tm_idx)\n", + " if (\n", + " (n_bonds < 5 and rdmol_draft.GetAtoms()[tm_idx].GetSymbol() == \"Fe\") or\n", + " (n_bonds < 4 and rdmol_draft.GetAtoms()[tm_idx].GetSymbol() == \"Pd\") or \n", + " n_bonds < 2\n", + " ):\n", + " count_no += 1\n", + " continue\n", + " try:\n", + " result = tmos.sanitize_complex(rdmol_draft, value_missing_coord=np.nan)\n", + " except Exception as e:\n", + " metal = [x for x in input['symbols'] if x in METALS_SYMBOLS][0]\n", + " errors[str(e)[:40].strip()].append([label, metal, e, tmos.utils.first_traceback()])\n", + " result = None\n", + " ## Import conformers\n", + " for i in range(lx-1):\n", + " # Get values from HDF5\n", + " qc_input = apply_mapping(hdf5_mapping, mol_dict, index=i, lx=lx-1)\n", + " qc_input[\"geometry\"] *= 10 / constants.bohr2angstroms # Convert from nm to Bohr (a0)\n", + " if result is not None:\n", + " for key in result.keys():\n", + " if result[key][\"complex_info\"][\"total_charge\"] == qc_input[\"molecular_charge\"]:\n", + " qc_input[\"identifiers\"][\"smiles\"] = result[key][\"complex_info\"][\"smiles\"]\n", + " \n", + " try:\n", + " molecule = Molecule(\n", + " name=label,\n", + " fix_com=True,\n", + " fix_orientation=True,\n", + " fix_symmetry=\"c1\",\n", + " comment=\"Molecule coordinates taken from tmQM and SMILES from tmos\",\n", + " **qc_input\n", + " )\n", + " dataset.add_entry(name=label+f\"_{i}\", molecule=molecule)\n", + " count_molecules += 1\n", + " conformers[label[:-4]] += 1\n", + " except Exception as e:\n", + " if \"Inconsistent or unspecified chg/mult\" in str(e):\n", + " errors_mult.append(label)\n", + " else:\n", + " errors_misc[str(e)[:30]][label].append([i, str(e)])\n", + " continue\n", + "\n", + " elements.extend(list(set(qc_input['symbols'])))\n", + " 
molecular_weights.append(qc_input['extras'][\"molecular_weight\"])\n", + " charges.append(qc_input[\"molecular_charge\"])\n", + "\n", + "dataset.extras[\"elements\"] = sorted(list(set(elements)))\n", + " " + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Number of molecules removed for solvent assessment: 3134\n", + "Number of molecules removed for unspecified chg/mult: 0\n", + "Number of conformers accepted: 306993\n" + ] + } + ], + "source": [ + "print(f\"Number of molecules removed for solvent assessment: {count_no}\")\n", + "print(f\"Number of molecules removed for unspecified chg/mult: {len(errors_mult)}\")\n", + "print(f\"Number of conformers accepted: {len(dataset.entry_names)}\")" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "306993 conformers were imported.\n", + "\n", + "The following errors DO remove molecules from the dataset:\n", + "\n", + "There were 269 molecules of 306993 that failed to create SMILES.\n" + ] + } + ], + "source": [ + "print(f\"{len(dataset.entry_names)} conformers were imported.\")\n", + "\n", + "print(\"\\nThe following errors DO remove molecules from the dataset:\")\n", + "for err, values in errors_misc.items():\n", + " print(f\" {len(values)}: '{err}'\")\n", + "\n", + "print(f\"\\nThere were {sum([len(x) for x in errors.values()])} molecules of {len(dataset.entry_names)} that failed to create SMILES.\")" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[]" + ] + }, + "execution_count": 9, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "errors_misc[\"SinglepointDataset.add_entry()\"]['ABACAL_sm1']" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": 
{ + "text/plain": [ + "InsertMetadata(error_description=None, errors=[], inserted_idx=[0], existing_idx=[])" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "spec = QCSpecification(\n", + " program='psi4',\n", + " driver=SinglepointDriver.gradient,\n", + " method='BP86',\n", + " basis='def2-TZVP',\n", + " keywords={\n", + " 'maxiter': 500, \n", + " 'scf_properties': ['dipole', 'quadrupole', 'wiberg_lowdin_indices', 'mayer_indices', 'lowdin_charges', 'mulliken_charges'],\n", + " 'function_kwargs': {'properties': ['dipole_polarizabilities']},\n", + " },\n", + " protocols={'wavefunction': 'none'}\n", + " )\n", + "dataset.add_specification(name=\"BP86/def2-TZVP\", specification=spec)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "scaffold.to_json(dataset, compress=True)\n", + "#dataset.submit()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Make Outputs" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Elements: ['Br', 'C', 'Cl', 'Cu', 'F', 'Fe', 'H', 'N', 'O', 'P', 'Pd', 'S', 'Zn']\n", + "Charges: [-1.0, 0.0, 1.0]\n", + "Molecular Weight (min mean max): 95 462 600\n", + "Number of Molecules: 10235\n", + "Number of Conformers: 306993\n", + "Number of conformers (min mean max): 3 30 31\n" + ] + } + ], + "source": [ + "print(\"Elements:\", dataset.extras[\"elements\"])\n", + "print(\"Charges:\", sorted(set(charges)))\n", + "print(\"Molecular Weight (min mean max):\", int(np.min(molecular_weights)), int(np.mean(molecular_weights)), int(np.max(molecular_weights)))\n", + " \n", + "print(\"Number of Molecules:\", len(conformers))\n", + "print(\"Number of Conformers:\", sum(conformers.values()))\n", + "n_conformers = np.array(list(conformers.values())) + 1\n", + "print(\"Number of conformers (min mean 
max):\", int(np.min(n_conformers)), int(np.mean(n_conformers)), int(np.max(n_conformers)))" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "qca", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.12" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/scaffold.json.bz2 b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/scaffold.json.bz2 new file mode 100644 index 00000000..68f8b4bd --- /dev/null +++ b/submissions/2025-08-14-tmQM-xtb-Dataset-T=100K-low-mw-high-coordinate-mult=1-v0.0/scaffold.json.bz2 @@ -0,0 +1,3 @@ +version https://git-lfs.github.com/spec/v1 +oid sha256:11a906e153914624a99169856c325bfcb28209c9ccb4fbd25a023c6efdc4c400 +size 229164257