Commit e67a7d0

Merge pull request #27 from HzaCode/feat/enhance-get-properties
🚀 ChemInformant v2.4.0 - Core `get_properties()` Enhancement
2 parents 6217455 + d80ae82 commit e67a7d0

23 files changed: +2376 -365 lines changed

CITATION.cff

Lines changed: 1 addition & 1 deletion
@@ -5,7 +5,7 @@ authors:
 given-names: "Zhiang"
 orcid: "https://orcid.org/0009-0009-0171-4578"
 title: "ChemInformant"
-version: 2.2.2
+version: 2.4.0
 date-released: 2025
 url: "https://github.com/HzaCode/ChemInformant"
 license: MIT

README.md

Lines changed: 2 additions & 1 deletion
@@ -99,11 +99,12 @@ print(df)
 | `get_cas(id)` | CAS Registry Number *(str)* |
 | `get_iupac_name(id)` | IUPAC name *(str)* |
 | `get_canonical_smiles(id)` | Canonical SMILES with Canonical→Connectivity fallback *(str)* |
-| `get_isomeric_smiles(id)` | Isomeric SMILES *(str)* |
+| `get_isomeric_smiles(id)` | Isomeric SMILES with Isomeric→SMILES fallback *(str)* |
 | `get_xlogp(id)` | XLogP (calculated hydrophobicity) *(float)* |
 | `get_synonyms(id)` | List of synonyms *(List[str])* |
 | `get_compound(id)` | Full, validated **`Compound`** object (Pydantic v2 model) |
 
+*Note: This table shows key convenience functions for demonstration. ChemInformant provides **22 convenience functions** in total, covering molecular descriptors, mass properties, stereochemistry, and more.*
 
 *All functions accept a **CID, name, or SMILES** and return `None`/`[]` on failure.*
 
benchmark.py

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 
 """
 ChemInformant vs. PubChemPy
-(285 unique drugs × 6 properties)
+(285 unique drugs x 6 properties)
 
 Output:
 • PubChemPy bulk time

docs/source/advanced_usage.rst

Lines changed: 10 additions & 5 deletions
@@ -4,6 +4,9 @@ Application in Real-World Scientific Workflows
 
 The core value of ChemInformant lies in its role as a starting point for data science workflows, seamlessly injecting chemical data into Python's powerful scientific computing ecosystem. This page will demonstrate through three cases that more closely resemble real-world research scenarios how ChemInformant can be combined with advanced libraries like **RDKit**, **Scikit-learn**, and **NetworkX** to accomplish diverse tasks ranging from data preprocessing and multi-class classification to community detection.
 
+.. note::
+All examples use ChemInformant's standardized **snake_case** property names for consistent data handling across workflows.
+
 .. note::
 The examples on this page depend on additional specialized libraries.
 
@@ -100,12 +103,14 @@ We can use the **data obtained from ChemInformant** as features to train a machi
 ids.extend(drugs)
 labels.extend([cls] * len(drugs))
 
-# 2. Use ci to get feature data and calculate TPSA with RDKit
-df_feat = ci.get_properties(ids, ['molecular_weight', 'xlogp', 'isomeric_smiles'])
+# 2. Use ci to get comprehensive feature data efficiently
+# NEW: Using all_properties for comprehensive dataset
+df_feat = ci.get_properties(ids, all_properties=True)
 df_feat_clean = df_feat[df_feat['status'] == 'OK'].copy()
-df_feat_clean['tpsa'] = df_feat_clean['isomeric_smiles'].apply(
-lambda s: rdMolDescriptors.CalcTPSA(Chem.MolFromSmiles(s))
-)
+
+# Extract key features already available from ChemInformant
+features = ['molecular_weight', 'xlogp', 'tpsa', 'h_bond_donor_count',
+'h_bond_acceptor_count', 'rotatable_bond_count']
 
 # 3. Prepare training data and perform cross-validation
 features = ['molecular_weight', 'xlogp', 'tpsa']

docs/source/api/api_helpers.rst

Lines changed: 2 additions & 2 deletions
@@ -1,6 +1,6 @@
-================================
+=====================================
 Internal API Helpers (`api_helpers`)
-================================
+=====================================
 
 .. module:: ChemInformant.api_helpers
 

docs/source/api/cheminfo_api.rst

Lines changed: 36 additions & 2 deletions
@@ -1,6 +1,6 @@
-=============================
+====================================
 Main API Interface (`cheminfo_api`)
-=============================
+====================================
 
 .. module:: ChemInformant.cheminfo_api
 
@@ -21,13 +21,47 @@ This module is the main entry point for all user interactions. It is designed ar
 
 .. rubric:: Convenience Lookups
 
+**Basic Properties**
+
 .. autofunction:: get_weight
 .. autofunction:: get_formula
 .. autofunction:: get_cas
 .. autofunction:: get_iupac_name
+
+**SMILES and Identifiers**
+
 .. autofunction:: get_canonical_smiles
 .. autofunction:: get_isomeric_smiles
+.. autofunction:: get_inchi
+.. autofunction:: get_inchi_key
+
+**Molecular Descriptors**
+
 .. autofunction:: get_xlogp
+.. autofunction:: get_tpsa
+.. autofunction:: get_complexity
+
+**Mass Properties**
+
+.. autofunction:: get_exact_mass
+.. autofunction:: get_monoisotopic_mass
+
+**Molecular Counts**
+
+.. autofunction:: get_h_bond_donor_count
+.. autofunction:: get_h_bond_acceptor_count
+.. autofunction:: get_rotatable_bond_count
+.. autofunction:: get_heavy_atom_count
+.. autofunction:: get_charge
+
+**Stereochemistry**
+
+.. autofunction:: get_atom_stereo_count
+.. autofunction:: get_bond_stereo_count
+.. autofunction:: get_covalent_unit_count
+
+**Synonyms and Names**
+
 .. autofunction:: get_synonyms
 
 .. rubric:: Visualization Functions

docs/source/api/models.rst

Lines changed: 2 additions & 3 deletions
@@ -1,6 +1,6 @@
-==================================
+=======================================
 Data Models and Exceptions (`models`)
-==================================
+=======================================
 
 .. module:: ChemInformant.models
 
@@ -16,7 +16,6 @@ By defining clear data models, this module ensures that all data consumed by the
 :show-inheritance:
 :member-order: bysource
 :exclude-members: __init__
-:no-value:
 
 .. rubric:: Custom Exceptions
 

docs/source/api_reference.rst

Lines changed: 0 additions & 12 deletions
This file was deleted.

docs/source/basic_usage.rst

Lines changed: 29 additions & 6 deletions
@@ -1,19 +1,22 @@
-==========
+===========
 Basic Usage
-==========
+===========
 
 This guide covers the fundamental features of the ChemInformant library, designed to help users quickly get started with common chemical information query tasks.
 
 .. contents:: Contents
 :local:
 
-----------------------------------------------------
+-----------------------------------------------------
 Core Functionality: Bulk Fetching of Multiple Properties
-----------------------------------------------------
+-----------------------------------------------------
 
 The most central feature of ChemInformant is :func:`~ChemInformant.get_properties`. This function is designed for batch processing, allowing users to query multiple chemical properties for a group of compounds in a single call. This approach is significantly more efficient than querying each compound individually in a loop because it effectively consolidates network requests.
 
-The function accepts a list of various identifiers (such as common names, PubChem CIDs, or SMILES strings) and returns a structured Pandas DataFrame, ready for direct use in subsequent data analysis.
+The function accepts a list of various identifiers (such as common names, PubChem CIDs, or SMILES strings) and returns a structured Pandas DataFrame with **standardized snake_case column names**, ready for direct use in subsequent data analysis.
+
+.. note::
+**Snake_case Property Names**: ChemInformant uses consistent snake_case naming (e.g., ``molecular_weight``, ``h_bond_donor_count``) for all returned data. Both snake_case and CamelCase inputs are accepted, but output is always standardized.
 
 .. code-block:: python
 
@@ -24,7 +27,7 @@ The function accepts a list of various identifiers (such as common names, PubChe
 # (Aspirin, Caffeine, Acetaminophen)
 identifiers = ['aspirin', 'caffeine', 1983]
 
-# 2. Specify the properties you want to fetch
+# 2. Specify the properties you want to fetch (using snake_case names)
 properties_to_fetch = ['molecular_weight', 'xlogp', 'cas', 'iupac_name']
 
 # 3. Call the core function to perform the query
@@ -43,6 +46,24 @@ Output:
 caffeine 2519.0 OK 194.19 -0.07 58-08-2 1,3,7-trimethylpurine-2,6-dione
 1983 1983.0 OK 151.16 0.51 103-90-2 N-(4-hydroxyphenyl)acetamide
 
+---------------------------------------------------------
+Getting All Properties or Including 3D Descriptors
+---------------------------------------------------------
+
+ChemInformant offers convenient options to retrieve comprehensive data sets:
+
+.. code-block:: python
+
+# Get all ~40 available properties for a compound
+complete_data = ci.get_properties(['aspirin'], all_properties=True)
+print(f"Retrieved {len(complete_data.columns)} columns of data")
+
+# Get core properties plus 3D molecular descriptors
+data_with_3d = ci.get_properties(['aspirin'], include_3d=True)
+
+# Both approaches are much more efficient than multiple API calls
+
+The ``all_properties=True`` option retrieves every available property from PubChem, including core properties, 3D descriptors, and special properties like CAS numbers and synonyms. The ``include_3d=True`` option adds 3D molecular descriptors to the default core property set.
 
 ------------------------------------------------
 Getting Complete Information for a Single Compound
@@ -109,6 +130,8 @@ Output:
 CAS number for water: 7732-18-5
 Molecular formula of ethanol: C2H6O
 
+ChemInformant provides **22 convenience functions** for individual properties, covering molecular descriptors, structural features, mass properties, and identifiers. All functions return None for compounds that cannot be found, making error handling straightforward.
+
 -----------------------------
 Visualizing Compound Structures
 -----------------------------
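
Reviewer sketch for the snake_case note added to ``docs/source/basic_usage.rst``: the note says CamelCase inputs are accepted and output is always snake_case. The snippet below is a minimal illustration of how such a normalization could work, not ChemInformant's actual implementation. The function name ``to_snake_case`` and the ``ALIASES`` table are hypothetical. The CamelCase inputs are assumed to follow PubChem-style property names, and the snake_case targets are the names that appear in this diff.

```python
import re

# Hypothetical alias table: names whose documented snake_case form cannot be
# derived by a purely mechanical split on capital letters.
ALIASES = {
    "XLogP": "xlogp",
    "TPSA": "tpsa",
}


def to_snake_case(name: str) -> str:
    """Normalize a CamelCase or snake_case property name to snake_case."""
    if name in ALIASES:
        return ALIASES[name]
    # Insert "_" before every capital letter except the first character,
    # then lowercase. Names that are already snake_case pass through unchanged.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


for raw in ["MolecularWeight", "HBondDonorCount", "XLogP", "InChIKey", "molecular_weight"]:
    print(raw, "->", to_snake_case(raw))
```

A purely mechanical split also yields the ``in_ch_i`` and ``in_ch_i_key`` spellings listed in the CLI docs, while names such as ``xlogp`` and ``tpsa`` would need an explicit mapping like the one assumed here.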

docs/source/cli.rst

Lines changed: 62 additions & 12 deletions
@@ -33,18 +33,38 @@ ChemInformant provides a suite of command-line interface (CLI) tools, designed t
 
 .. option:: --props <property_list>
 
-A comma-separated list of properties to precisely specify which data to retrieve for each identifier. If the user does not provide this option, `chemfetch` will use a default set of properties: ``cas,molecular_weight,iupac_name``.
+A comma-separated list of properties to precisely specify which data to retrieve for each identifier. If the user does not provide this option, `chemfetch` will use the default core property set (20+ essential properties including molecular_weight, formula, smiles, etc.).
 
-The complete list of available properties includes:
+.. note::
+**Property Names Use snake_case Format**
+
+ChemInformant uses standardized snake_case property names (e.g., ``molecular_weight``, ``h_bond_donor_count``).
+Both snake_case and CamelCase inputs are accepted, but output is always in snake_case for consistency.
+
+The complete list of available properties includes **Core Properties** (default set):
 
-* ``cas``: CAS Registry Number, a common chemical identifier.
 * ``molecular_weight``: Molecular weight, in g/mol.
-* ``molecular_formula``: Molecular formula, indicating the number of atoms of each element in the compound.
-* ``canonical_smiles``: Canonical SMILES string.
-* ``isomeric_smiles``: Isomeric SMILES string.
+* ``molecular_formula``: Molecular formula.
+* ``canonical_smiles``, ``isomeric_smiles``: SMILES representations.
 * ``iupac_name``: The systematic name established by IUPAC.
 * ``xlogp``: Calculated octanol-water partition coefficient.
-* ``synonyms``: A list of all known synonyms.
+* ``tpsa``: Topological polar surface area.
+* ``complexity``: Molecular complexity score.
+* ``h_bond_donor_count``, ``h_bond_acceptor_count``: Hydrogen bonding properties.
+* ``rotatable_bond_count``, ``heavy_atom_count``: Molecular structure counts.
+* ``charge``: Formal molecular charge.
+* ``atom_stereo_count``, ``bond_stereo_count``: Stereochemistry information.
+* ``covalent_unit_count``: Number of covalent units.
+* ``in_ch_i``, ``in_ch_i_key``: InChI identifiers.
+* ``cas``: CAS Registry Number.
+* ``synonyms``: List of all known synonyms.
+
+**3D Properties** (available with ``--include-3d``):
+
+* ``volume_3d``: 3D molecular volume.
+* ``feature_count_3d``, ``feature_acceptor_count_3d``, etc.: 3D pharmacophore features.
+* ``conformer_count_3d``: Number of conformers.
+* And more spatial descriptors...
 
 .. option:: -f, --format <format_type>
 
@@ -55,6 +75,14 @@ ChemInformant provides a suite of command-line interface (CLI) tools, designed t
 * ``json``: JSON array output.
 * ``sql``: Writes to a SQLite database (requires ``--output``).
 
+.. option:: --include-3d
+
+Include 3D molecular descriptors in addition to the default core properties. This option is ignored when ``--props`` is specified. The 3D properties include volume_3d, feature_count_3d, conformer_count_3d, and other spatial descriptors.
+
+.. option:: --all-properties
+
+Retrieve all ~40 available properties from PubChem, including core properties, 3D descriptors, and special properties like CAS and synonyms. This option is mutually exclusive with ``--props`` and ``--include-3d``.
+
 .. option:: -o, --output <file_path>
 
 Specifies the path for the output file. Required for ``--format sql`` and ignored otherwise.
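
Reviewer sketch for the option rules documented above: ``--include-3d`` is ignored when ``--props`` is given, and ``--all-properties`` is mutually exclusive with both. The snippet below shows one way those precedence rules could be expressed with ``argparse``. It is an illustration only and not ``chemfetch``'s real parser. The names ``build_parser`` and ``resolve_selection`` and the returned ``mode`` labels are hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Only the three property-selection flags from the docs are modelled here.
    parser = argparse.ArgumentParser(prog="chemfetch-sketch")
    parser.add_argument("identifiers", nargs="+")
    parser.add_argument("--props", default=None)
    parser.add_argument("--include-3d", action="store_true")
    parser.add_argument("--all-properties", action="store_true")
    return parser


def resolve_selection(args: argparse.Namespace) -> dict:
    """Apply the documented precedence rules to the parsed flags."""
    if args.all_properties and (args.props or args.include_3d):
        raise ValueError(
            "--all-properties cannot be combined with --props or --include-3d"
        )
    if args.all_properties:
        return {"mode": "all"}
    if args.props:
        # Documented behavior: --include-3d is ignored once --props is given.
        names = [p.strip() for p in args.props.split(",") if p.strip()]
        return {"mode": "custom", "properties": names}
    return {"mode": "core+3d" if args.include_3d else "core"}


parser = build_parser()
print(resolve_selection(parser.parse_args(["aspirin", "--include-3d"])))
print(resolve_selection(parser.parse_args(["aspirin", "--props", "molecular_weight,xlogp"])))
```

Raising on the conflicting combination keeps the mutual-exclusion rule explicit, while the "ignored" rule for ``--include-3d`` falls out of the order of the checks.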
@@ -75,7 +103,29 @@ ChemInformant provides a suite of command-line interface (CLI) tools, designed t
 aspirin 2244 OK 50-78-2 180.16 2-(acetyloxy)benzoic acid
 caffeine 2519 OK 58-08-2 194.19 1,3,7-trimethylpurine-2,6-dione
 
-2. **Valid and Invalid Identifiers**
+2. **Get All Properties**
+
+.. code-block:: bash
+
+chemfetch aspirin --all-properties --format csv -o aspirin_complete.csv
+
+This retrieves all ~40 available properties for aspirin and saves to CSV.
+
+3. **Include 3D Descriptors**
+
+.. code-block:: bash
+
+chemfetch aspirin --include-3d
+
+This includes 3D molecular descriptors in addition to the core property set.
+
+4. **Custom Property Selection**
+
+.. code-block:: bash
+
+chemfetch aspirin caffeine --props "molecular_weight,xlogp,tpsa,h_bond_donor_count"
+
+5. **Valid and Invalid Identifiers**
 
 .. code-block:: bash
 
@@ -85,10 +135,10 @@ ChemInformant provides a suite of command-line interface (CLI) tools, designed t
 
 .. code-block:: text
 
-input_identifier cid status cas molecular_weight iupac_name
-caffeine 2519 OK 58-08-2 194.19 1,3,7-trimethylpurine-2,6-dione
-ThisIsA_FakeCompound <NA> NotFoundError <NA> NaN <NA>
-999999999 <NA> NotFoundError <NA> NaN <NA>
+input_identifier cid status molecular_weight xlogp cas
+caffeine 2519 OK 194.19 -0.07 58-08-2
+ThisIsA_FakeCompound <NA> NotFoundError <NA> <NA> <NA>
+999999999 <NA> NotFoundError <NA> <NA> <NA>
 
 **Using `chemfetch` in Data Processing Pipelines**
 
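
Reviewer sketch for the "Valid and Invalid Identifiers" output: failed lookups carry ``NotFoundError`` in the ``status`` column, so a downstream script can separate usable rows from failures. The snippet below assumes the same columns were written as CSV and that missing values appear as empty fields; both are assumptions about the file layout, not confirmed ``chemfetch`` behavior. The helper name ``split_by_status`` is hypothetical, and the rows are copied from the example table in this diff.

```python
import csv
import io

# Assumed CSV rendering of the example table; <NA> cells become empty fields.
CSV_TEXT = """input_identifier,cid,status,molecular_weight,xlogp,cas
caffeine,2519,OK,194.19,-0.07,58-08-2
ThisIsA_FakeCompound,,NotFoundError,,,
999999999,,NotFoundError,,,
"""


def split_by_status(csv_text: str):
    """Return (ok_rows, failed_rows) based on the 'status' column."""
    ok_rows, failed_rows = [], []
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["status"] == "OK":
            ok_rows.append(row)
        else:
            failed_rows.append(row)
    return ok_rows, failed_rows


ok_rows, failed_rows = split_by_status(CSV_TEXT)
print("resolved:", [r["input_identifier"] for r in ok_rows])
print("failed:", [(r["input_identifier"], r["status"]) for r in failed_rows])
```

Keeping the failed rows instead of dropping them makes it easy to log or retry the identifiers that could not be resolved.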
