
Conversation

aditya0by0
Member

@aditya0by0 aditya0by0 self-assigned this May 7, 2025
@aditya0by0 aditya0by0 requested a review from sfluegel05 May 7, 2025 15:06
@aditya0by0
Member Author

I tried running the regular GNN model with an undirected graph. It looks like, for undirected graphs, the number of edge attributes needs to be doubled so that each undirected edge (stored as two directed edges in edge_index) has a matching attribute.
We also need to check that the right edge_attr row is mapped to the right edge.

https://wandb.ai/chebai/chebai/runs/fu1tcxf4/logs

[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/home/staff/a/akhedekar/miniconda3/envs/gnn/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1750, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/home/staff/a/akhedekar/miniconda3/envs/gnn/lib/python3.10/site-packages/torch_geometric/nn/conv/res_gated_graph_conv.py", line 128, in forward
[rank0]:     out = self.propagate(edge_index, k=k, q=q, v=v, edge_attr=edge_attr)
[rank0]:   File "/home/staff/a/akhedekar/atmp_dir/torch_geometric.nn.conv.res_gated_graph_conv_ResGatedGraphConv_propagate_x5frmxhf.py", line 231, in propagate
[rank0]:     out = self.message(
[rank0]:   File "/home/staff/a/akhedekar/miniconda3/envs/gnn/lib/python3.10/site-packages/torch_geometric/nn/conv/res_gated_graph_conv.py", line 144, in message
[rank0]:     k_i = self.lin_key(torch.cat([k_i, edge_attr], dim=-1))
[rank0]: RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 2438 but got size 1219 for tensor number 1 in the list.
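
The size mismatch (2438 = 2 × 1219) matches that diagnosis: edge_index contains both directions of each bond, while edge_attr still has one row per bond. A minimal sketch of how the attributes can be duplicated alongside the edges, using PyTorch Geometric's `to_undirected` utility (not necessarily the fix applied in this PR):

```python
import torch
from torch_geometric.utils import to_undirected

# One directed edge per bond and one attribute row per bond.
edge_index = torch.tensor([[0, 1, 2],
                           [1, 2, 3]])
edge_attr = torch.randn(3, 8)  # hypothetical 8-dimensional bond features

# to_undirected inserts the reverse edges and duplicates the attributes,
# so edge_attr.size(0) matches edge_index.size(1) again.
edge_index, edge_attr = to_undirected(edge_index, edge_attr)
print(edge_index.size(1), edge_attr.size(0))  # 6 6
```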

@aditya0by0
Member Author

Training has been started for this change: https://wandb.ai/chebai/chebai/runs/9xjpb6wi?nw=nwuseraditya0by0

Another job has been started with 2 GPUs: https://wandb.ai/chebai/chebai/runs/ejg3ksex?nw=nwuseraditya0by0

@aditya0by0
Member Author

@sfluegel05,

The training seems to be quite slow. I'm wondering if all of the following properties were actually used in the original training setup. Could you please share the corresponding Weights & Biases (wandb) link for the original run?

Encoding lengths are: 
[('AtomAromaticity', 1), 
 ('AtomCharge', 13), 
 ('AtomHybridization', 7), 
 ('AtomNumHs', 7), 
 ('AtomType', 119), 
 ('BondAromaticity', 1), 
 ('BondInRing', 1), 
 ('BondType', 5), 
 ('NumAtomBonds', 11), 
 ('RDKit2DNormalized', 200)]

@sfluegel05
Contributor

Hi, I can confirm that the properties were used in actual runs, e.g. this one: https://wandb.ai/chebai/chebai/runs/cxmgl4eb (the technical setup is not the same one we use now, making it hard to compare, but I would expect it to get better with our current setup, not worse).

The bottleneck for this model is the creation of the dataset (especially RDKit2DNormalized). But once that is done, I would expect normal-ish behaviour during training.

instead of the base data module, as the `load_processed_data_from_file` method used in this class is available in the dynamic dataset class
@aditya0by0
Member Author

@sfluegel05, please find the training run for this fix below.
https://wandb.ai/chebai/chebai/runs/7h1icve9?nw=nwuseraditya0by0

Please review and merge.

@aditya0by0 aditya0by0 added the bug Something isn't working label May 25, 2025
@aditya0by0 aditya0by0 mentioned this pull request May 28, 2025
@aditya0by0 aditya0by0 added bug:fix fix for bug and removed bug Something isn't working labels May 28, 2025
@sfluegel05
Contributor

I'm not sure what the run is telling me. Training seems to be doing fine, but the macro-F1 is a few percent lower compared to https://wandb.ai/chebai/chebai/runs/0oksfx9u

Does that mean that undirected is worse than directed? Or am I missing something here?

@aditya0by0
Member Author

(screenshot)
I noticed that you added a comment on this run. Is there something different you did with the token limit?

@aditya0by0
Member Author

aditya0by0 commented May 30, 2025

Also, for your run, have you changed the matmul precision? For my runs, I haven't changed anything related to precision, so the default precision for my runs is "highest".

You are using a CUDA device ('NVIDIA H100 80GB HBM3') that has Tensor Cores. To properly utilize them, you should set `torch.set_float32_matmul_precision('medium' | 'high')` which will trade-off precision for performance. For more details, read https://pytorch.org/docs/stable/generated/torch.set_float32_matmul_precision.html#torch.set_float32_matmul_precision
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/1
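
For reference, the setting mentioned in the warning is a one-liner applied before training starts (shown here only as a sketch; it was not changed for my runs):

```python
import torch

# Trade float32 matmul precision for speed on Tensor Core GPUs.
# The default is "highest" (full float32); "high"/"medium" enable TF32.
torch.set_float32_matmul_precision("high")
```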

@sfluegel05
Contributor

> I noticed that you added a comment on this run. Is there something different you did with the token limit?

The token_limit was only for the Electra model, not for the GNN. So no token limit was applied.

> Also, for your run, have you changed the matmul precision? For my runs, I haven't changed anything related to precision, so the default precision for my runs is "highest".

I have been using the default precision for my run as well (32-true).

@aditya0by0
Member Author

Directed: https://wandb.ai/chebai/chebai/runs/5yhpkxci/overview
Undirected: https://wandb.ai/chebai/chebai/runs/dlt1iug5/overview

| Metric | Directed | Undirected |
| --- | --- | --- |
| Train Loss (epoch) | 0.00087 | 0.00140 |
| Train Loss (step) | 0.00077 | 0.00218 |
| Train Macro-F1 | 0.9603 | 0.9274 |
| Train Micro-F1 | 0.9903 | 0.9830 |
| Global Step | 62,799 | 62,799 |
| Val Loss (epoch) | 0.02067 | 0.01783 |
| Val Loss (step) | 0.01475 | 0.00697 |
| Val Macro-F1 | 0.6810 | 0.6635 |
| Val Micro-F1 | 0.9094 | 0.9067 |

@aditya0by0
Member Author

I repeated the training on the same GPU type after making the training deterministic, and the results are consistent with the earlier observation:
➡️ Directed graphs outperform undirected graphs.

Undirected Graph (Deterministic runs):

Directed Graph (Deterministic runs):
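
For context, a minimal sketch of how training can be made deterministic, assuming the usual PyTorch Lightning flags (the exact configuration used for these runs may differ):

```python
import pytorch_lightning as pl

# Seed Python, NumPy and torch (including dataloader workers) and
# request deterministic algorithms from PyTorch.
pl.seed_everything(42, workers=True)
trainer = pl.Trainer(deterministic=True)
```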

@aditya0by0
Member Author

aditya0by0 commented Jul 29, 2025

I have a few hypotheses for why directed graphs perform better than undirected graphs for a GNN end-to-end classification task.

  • Two undirected graphs with the same molecular structure (but with different atom types and node features) are more likely to produce similar graph-level representations after 5 GNN convolution layers than their directed counterparts.
  • This is because undirected graphs inherently preserve more structural symmetry, making distinct molecules appear isomorphic to the model, especially in the absence of rich node features. Consequently, undirected graphs increase the likelihood of representation collapse, where different molecules are mapped to similar embeddings.

  • Additionally, atoms with the same number and type of neighbors (e.g., carbons in aromatic rings) are more prone to receiving identical embeddings in undirected graphs due to symmetric message passing, particularly over deeper GNN stacks (e.g., 5 layers).
  • In contrast, directed graphs break this symmetry and allow more diverse and discriminative representations, even when directionality is assigned arbitrarily: under a directed graph, the same atom has fewer (and different) incoming neighbors even if they are of the same type, so the amount of aggregation is reduced for the same number of convolution layers.

E.g., the aspirin molecule (information flows from left to right due to RDKit's internal atom index numbering):
(figure: aspirin molecule with RDKit atom indices)

Bonds:
	Bond index: 0, Atoms: (0, 1), Type: SINGLE
	Bond index: 1, Atoms: (1, 2), Type: DOUBLE
	Bond index: 2, Atoms: (1, 3), Type: SINGLE
	Bond index: 3, Atoms: (3, 4), Type: SINGLE
	Bond index: 4, Atoms: (4, 5), Type: AROMATIC
	Bond index: 5, Atoms: (5, 6), Type: AROMATIC
...
	Bond index: 9, Atoms: (9, 10), Type: SINGLE
	Bond index: 10, Atoms: (10, 11), Type: DOUBLE
	Bond index: 11, Atoms: (10, 12), Type: SINGLE
	Bond index: 12, Atoms: (9, 4), Type: AROMATIC
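
A minimal sketch (assuming RDKit and PyTorch) of how such a bond listing can be reproduced and how the directed vs. undirected edge_index would differ; this is illustrative, not the repository's actual `_read_data` implementation:

```python
import torch
from rdkit import Chem

# Aspirin (acetylsalicylic acid); atom/bond indices follow RDKit's internal
# numbering, so the exact order may differ from the listing above.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

print("Bonds:")
for bond in mol.GetBonds():
    print(f"\tBond index: {bond.GetIdx()}, "
          f"Atoms: ({bond.GetBeginAtomIdx()}, {bond.GetEndAtomIdx()}), "
          f"Type: {bond.GetBondType()}")

# Directed variant: one edge per bond, begin -> end only.
src = [b.GetBeginAtomIdx() for b in mol.GetBonds()]
dst = [b.GetEndAtomIdx() for b in mol.GetBonds()]
directed_edge_index = torch.tensor([src, dst])

# Undirected variant: both directions, so the number of edges (and the
# number of edge_attr rows) doubles.
undirected_edge_index = torch.tensor([src + dst, dst + src])

print(directed_edge_index.size(1), undirected_edge_index.size(1))  # 13 26
```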

Successfully merging this pull request may close these issues.

Clarification on Directed vs Undirected Graph Construction in _read_data of GraphPropertyReader