Jk/log grad norms/log grad norms #1068
Conversation
src/weathergen/train/trainer.py
| """ | ||
| self.last_grad_norm = ( | ||
| total_norm.full_tensor().item() if self.cf.world_size > 1 else total_norm.item() | ||
| ) |
As mentioned here, `full_tensor().item()` is needed in parallel runs with FSDP2. I tested this by logging both ways of calculating:

```
000 : 00010/02048 : 000010 : loss = 1.0287E+00 (lr=1.64E-06, gradient norm=0.983, gradient norm FT=1.403, s/sec=0.236)
ERA5 : 1.0287E+00
000 : 00020/02048 : 000020 : loss = 1.0101E+00 (lr=3.34E-06, gradient norm=0.587, gradient norm FT=0.817, s/sec=0.435)
ERA5 : 1.0101E+00
```
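The commit list at the bottom of this PR mentions a `get_tensor_item` helper and a later switch to checking for `DTensor` instead of `world_size`. A minimal sketch of that idea, assuming PyTorch's `DTensor` API (public under `torch.distributed.tensor` in recent releases, `torch.distributed._tensor` in older ones):

```python
import torch
from torch.distributed.tensor import DTensor


def get_tensor_item(t: torch.Tensor) -> float:
    # Under FSDP2 each rank holds only a shard of the parameters, so
    # reductions like .norm() come back as a DTensor; .full_tensor()
    # all-gathers the shards before .item() reads the global value.
    if isinstance(t, DTensor):
        return t.full_tensor().item()
    return t.item()
```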
```python
    if self.cf.world_size > 1
    else param.grad.norm().item()
)
```
Same as above, we also need `.full_tensor().item()` here in multi-GPU mode. Tested it by printing both versions on 2 GPUs:

```python
print(".item():", param.grad.norm().item())
print(".full_tensor().item()", param.grad.norm().full_tensor().item())
```

```
.item(): 0.028306283056735992
.item(): 0.022433193400502205
.full_tensor().item() 0.03611777722835541
.full_tensor().item() 0.03611777722835541
```
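As a sanity check on these numbers: the shards are disjoint, so the per-rank 2-norms should combine as the root of the sum of squares, and indeed sqrt(0.02831² + 0.02243²) ≈ 0.03612, matching the `.full_tensor()` value on both ranks. This confirms that plain `.item()` only sees the local shard's norm.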
src/weathergen/train/trainer.py
```python
grad_norms["grad_norm_" + name] = (
    param.grad.norm().full_tensor().item()
    if self.cf.world_size > 1
    else param.grad.norm().item()
)
```
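For context, a hedged reconstruction of the per-parameter loop this expression sits in; only the dict-entry expression is taken from the diff, while the surrounding names (`self.model`, the `None` check) are assumptions:

```python
# Hypothetical surrounding loop; the dict-entry expression matches the diff,
# the rest is illustrative.
grad_norms: dict[str, float] = {}
for name, param in self.model.named_parameters():
    if param.grad is None:  # frozen or unused parameters have no gradient
        continue
    grad_norms["grad_norm_" + name] = (
        param.grad.norm().full_tensor().item()
        if self.cf.world_size > 1
        else param.grad.norm().item()
    )
```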
Shouldn't you divide by the number of items in the gradient? Otherwise, if every component of the gradient is equal to some value g, the 2-norm scales as g·sqrt(n) with the number of elements n, and you are biased by batching computations.
Not sure I follow. But as far as I know the gradient norm logging is correct, and people do not commonly account for the number of dimensions; as for batching, this is handled in the forward pass and thus is automatically dealt with during backprop.
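A toy illustration of that last point, assuming the usual setup where the loss is a mean over the batch (this example is mine, not from the PR):

```python
import torch

# Because the loss averages over the batch in the forward pass, the
# 1/batch_size factor is already baked into the gradients.
x = torch.randn(8, 4)                  # batch of 8 samples, 4 features
w = torch.randn(4, requires_grad=True)
loss = (x @ w).pow(2).mean()           # mean over the batch
loss.backward()
# w.grad equals (2 / 8) * x.T @ (x @ w): the batching factor arrives via
# backprop, so the logged gradient norm needs no extra division.
```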
Minor things to fix in the comments; I trust they will happen and thus already approve the PR. If logging is off, there should be no effect on the runs.
@sophie-xhonneux, @tjhunter, should we keep the …

@tjhunter, here is a sample of the metrics logs when gradient logging is on. Should we log them in a separate file? Happy to have a quick chat on that whenever you are available.
* Log gradient norms
* Prototype for recording grad norms
* Address review changes + hide behind feature flag
* Final fixes including backward compatibility
* Ruff
* More ruff stuff
* forecast config with small decoder
* fixed uv.lock
* test gradient logging on multi GPUs
* update uv.lock to latest develop version
* revert to default config
* add comment on FSDP2 specifics
* move plot grad script to private repo
* rm seaborn from pyproject
* updating terminal and metrics logging, add get_tensor_item fct
* check for DTensor instead of world size
* revert forecast fct, fix in separate PR
* rename grad_norm log names to exclude from MLFlow
* add log_grad_norms to default config

Co-authored-by: sophiex <[email protected]>
Description
This PR is based on @sophie-xhonneux's log_grad_norm branch in #685, modified to allow logging gradients when running in parallel on multiple GPUs with FSDP2.
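Per the commit list above, the feature is hidden behind a feature flag and `log_grad_norms` was added to the default config. A sketch of how the gate might look, assuming the `self.cf.*` config style visible in the diff excerpts and reusing the `get_tensor_item` sketch from earlier (the structure is illustrative, not the repo's exact code):

```python
# Hypothetical gate; `log_grad_norms` is the flag named in the commit list,
# everything else here is illustrative.
if getattr(self.cf, "log_grad_norms", False):
    grad_norms = {
        "grad_norm_" + name: get_tensor_item(param.grad.norm())
        for name, param in self.model.named_parameters()
        if param.grad is not None
    }
```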
Issue Number
Closes #688
Checklist before asking for review
- `./scripts/actions.sh lint`
- `./scripts/actions.sh unit-test`
- `./scripts/actions.sh integration-test`
- `launch-slurm.py --time 60`