
Adds fix for sporadic CI bug in Barnes-Hut test #1580

Draft

peterdsharpe wants to merge 4 commits into NVIDIA:main from peterdsharpe:psharpe/globe_bh_test_hotfix

Conversation

@peterdsharpe (Collaborator) commented Apr 22, 2026

PhysicsNeMo Pull Request

Description

In CI, we are seeing a sporadic bug reported by @ktangsali (example failing run):

CI trace report
______________________ test_bh_nested_source_data_keys[2] ______________________

n_dims = 2

    @dims_params
    def test_bh_nested_source_data_keys(n_dims: int):
        """Convergence with nested TensorDict keys matching GLOBE's production format.
    
        GLOBE passes source_data structured like:
            {"physical": {"velocity": ...}, "latent": {"scalars": {"0": ...},
             "vectors": {"0": ...}}, "normals": ...}
    
        The aggregation, split_by_leaf_rank, and TensorDict.cat operations must
        handle this nesting correctly.
        """
        torch.manual_seed(DEFAULT_SEED)
        n_src, n_tgt = 30, 15
    
        source_data_ranks = {
            "physical": {"pressure": 0},
            "latent": {"scalars": {"0": 0, "1": 0}, "vectors": {"0": 1}},
            "normals": 1,
        }
        output_field_ranks = {"p": 0, "u": 1}
    
        common_kwargs = dict(
            n_spatial_dims=n_dims,
            output_field_ranks={
                k: (0 if v == "scalar" else 1) for k, v in output_field_ranks.items()
            },
            source_data_ranks=source_data_ranks,
            hidden_layer_sizes=[16],
        )
    
        bh_kernel = BarnesHutKernel(**common_kwargs, leaf_size=DEFAULT_LEAF_SIZE)
        exact_kernel = Kernel(**common_kwargs)
        exact_kernel.load_state_dict(bh_kernel.state_dict(), strict=False)
        bh_kernel.eval()
        exact_kernel.eval()
    
        torch.manual_seed(DEFAULT_SEED + 1)
        source_data = TensorDict(
            {
                "physical": TensorDict(
                    {"pressure": torch.randn(n_src)},
                    batch_size=[n_src],
                ),
                "latent": TensorDict(
                    {
                        "scalars": TensorDict(
                            {"0": torch.randn(n_src), "1": torch.randn(n_src)},
                            batch_size=[n_src],
                        ),
                        "vectors": TensorDict(
                            {"0": F.normalize(torch.randn(n_src, n_dims), dim=-1)},
                            batch_size=[n_src],
                        ),
                    },
                    batch_size=[n_src],
                ),
                "normals": F.normalize(torch.randn(n_src, n_dims), dim=-1),
            },
            batch_size=[n_src],
        )
    
        data = {
            "source_points": torch.randn(n_src, n_dims),
            "target_points": torch.randn(n_tgt, n_dims) * 5,
            "source_strengths": torch.rand(n_src) + 0.1,
            "reference_length": torch.ones(()),
            "source_data": source_data,
            "global_data": TensorDict({}, batch_size=[]),
        }
    
        exact_result = exact_kernel(**data)
        bh_result = bh_kernel(**data, theta=0.01)
    
        for field_name in output_field_ranks:
>           torch.testing.assert_close(
                bh_result[field_name],
                exact_result[field_name],
                atol=1e-3,
                rtol=1e-2,
                msg=f"Nested keys: {field_name!r} not close to exact at theta=0.01",
            )
E           AssertionError: Nested keys: 'p' not close to exact at theta=0.01

test/models/globe/test_barnes_hut_kernel.py:960: AssertionError

This was possibly introduced by #1494, which recently modified these files. However, CI passed on that PR, and this bug seems to occur only sporadically.

On local machines, these tests seem to pass consistently.

For now, this PR adds diagnostic code to try to reproduce the failure on CI and gather more debugging details.
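The PR text does not show the diagnostics themselves. As a rough sketch of the kind of helper such diagnostics might use (the name, signature, and defaults here are hypothetical, chosen to mirror the test's `atol=1e-3, rtol=1e-2` tolerances), one could report how many entries are out of tolerance rather than failing with only the assertion message:

```python
def summarize_mismatch(name, got, expected, atol=1e-3, rtol=1e-2):
    """Hypothetical diagnostic helper: count entries that exceed the
    combined absolute/relative tolerance, mirroring assert_close's check."""
    bad = [
        i
        for i, (g, e) in enumerate(zip(got, expected))
        if abs(g - e) > atol + rtol * abs(e)
    ]
    return f"{name}: {len(bad)}/{len(got)} entries out of tolerance"

summarize_mismatch("p", [1.0, 2.0], [1.0, 2.5])
# → 'p: 1/2 entries out of tolerance'
```

Printing a summary like this per output field on CI would show whether the mismatch is a broad numerical drift or a few badly wrong entries, which points either at tolerances or at mispaired data.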

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@peterdsharpe peterdsharpe added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Apr 22, 2026
… consistent TensorDict structure. Update test tolerances for output comparisons to enhance robustness against numerical discrepancies.
@peterdsharpe (Collaborator, Author) commented Apr 23, 2026

Root cause: tensordict shipped a new release on April 20 that changed the ordering of keys when flattening tensordicts, and we didn't guard against this. Our CI picked up the new version, which caused a previously-passing test to fail. Testing a fix.
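The failure mode can be illustrated with plain dicts (this is only a sketch of the pitfall, not the actual PhysicsNeMo fix): if one code path flattens a nested mapping relying on the library's iteration order while another path sees a different order, per-key tensors concatenated from the flattened view get paired against the wrong keys. Sorting the flattened keys explicitly removes the dependence on the library version:

```python
def flatten_keys(nested, prefix=""):
    """Flatten a nested dict into {"a.b.c": value} form, analogous to
    TensorDict.flatten_keys with a "." separator."""
    flat = {}
    for k, v in nested.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(flatten_keys(v, prefix=key + "."))
        else:
            flat[key] = v
    return flat

def stable_leaf_order(nested):
    """Guard against version-dependent iteration order by sorting the
    flattened keys before any concatenation/splitting uses them."""
    return sorted(flatten_keys(nested))

nested = {
    "latent": {"scalars": {"0": 1, "1": 2}},
    "normals": 3,
    "physical": {"pressure": 4},
}
stable_leaf_order(nested)
# → ['latent.scalars.0', 'latent.scalars.1', 'normals', 'physical.pressure']
```

With a fixed key order on both the aggregation side and the `split_by_leaf_rank` / `TensorDict.cat` side, an upstream change to flattening order can no longer silently reorder the features.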
