
Adds fix for sporadic CI bug in Barnes-Hut test #1580

Draft

peterdsharpe wants to merge 4 commits into NVIDIA:main from peterdsharpe:psharpe/globe_bh_test_hotfix

Conversation

@peterdsharpe (Collaborator) commented Apr 22, 2026

PhysicsNeMo Pull Request

Description

In CI, we are seeing a sporadic bug reported by @ktangsali (example failing run):

CI trace report
______________________ test_bh_nested_source_data_keys[2] ______________________

n_dims = 2

    @dims_params
    def test_bh_nested_source_data_keys(n_dims: int):
        """Convergence with nested TensorDict keys matching GLOBE's production format.
    
        GLOBE passes source_data structured like:
            {"physical": {"velocity": ...}, "latent": {"scalars": {"0": ...},
             "vectors": {"0": ...}}, "normals": ...}
    
        The aggregation, split_by_leaf_rank, and TensorDict.cat operations must
        handle this nesting correctly.
        """
        torch.manual_seed(DEFAULT_SEED)
        n_src, n_tgt = 30, 15
    
        source_data_ranks = {
            "physical": {"pressure": 0},
            "latent": {"scalars": {"0": 0, "1": 0}, "vectors": {"0": 1}},
            "normals": 1,
        }
        output_field_ranks = {"p": 0, "u": 1}
    
        common_kwargs = dict(
            n_spatial_dims=n_dims,
            output_field_ranks={
                k: (0 if v == "scalar" else 1) for k, v in output_field_ranks.items()
            },
            source_data_ranks=source_data_ranks,
            hidden_layer_sizes=[16],
        )
    
        bh_kernel = BarnesHutKernel(**common_kwargs, leaf_size=DEFAULT_LEAF_SIZE)
        exact_kernel = Kernel(**common_kwargs)
        exact_kernel.load_state_dict(bh_kernel.state_dict(), strict=False)
        bh_kernel.eval()
        exact_kernel.eval()
    
        torch.manual_seed(DEFAULT_SEED + 1)
        source_data = TensorDict(
            {
                "physical": TensorDict(
                    {"pressure": torch.randn(n_src)},
                    batch_size=[n_src],
                ),
                "latent": TensorDict(
                    {
                        "scalars": TensorDict(
                            {"0": torch.randn(n_src), "1": torch.randn(n_src)},
                            batch_size=[n_src],
                        ),
                        "vectors": TensorDict(
                            {"0": F.normalize(torch.randn(n_src, n_dims), dim=-1)},
                            batch_size=[n_src],
                        ),
                    },
                    batch_size=[n_src],
                ),
                "normals": F.normalize(torch.randn(n_src, n_dims), dim=-1),
            },
            batch_size=[n_src],
        )
    
        data = {
            "source_points": torch.randn(n_src, n_dims),
            "target_points": torch.randn(n_tgt, n_dims) * 5,
            "source_strengths": torch.rand(n_src) + 0.1,
            "reference_length": torch.ones(()),
            "source_data": source_data,
            "global_data": TensorDict({}, batch_size=[]),
        }
    
        exact_result = exact_kernel(**data)
        bh_result = bh_kernel(**data, theta=0.01)
    
        for field_name in output_field_ranks:
>           torch.testing.assert_close(
                bh_result[field_name],
                exact_result[field_name],
                atol=1e-3,
                rtol=1e-2,
                msg=f"Nested keys: {field_name!r} not close to exact at theta=0.01",
            )
E           AssertionError: Nested keys: 'p' not close to exact at theta=0.01

test/models/globe/test_barnes_hut_kernel.py:960: AssertionError

This was possibly introduced by #1494, which recently modified these files. However, CI passed on that PR, and this bug seems to occur only sporadically.

On local machines, these tests seem to pass consistently.

For now, this PR adds diagnostic code to try to reproduce the failure on CI and gather more debugging details.
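The PR text does not show the diagnostics themselves. As a rough sketch of the kind of helper such diagnostics might use (the name, signature, and defaults here are hypothetical, chosen to mirror the test's `atol=1e-3, rtol=1e-2` tolerances), one could report how many entries are out of tolerance rather than failing with only the assertion message:

```python
def summarize_mismatch(name, got, expected, atol=1e-3, rtol=1e-2):
    """Hypothetical diagnostic helper: count entries that exceed the
    combined absolute/relative tolerance, mirroring assert_close's check."""
    bad = [
        i
        for i, (g, e) in enumerate(zip(got, expected))
        if abs(g - e) > atol + rtol * abs(e)
    ]
    return f"{name}: {len(bad)}/{len(got)} entries out of tolerance"

summarize_mismatch("p", [1.0, 2.0], [1.0, 2.5])
# → 'p: 1/2 entries out of tolerance'
```

Printing a summary like this per output field on CI would show whether the mismatch is a broad numerical drift or a few badly wrong entries, which points either at tolerances or at mispaired data.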

Review Process

All PRs are reviewed by the PhysicsNeMo team before merging.

Depending on which files are changed, GitHub may automatically assign a maintainer for review.

We are also testing AI-based code review tools (e.g., Greptile), which may add automated comments with a confidence score.
This score reflects the AI's assessment of merge readiness; it is not a qualitative judgment of your work, nor an indication that the PR will be accepted or rejected.

AI-generated feedback should be reviewed critically for usefulness.
You are not required to respond to every AI comment, but they are intended to help both authors and reviewers.
Please react to Greptile comments with 👍 or 👎 to provide feedback on their accuracy.

@peterdsharpe peterdsharpe added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Apr 22, 2026
… consistent TensorDict structure. Update test tolerances for output comparisons to enhance robustness against numerical discrepancies.
@peterdsharpe (Collaborator, Author) commented Apr 23, 2026

Root cause: tensordict shipped a new release on April 20 that changed the ordering of keys when flattening tensordicts, and we didn't guard against this. Our CI picked up the new version, which caused a previously-passing test to fail. Testing a fix.
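The failure mode can be illustrated with plain dicts (this is only a sketch of the pitfall, not the actual PhysicsNeMo fix): if one code path flattens a nested mapping relying on the library's iteration order while another path sees a different order, per-key tensors concatenated from the flattened view get paired against the wrong keys. Sorting the flattened keys explicitly removes the dependence on the library version:

```python
def flatten_keys(nested, prefix=""):
    """Flatten a nested dict into {"a.b.c": value} form, analogous to
    TensorDict.flatten_keys with a "." separator."""
    flat = {}
    for k, v in nested.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            flat.update(flatten_keys(v, prefix=key + "."))
        else:
            flat[key] = v
    return flat

def stable_leaf_order(nested):
    """Guard against version-dependent iteration order by sorting the
    flattened keys before any concatenation/splitting uses them."""
    return sorted(flatten_keys(nested))

nested = {
    "latent": {"scalars": {"0": 1, "1": 2}},
    "normals": 3,
    "physical": {"pressure": 4},
}
stable_leaf_order(nested)
# → ['latent.scalars.0', 'latent.scalars.1', 'normals', 'physical.pressure']
```

With a fixed key order on both the aggregation side and the `split_by_leaf_rank` / `TensorDict.cat` side, an upstream change to flattening order can no longer silently reorder the features.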
