Add TBE data configuration reporter to TBE forward (v3) (#4672)

gchalump · facebook-github-bot · commit 02d1bbcfc536 · 2025-08-20T00:55:33.000-07:00
Summary:
X-link: facebookresearch/FBGEMM#1703

Pull Request resolved: #4672

X-link: facebookresearch/FBGEMM#1516

Pull Request resolved: #4455

Re-land attempt of D75462895

# Add TBE data configuration reporter to TBE forward call

The reporter reports the TBE data configuration at the `SplitTableBatchedEmbeddingBagsCodegen` ***forward*** call. The output is a `TBEDataConfig` object, which is written to one or more JSON files. The environment variables that configure the reporter and an example of its usage are described below.

## Just Knobs for enablement

- `fbgemm_gpu/features:TBE_REPORT_INPUT_PARAMS` is added to enable the reporter (https://www.internalfb.com/intern/justknobs/?name=fbgemm_gpu%2Ffeatures)
  - The default is `False`; turn this flag on to enable the reporter.
  - To enable it locally, use:
    ```
    jk canary set fbgemm_gpu/features:TBE_REPORT_INPUT_PARAMS --on --ttl 600
    ```

## Environment Variables

The reporter relies on several environment variables to control its behavior. Below is a description of each variable:

- **FBGEMM_REPORT_INPUT_PARAMS_INTERVAL**:
  - **Description**: Determines the interval at which reports are generated, specified as a number of iterations.
  - **Example Value**: `1` (report every iteration)
- **FBGEMM_REPORT_INPUT_PARAMS_ITER_START**:
  - **Description**: Specifies the start of the iteration range to capture reports. Default `0`.
  - **Example Value**: `0` (start reporting from the first iteration)
- **FBGEMM_REPORT_INPUT_PARAMS_ITER_END**:
  - **Description**: Specifies the end of the iteration range to capture reports. Use `-1` to report until the last iteration. Default `-1`.
  - **Example Value**: `-1` (report until the last iteration)
- **FBGEMM_REPORT_INPUT_PARAMS_BUCKET**:
  - **Description**: Specifies the name of the Manifold bucket where the report data will be saved.
  - **Example Value**: `tlparse_reports`
- **FBGEMM_REPORT_INPUT_PARAMS_PATH_PREFIX**:
  - **Description**: Defines the path prefix where the report files will be stored. The path is created if it does not exist.
  - **Example Value**: `tree/tests/`

## Use Cases

- FileStore
  - General
    - Auto-creates output directories if they do not exist.
  - fb-internal
    - Only exports to Manifold.
    - Raises an assertion error if the flag is set but the Manifold connection fails to initialize (missing backend, or the Manifold bucket does not exist).
  - OSS
    - Uses the local FileStore to store the output.

## Example Usage

Below is an example command demonstrating how to use the FBGEMM reporter with specific environment variable settings:

```
FBGEMM_REPORT_INPUT_PARAMS_INTERVAL=2 \
FBGEMM_REPORT_INPUT_PARAMS_ITER_START=3 \
FBGEMM_REPORT_INPUT_PARAMS_BUCKET=tlparse_reports \
FBGEMM_REPORT_INPUT_PARAMS_PATH_PREFIX=tree/tests/ \
buck2 run mode/opt //deeplearning/fbgemm/fbgemm_gpu/bench:split_table_batched_embeddings -- device --iters 2
```

**Explanation**

The above settings report `iter 3` and `iter 5`:

* **FBGEMM_REPORT_INPUT_PARAMS_INTERVAL=2**: The reporter generates a report every 2 iterations.
* **FBGEMM_REPORT_INPUT_PARAMS_ITER_START=3**: The reporter starts generating reports at iteration 3.
* **FBGEMM_REPORT_INPUT_PARAMS_ITER_END=-1 (Default)**: The reporter continues to generate reports until the last iteration.
* **FBGEMM_REPORT_INPUT_PARAMS_BUCKET=tlparse_reports**: The reports are saved in the `tlparse_reports` bucket.
* **FBGEMM_REPORT_INPUT_PARAMS_PATH_PREFIX=tree/tests/**: The reports are stored with the path prefix `tree/tests/`. For Manifold, make sure all folders within the path exist.

**Note on Benchmark example**

With the `--iters 2` option, the benchmark executes 3 forward calls (2 iterations plus 1 warmup) for the forward benchmark and another 3 calls (2 iterations plus 1 warmup) for the backward benchmark, for 6 forward calls in total. Iterations start from 0.
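The iteration arithmetic in the example can be checked with a small sketch. This is not the reporter's actual implementation, which this diff does not show: `ReportSchedule` and `should_report` are hypothetical names, and the selection rule (report when the iteration lies inside the configured range and is a multiple of the interval counted from the start) is an assumption chosen to match the `iter 3` and `iter 5` outcome described above. Treating an unset interval as `0` (reporting disabled) is also an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class ReportSchedule:
    """Hypothetical holder for the iteration-related settings (not an FBGEMM class)."""

    interval: int
    iter_start: int = 0
    iter_end: int = -1  # -1 means "until the last iteration"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "ReportSchedule":
        # Variable names come from the commit message; defaults for
        # ITER_START and ITER_END follow the documented defaults, while
        # the interval default of 0 is an assumption.
        return cls(
            interval=int(env.get("FBGEMM_REPORT_INPUT_PARAMS_INTERVAL", "0")),
            iter_start=int(env.get("FBGEMM_REPORT_INPUT_PARAMS_ITER_START", "0")),
            iter_end=int(env.get("FBGEMM_REPORT_INPUT_PARAMS_ITER_END", "-1")),
        )

    def should_report(self, iteration: int) -> bool:
        # Assumed rule: inside [iter_start, iter_end] and aligned to the interval.
        if self.interval <= 0 or iteration < self.iter_start:
            return False
        if self.iter_end != -1 and iteration > self.iter_end:
            return False
        return (iteration - self.iter_start) % self.interval == 0


# Settings from the example command; `--iters 2` yields 6 forward calls (0..5).
env = {
    "FBGEMM_REPORT_INPUT_PARAMS_INTERVAL": "2",
    "FBGEMM_REPORT_INPUT_PARAMS_ITER_START": "3",
}
schedule = ReportSchedule.from_env(env)
reported = [i for i in range(6) if schedule.should_report(i)]
print(reported)  # [3, 5]
```

Passing the environment as a mapping keeps the sketch testable without mutating `os.environ`.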
---

## Other changes included in this diff

- Updates the build dependencies of the tbe_data_config* files.
- Removes the `shutil` and `numpy.random` libraries because they cause an incompatibility error.
- Adds a non-OSS test that writes the extracted config data JSON file to Manifold.

Differential Revision: D79758603
diff --git a/fbgemm_gpu/fbgemm_gpu/split_table_batched_embeddings_ops_training.py b/fbgemm_gpu/fbgemm_gpu/split_table_batched_embeddings_ops_training.py
@@ -1441,6 +1441,11 @@ def __init__(  # noqa C901
             self._debug_print_input_stats_factory()
         )
 
+        # Get the reporter callable (None when reporting is disabled)
+        self._report_input_params: Optional[Callable[..., None]] = (
+            self.__report_input_params_factory()
+        )
+
         if optimizer == OptimType.EXACT_SGD and self.use_writeback_bwd_prehook:
             # Register writeback hook for Exact_SGD optimizer
             self.log(
@@ -1953,6 +1958,19 @@ def forward(  # noqa: C901
         # Print input stats if enable (for debugging purpose only)
         self._debug_print_input_stats(indices, offsets, per_sample_weights)
 
+        # Extract and write input stats if enabled
+        if self._report_input_params is not None:
+            self._report_input_params(
+                feature_rows=self.rows_per_table,
+                feature_dims=self.feature_dims,
+                iteration=self.iter_cpu.item() if hasattr(self, "iter_cpu") else 0,
+                indices=indices,
+                offsets=offsets,
+                op_id=self.uuid,
+                per_sample_weights=per_sample_weights,
+                batch_size_per_feature_per_rank=batch_size_per_feature_per_rank,
+            )
+
         if not is_torchdynamo_compiling():
             # Mutations of nn.Module attr forces dynamo restart of Analysis which increases compilation time
 
@@ -3829,6 +3847,30 @@ def _debug_print_input_stats_factory_null(
             return _debug_print_input_stats_factory_impl
         return _debug_print_input_stats_factory_null
 
+    @torch.jit.ignore
+    def __report_input_params_factory(
+        self,
+    ) -> Optional[Callable[..., None]]:
+        """
+        This function returns a function pointer based on the feature gate `TBE_REPORT_INPUT_PARAMS`.
+
+        If the feature gate is enabled, it returns a function pointer (`TBEBenchmarkParamsReporter.report_stats`) that:
+        - Reports input parameters (TBEDataConfig).
+        - Writes the output as a JSON file.
+
+        If the feature gate is disabled, or creating the reporter raises an exception, it returns `None`.
+        """
+        try:
+            if self._feature_is_enabled(FeatureGateName.TBE_REPORT_INPUT_PARAMS):
+                from fbgemm_gpu.tbe.stats import TBEBenchmarkParamsReporter
+
+                reporter = TBEBenchmarkParamsReporter.create()
+                return reporter.report_stats
+        except Exception:
+            return None
+
+        return None
+
 
 class DenseTableBatchedEmbeddingBagsCodegen(nn.Module):
     """
diff --git a/fbgemm_gpu/test/tbe/stats/tbe_bench_params_reporter_test.py b/fbgemm_gpu/test/tbe/stats/tbe_bench_params_reporter_test.py
@@ -8,12 +8,15 @@
 # pyre-strict
 
 import unittest
+from typing import Optional
+from unittest.mock import patch
 
 import fbgemm_gpu
 
 import hypothesis.strategies as st
 
 import torch
+from fbgemm_gpu.config import FeatureGateName
 from fbgemm_gpu.split_table_batched_embeddings_ops_common import (
     ComputeDevice,
     EmbeddingLocation,
@@ -38,6 +41,7 @@
 from hypothesis import given, settings
 
 from .. import common  # noqa E402
+from ..common import running_in_oss
 
 try:
     # pyre-fixme[16]: Module `fbgemm_gpu` has no attribute `open_source`.
@@ -147,6 +151,104 @@ def test_report_stats(
             == tbeconfig.indices_params.offset_dtype
         ), "Extracted config does not match the original TBEDataConfig"
 
+    # pyre-ignore[56]
+    @given(
+        T=st.integers(1, 10),
+        E=st.integers(100, 10000),
+        D=st.sampled_from([32, 64, 128, 256]),
+        L=st.integers(1, 10),
+        B=st.integers(20, 100),
+    )
+    @settings(max_examples=1, deadline=None)
+    @unittest.skipIf(*running_in_oss)
+    def test_report_fb_files(
+        self,
+        T: int,
+        E: int,
+        D: int,
+        L: int,
+        B: int,
+    ) -> None:
+        """
+        Test writing extracted TBEDataConfig to FB FileStore
+        """
+        from fbgemm_gpu.fb.utils.manifold_wrapper import FileStore
+
+        # Manifold bucket and path prefix where the report is expected
+        bucket = "tlparse_reports"
+        path_prefix = "tree/unit_tests/"
+
+        # Generate a TBEDataConfig
+        tbeconfig = TBEDataConfig(
+            T=T,
+            E=E,
+            D=D,
+            mixed_dim=False,
+            weighted=False,
+            batch_params=BatchParams(B=B),
+            indices_params=IndicesParams(
+                heavy_hitters=torch.tensor([]),
+                zipf_q=0.1,
+                zipf_s=0.1,
+                index_dtype=torch.int64,
+                offset_dtype=torch.int64,
+            ),
+            pooling_params=PoolingParams(L=L),
+            use_cpu=not torch.cuda.is_available(),
+        )
+
+        embedding_location = (
+            EmbeddingLocation.DEVICE
+            if torch.cuda.is_available()
+            else EmbeddingLocation.HOST
+        )
+
+        # Generate the embedding dimension list
+        _, Ds = generate_embedding_dims(tbeconfig)
+
+        with patch(
+            "torch.ops.fbgemm.check_feature_gate_key"
+        ) as mock_check_feature_gate_key:
+            # Mock the return value for TBE_REPORT_INPUT_PARAMS
+            def side_effect(feature_name: str) -> Optional[bool]:
+                if feature_name == FeatureGateName.TBE_REPORT_INPUT_PARAMS.name:
+                    return True
+
+            mock_check_feature_gate_key.side_effect = side_effect
+
+            # Generate the embedding operation
+            embedding_op = SplitTableBatchedEmbeddingBagsCodegen(
+                [
+                    (
+                        tbeconfig.E,
+                        D,
+                        embedding_location,
+                        (
+                            ComputeDevice.CUDA
+                            if torch.cuda.is_available()
+                            else ComputeDevice.CPU
+                        ),
+                    )
+                    for D in Ds
+                ],
+            )
+
+            embedding_op = embedding_op.to(get_device())
+
+            # Generate indices and offsets
+            request = generate_requests(tbeconfig, 1)[0]
+
+            # Execute the embedding operation with the reporting flag enabled
+            embedding_op.forward(request.indices, request.offsets)
+
+            # Check if the file was written to Manifold
+            store = FileStore(bucket)
+            path = f"{path_prefix}tbe-{embedding_op.uuid}-config-estimation-{embedding_op.iter_cpu.item()}.json"
+            assert store.exists(path), f"{path} does not exist"
+
+            # Cleanup: delete the file
+            store.remove(path)
+
 
 if __name__ == "__main__":
     unittest.main()