[not for land] debug accuracy logging for float8 training #2701

vkuzo · 2025-08-06T14:19:29Z

Summary:

Adds a lightweight debug flag that logs the SQNR (signal-to-quantization-noise ratio)
between the float8 gemm output and the bf16 gemm output.
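
For readers unfamiliar with the metric, here is a minimal sketch of how an SQNR in decibels is commonly computed between a reference output and a lower-precision approximation. It is written in plain Python so it runs without a GPU; the function name `sqnr_db` and the sample values are hypothetical, and the logging code in this PR may compute the metric differently.

```python
import math


def sqnr_db(reference, approx):
    """Return the SQNR in dB: 10 * log10(signal power / noise power)."""
    if len(reference) != len(approx):
        raise ValueError("reference and approx must have the same length")
    # Signal power: sum of squares of the high-precision reference values.
    signal_power = sum(r * r for r in reference)
    # Noise power: sum of squared differences between reference and approximation.
    noise_power = sum((r - a) ** 2 for r, a in zip(reference, approx))
    if noise_power == 0.0:
        return math.inf  # identical inputs: no quantization noise
    if signal_power == 0.0:
        return -math.inf  # all-zero reference with a nonzero error
    return 10.0 * math.log10(signal_power / noise_power)


# Hypothetical values standing in for a bf16 gemm output and its float8 counterpart.
reference = [1.0, -2.0, 3.0, -4.0]
approx = [1.1, -1.9, 3.1, -3.9]  # every element is off by 0.1
print(f"sqnr: {sqnr_db(reference, approx):.2f} dB")  # roughly 28.75 dB
```

Higher values mean the low-precision result tracks the reference more closely; each additional 20 dB corresponds to a 10x smaller error norm relative to the signal norm.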

Two-step usage:

1. Set the `_enable_debug_logging` option on `Float8LinearConfig`.
2. After the model is converted to float8, call `_populate_debug_fqns` to populate the debug FQNs.

example usage:

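import torch
import torch.nn as nn
from torchao.float8.config import Float8LinearConfig
from torchao.float8.float8_linear_utils import _populate_debug_fqns, convert_to_float8_training

x = torch.randn(1, 16, 16, device="cuda", dtype=torch.bfloat16)
m = nn.Sequential(
    nn.Linear(16, 32, bias=False, device="cuda", dtype=torch.bfloat16),
    nn.Sequential(
        nn.ReLU(),
        nn.Linear(32, 64, bias=False, device="cuda", dtype=torch.bfloat16),
    ),
)
# recipe_name was undefined in the original snippet; "rowwise" is an assumed example value
recipe_name = "rowwise"
config = Float8LinearConfig.from_recipe_name(recipe_name)
# step 1: enable debug logging (object.__setattr__ bypasses the frozen config dataclass)
object.__setattr__(config, "_enable_debug_logging", True)
m = convert_to_float8_training(m, config=config)
# step 2: populate the debug FQNs after the model has been converted
_populate_debug_fqns(m)
m = torch.compile(m)
y = m(x)
# the backward pass exercises the grad_input and grad_weight gemms
y.sum().backward()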

Test Plan:

> pytest test/float8/test_base.py -s -x -k test_debug_logging
...
test/float8/test_base.py fqn: 0, gemm_name: output, sqnr: 29.125
fqn: 1.1, gemm_name: output, sqnr: 28.5
fqn: 1.1, gemm_name: grad_input, sqnr: 33.5
fqn: 1.1, gemm_name: grad_weight, sqnr: 38.5
fqn: 0, gemm_name: grad_input, sqnr: 22.5
fqn: 0, gemm_name: grad_weight, sqnr: 23.875
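
When scanning a longer run, it can help to collect these lines into a table and look for the weakest gemm. The sketch below parses the log format shown above, assuming each record keeps the `fqn: ..., gemm_name: ..., sqnr: ...` shape. The helper name `parse_sqnr_log` and the regular expression are hypothetical and not part of this PR.

```python
import re

# Matches records like "fqn: 1.1, gemm_name: grad_input, sqnr: 33.5",
# even when pytest prints the test file path in front of the first record.
LOG_PATTERN = re.compile(
    r"fqn: (?P<fqn>[^,]+), gemm_name: (?P<gemm>\w+), sqnr: (?P<sqnr>-?\d+(?:\.\d+)?)"
)


def parse_sqnr_log(text):
    """Map (fqn, gemm_name) to the logged SQNR value for every matching line."""
    results = {}
    for line in text.splitlines():
        match = LOG_PATTERN.search(line)
        if match:
            results[(match["fqn"], match["gemm"])] = float(match["sqnr"])
    return results


# Log lines copied from the test plan output above.
log_text = """\
test/float8/test_base.py fqn: 0, gemm_name: output, sqnr: 29.125
fqn: 1.1, gemm_name: output, sqnr: 28.5
fqn: 1.1, gemm_name: grad_input, sqnr: 33.5
fqn: 1.1, gemm_name: grad_weight, sqnr: 38.5
fqn: 0, gemm_name: grad_input, sqnr: 22.5
fqn: 0, gemm_name: grad_weight, sqnr: 23.875
"""

parsed = parse_sqnr_log(log_text)
worst = min(parsed, key=parsed.get)
print(worst, parsed[worst])  # ('0', 'grad_input') 22.5
```

In this run the lowest SQNR belongs to the `grad_input` gemm of the first linear layer (FQN `0`).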

[ghstack-poisoned]

vkuzo · 2025-08-06T14:19:31Z

Stack from ghstack (oldest at bottom):

-> [not for land] debug accuracy logging for float8 training #2701

pytorch-bot · 2025-08-06T14:19:34Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2701

❌ 10 New Failures

As of commit 1bbffe7 with merge base 5d99ce4:

NEW FAILURES - The following jobs have failed:

Run 1xH100 Tests / test (H100, linux.aws.h100, --pre torch torchvision torchaudio fbgemm-gpu-genai --index-url https... / linux-job (gh)
test/dtypes/test_affine_quantized_float.py::TestAffineQuantizedFloat8Compile::test_expected_kernels_on_gpu_granularity1_torch_compile_mode_reduce-overhead
Run 4xH100 tests / test (H100, linux.aws.h100.4, --pre torch torchvision torchaudio --index-url https://download.pyt... / linux-job (gh)
RuntimeError: Command docker exec -t 78b2afbcd4e18b06de92e004958e5431f23d3e6df12e64ee4e65e3673f42ebea /exec failed with exit code 1
Run Regression Tests / test (CPU 2.5.1, linux.4xlarge, torch==2.5.1 --index-url https://download.pytorch.org/whl/cpu, cp... / linux-job (gh)
test/float8/test_base.py::TestFloat8Linear::test_debug_logging[Float8LinearRecipeName.ROWWISE_WITH_GW_HP]
Run Regression Tests / test (CPU 2.6, linux.4xlarge, torch==2.6.0 --index-url https://download.pytorch.org/whl/cpu, cpu) / linux-job (gh)
test/float8/test_base.py::TestFloat8Linear::test_debug_logging[Float8LinearRecipeName.ROWWISE_WITH_GW_HP]
Run Regression Tests / test (CPU 2.7, linux.4xlarge, torch==2.7.0 --index-url https://download.pytorch.org/whl/cpu, cpu) / linux-job (gh)
test/float8/test_base.py::TestFloat8Linear::test_debug_logging[Float8LinearRecipeName.ROWWISE_WITH_GW_HP]
Run Regression Tests / test (CUDA 2.5.1, linux.g5.12xlarge.nvidia.gpu, torch==2.5.1 --index-url https://download.pytorch... / linux-job (gh)
test/float8/test_base.py::TestFloat8Linear::test_debug_logging[Float8LinearRecipeName.ROWWISE_WITH_GW_HP]
Run Regression Tests / test (CUDA 2.6, linux.g5.12xlarge.nvidia.gpu, torch==2.6.0, cuda, 12.6) / linux-job (gh)
test/float8/test_base.py::TestFloat8Linear::test_debug_logging[Float8LinearRecipeName.ROWWISE_WITH_GW_HP]
Run Regression Tests / test (CUDA 2.7, linux.g5.12xlarge.nvidia.gpu, torch==2.7.0, cuda, 12.6) / linux-job (gh)
test/float8/test_base.py::TestFloat8Linear::test_debug_logging[Float8LinearRecipeName.ROWWISE_WITH_GW_HP]
Run Regression Tests / test-nightly (CPU Nightly, linux.4xlarge, --pre torch --index-url https://download.pytorch.org/wh... / linux-job (gh)
test/float8/test_base.py::TestFloat8Linear::test_debug_logging[Float8LinearRecipeName.ROWWISE_WITH_GW_HP]
Run Regression Tests / test-nightly (CUDA Nightly, linux.g5.12xlarge.nvidia.gpu, --pre torch --index-url https://downloa... / linux-job (gh)
test/float8/test_base.py::TestFloat8Linear::test_debug_logging[Float8LinearRecipeName.ROWWISE_WITH_GW_HP]

This comment was automatically generated by Dr. CI and updates every 15 minutes.

Summary: A lightweight logging flag to log the SQNR between the float8 gemm output and the bf16 gemm output.
ghstack-source-id: f419ea2
ghstack-comment-id: 3160382784
Pull Request resolved: #2701

facebook-github-bot · 2025-08-06T14:24:07Z

@vkuzo has imported this pull request. If you are a Meta employee, you can view this in D79724877.

Summary: A lightweight logging flag to log the SQNR between the float8 gemm output and the bf16 gemm output.
ghstack-source-id: dce0fc9
ghstack-comment-id: 3160382784
Pull Request resolved: #2701

facebook-github-bot · 2025-08-06T18:06:50Z

@vkuzo has imported this pull request. If you are a Meta employee, you can view this in D79724877.

Update

89e299f

meta-cla bot added the `CLA Signed` label on Aug 6, 2025

vkuzo added the `topic: not user facing` label on Aug 6, 2025

Update

1bbffe7
