
Add SarBp optimizations and accuracy improvements#1140

Merged
tbensonatl merged 4 commits into main from add-fltflt-fastsqrt-for-sarbp
Mar 20, 2026
Conversation

@tbensonatl
Collaborator

@tbensonatl tbensonatl commented Mar 19, 2026

This PR includes several SarBp performance optimizations for the fltflt case:

  • Adds fltflt_sqrt_fast(), which uses fewer operations at a very slight accuracy cost.
  • Adds fltflt_norm3d(), which uses fewer normalizations and calls fltflt_sqrt_fast().
  • Splits the calculation of bin so that more terms can be precomputed and stored in shared memory, reducing the inner-loop bin calculation from ~24 FLOPs to ~18.

In addition, this PR changes the speed of light constant to the SI speed of light in a vacuum. The previous value was slightly (~0.03%) lower due to slower propagation through the atmosphere. This PR also adjusts the computation of the weight w to preserve more bits. Previously, the mixed and fltflt implementations computed bin as:

bin = static_cast<loose_compute_t>(diffR * dr_inv) + bin_offset;
w = bin - ::floor(bin);

However, with large bin counts, this can leave relatively few bits of precision for w. The fltflt and mixed variants have been adjusted to preserve more accuracy at the cost of performance. All told, the fltflt version is ~18% faster due to the optimizations, but the mixed-precision version is slower due to increased use of FP64. In the future, a new option may be added to reduce the precision of the bin calculation for scenarios/ranges where that makes sense.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl tbensonatl requested a review from cliffburdick March 19, 2026 02:17
@tbensonatl tbensonatl self-assigned this Mar 19, 2026
@copy-pr-bot

copy-pr-bot bot commented Mar 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tbensonatl
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR delivers SAR back-projection (FloatFloat path) performance improvements (~18% faster) alongside accuracy improvements for all precision modes. It introduces fltflt_sqrt_fast (a ~7-FLOP, ≥44-bit-precision square root) and fltflt_norm3d (an optimized 3D Euclidean norm that defers normalizations), precomputes (-r_to_mcp * dr_inv + bin_offset) into shared memory to shrink the inner-loop bin computation, corrects the speed-of-light constant to the SI-defined value (299792458.0 m/s), and improves w precision for the mixed/FloatFloat paths by computing floor at higher precision before casting.

  • New fltflt_sqrt_fast: Uses a single FMA for the Newton-Raphson residual instead of full fltflt subtraction, reducing cost from ~35 to ~7 FLOPs at a small (≤3-bit) accuracy trade-off. Well-tested with zero, small, large, and very-large inputs.
  • New fltflt_norm3d: Exactly squares all three .hi components via fltflt_two_prod_fma, accumulates 8 low-order correction terms (including the 2·hi·lo cross terms) into a single float, performs one normalization, then applies fltflt_sqrt_fast. The dropped lo² terms are O(ε²) relative to the result — below the 44-bit precision floor — making this a correct approximation.
  • Shared memory precomputation: Moves (-r_to_mcp * dr_inv + bin_offset) out of the per-pixel inner loop into the shared-memory load phase, saving ~6 FLOPs per pulse per pixel.
  • Accuracy fix for w: Non-FloatFloat paths now compute floor(bin) in the strict precision type (double or float) before casting, preserving more fractional bits for w. FloatFloat uses a frac/adjust scheme that correctly handles bin.lo boundary crossings.
  • Speed-of-light constant: Updated from the atmospheric value (2.997291625e8) to the exact SI vacuum value (299792458.0), with a note that atmospheric correction is the caller's responsibility. SAR test tolerances are relaxed by 5% accordingly.
  • Dead code in benchmark script: parse_benchmark_output_no_type is defined but never called because fltflt_only_benchmarks is always an empty set (see inline comment).

Confidence Score: 4/5

  • Safe to merge with one minor dead-code cleanup; all algorithmic changes are well-justified, tested, and documented.
  • The numerical algorithms (fltflt_sqrt_fast, fltflt_norm3d) are mathematically sound — preconditions for fltflt_fast_two_sum are provably satisfied, the dropped lo² terms are negligibly small, and the unit tests cover zero, small, large, and mixed-sign inputs. The SAR kernel refactoring is logically correct and the boundary check change (bin_floor_int >= 0 && < num_range_bins-1) is actually more accurate than the old max_bin_f = num_range_bins - 2 check. The only non-trivial finding is dead code in the benchmark script, which does not affect runtime correctness.
  • bench/scripts/run_fltflt_benchmarks.py — contains unreachable parse_benchmark_output_no_type function.

Important Files Changed

Filename Overview
include/matx/kernels/fltflt.h Adds fltflt_sqrt_fast (~7 FLOPs, ~44-45 bit precision) and fltflt_norm3d (optimized 3D Euclidean norm via deferred normalization). Both functions are well-documented, handle the zero case, and correctly satisfy fltflt_fast_two_sum's precondition (|a| ≥ |b|).
include/matx/kernels/sar_bp.cuh FloatFloat bin computation refactored to precompute (-r_to_mcp * dr_inv + bin_offset) in shared memory (reducing inner-loop FLOPs from ~24 to ~18). New frac/adjust logic for extracting bin floor and w from a fltflt bin value is mathematically correct for any adjust value. Speed-of-light constant updated to SI-defined value (299792458.0 m/s). dr_inv is now passed as fltflt directly for the FloatFloat path.
bench/scripts/run_fltflt_benchmarks.py Adds sqrt_fast and norm3d to the benchmark list and defines parse_benchmark_output_no_type for fltflt-only benchmarks. However, fltflt_only_benchmarks is always an empty set, making parse_benchmark_output_no_type permanently unreachable dead code.
bench/00_misc/fltflt_arithmetic.cu Adds iterative_sqrt_fast_kernel and iterative_norm3d_kernel benchmarks with proper ILP dependency chains. The norm3d kernel feeds results back into dx with a float offset subtraction (explicitly chosen over fltflt subtraction) to bound register drift while maintaining a real computation chain.
test/00_misc/FloatFloatTests.cu Adds SquareRootFast and Norm3d test cases with good coverage: zero, small, large, very large, mixed sign, one/two zero components, and all-zeros inputs. The redundant double cast noted in a prior review thread has been cleaned up. The SquareRoot test has also been expanded with the same coverage set.
test/00_transform/SarBp.cu Tolerance adjustments from 1.0e-10 to 1.05e-10 for the double and FloatFloat backprojectors reflect the changed speed-of-light constant and are consistent with the PR description's accuracy analysis.
include/matx/transforms/sar_bp.h Single-line change to pass dr_inv as fltflt to the FloatFloat kernel path, replacing the previous in-kernel conversion. Straightforward and correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["SarBp kernel (FloatFloat path)"] --> B["Shared memory load phase (per pulse block)"]
    B --> C["Convert ant_pos x/y/z → fltflt"]
    B --> D["Precompute: -r_to_mcp × dr_inv + bin_offset → sh_mem.ant_pos\[ip\]\[3\]"]
    D --> E["Inner loop (per pixel × per pulse)"]
    E --> F["ComputeRangeToPixelFloatFloat()\n→ fltflt_norm3d(dx,dy,dz)"]
    F --> F1["fltflt_two_prod_fma: exact hi² for x,y,z"]
    F1 --> F2["fltflt_two_sum accumulation of hi² terms"]
    F2 --> F3["Accumulate 8 low-order corrections into float lo"]
    F3 --> F4["fltflt_fast_two_sum(t.hi, lo)"]
    F4 --> F5["fltflt_sqrt_fast(sum_sq)"]
    F5 --> G["diffR = fltflt distance to pixel"]
    G --> H["bin = fltflt_fma(diffR, dr_inv, sh_mem.ant_pos\[ip\]\[3\])\n≡ (dist − mcp) × dr_inv + bin_offset"]
    H --> I["Extract bin_floor_int and w\nvia frac/adjust scheme"]
    I --> J{"bin_floor_int in [0, num_range_bins−2]?"}
    J -- yes --> K["Interpolate range profile sample"]
    K --> L["get_reference_phase (PhaseLUT)"]
    L --> M["Accumulate into pixel"]
    J -- no --> N["Skip pulse"]

Last reviewed commit: "Remove unused max_bi..."

@tbensonatl
Collaborator Author

/build


@tbensonatl tbensonatl merged commit 86deef3 into main Mar 20, 2026
1 check passed
@tbensonatl tbensonatl deleted the add-fltflt-fastsqrt-for-sarbp branch March 26, 2026 20:53