
Add SarBp optimizations and accuracy improvements#1140

Merged
tbensonatl merged 4 commits into main from add-fltflt-fastsqrt-for-sarbp
Mar 20, 2026
Conversation

@tbensonatl
Collaborator

@tbensonatl tbensonatl commented Mar 19, 2026

This PR includes several SarBp performance optimizations for the fltflt case:

  • Adds fltflt_sqrt_fast(), which uses fewer operations at a very slight accuracy cost.
  • Adds fltflt_norm3d(), which uses fewer normalizations and calls fltflt_sqrt_fast().
  • Splits the calculation of bin so that more terms can be precomputed and stored in shared memory, reducing the inner-loop bin calculation from ~24 FLOPs to ~18.

In addition, this PR changes the speed of light constant to the SI speed of light in a vacuum. The previous value was slightly (~0.03%) lower due to slower propagation through the atmosphere. This PR also adjusts the computation of the weight w to preserve more bits. Previously, the mixed and fltflt implementations computed bin as:

bin = static_cast<loose_compute_t>(diffR * dr_inv) + bin_offset;
w = bin - ::floor(bin);

However, with large bin counts, this can leave relatively few bits of precision for w. The fltflt and mixed variants have been adjusted to preserve more accuracy at the cost of performance. All told, the fltflt version is ~18% faster due to the optimizations, but the mixed-precision version is slower due to increased use of FP64. In the future, a new option may be added to reduce the precision of the bin calculation for scenarios/ranges where that makes sense.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
@tbensonatl tbensonatl requested a review from cliffburdick March 19, 2026 02:17
@tbensonatl tbensonatl self-assigned this Mar 19, 2026
@copy-pr-bot

copy-pr-bot bot commented Mar 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@tbensonatl
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps bot commented Mar 19, 2026

Greptile Summary

This PR delivers SAR back-projection (FloatFloat path) performance improvements (~18% faster) alongside accuracy improvements for all precision modes. It introduces fltflt_sqrt_fast (a ~7-FLOP, ≥44-bit-precision square root) and fltflt_norm3d (an optimized 3D Euclidean norm that defers normalizations), precomputes (-r_to_mcp * dr_inv + bin_offset) into shared memory to shrink the inner-loop bin computation, corrects the speed-of-light constant to the SI-defined value (299792458.0 m/s), and improves w precision for the mixed/FloatFloat paths by computing floor at higher precision before casting.

  • New fltflt_sqrt_fast: Uses a single FMA for the Newton-Raphson residual instead of full fltflt subtraction, reducing cost from ~35 to ~7 FLOPs at a small (≤3-bit) accuracy trade-off. Well-tested with zero, small, large, and very-large inputs.
  • New fltflt_norm3d: Exactly squares all three .hi components via fltflt_two_prod_fma, accumulates 8 low-order correction terms (including the 2·hi·lo cross terms) into a single float, performs one normalization, then applies fltflt_sqrt_fast. The dropped lo² terms are O(ε²) relative to the result — below the 44-bit precision floor — making this a correct approximation.
  • Shared memory precomputation: Moves (-r_to_mcp * dr_inv + bin_offset) out of the per-pixel inner loop into the shared-memory load phase, saving ~6 FLOPs per pulse per pixel.
  • Accuracy fix for w: Non-FloatFloat paths now compute floor(bin) in the strict precision type (double or float) before casting, preserving more fractional bits for w. FloatFloat uses a frac/adjust scheme that correctly handles bin.lo boundary crossings.
  • Speed-of-light constant: Updated from the atmospheric value (2.997291625e8) to the exact SI vacuum value (299792458.0), with a note that atmospheric correction is the caller's responsibility. SAR test tolerances are relaxed by 5% accordingly.
  • Dead code in benchmark script: parse_benchmark_output_no_type is defined but never called because fltflt_only_benchmarks is always an empty set (see inline comment).

Confidence Score: 4/5

  • Safe to merge with one minor dead-code cleanup; all algorithmic changes are well-justified, tested, and documented.
  • The numerical algorithms (fltflt_sqrt_fast, fltflt_norm3d) are mathematically sound — preconditions for fltflt_fast_two_sum are provably satisfied, the dropped lo² terms are negligibly small, and the unit tests cover zero, small, large, and mixed-sign inputs. The SAR kernel refactoring is logically correct and the boundary check change (bin_floor_int >= 0 && < num_range_bins-1) is actually more accurate than the old max_bin_f = num_range_bins - 2 check. The only non-trivial finding is dead code in the benchmark script, which does not affect runtime correctness.
  • bench/scripts/run_fltflt_benchmarks.py — contains unreachable parse_benchmark_output_no_type function.

Important Files Changed

Filename Overview
include/matx/kernels/fltflt.h Adds fltflt_sqrt_fast (~7 FLOPs, ~44-45 bit precision) and fltflt_norm3d (optimized 3D Euclidean norm via deferred normalization). Both functions are well-documented, handle the zero case, and correctly satisfy fltflt_fast_two_sum's precondition (|a| ≥ |b|).
include/matx/kernels/sar_bp.cuh FloatFloat bin computation refactored to precompute (-r_to_mcp * dr_inv + bin_offset) in shared memory (reducing inner-loop FLOPs from ~24 to ~18). New frac/adjust logic for extracting bin floor and w from a fltflt bin value is mathematically correct for any adjust value. Speed-of-light constant updated to SI-defined value (299792458.0 m/s). dr_inv is now passed as fltflt directly for the FloatFloat path.
bench/scripts/run_fltflt_benchmarks.py Adds sqrt_fast and norm3d to the benchmark list and defines parse_benchmark_output_no_type for fltflt-only benchmarks. However, fltflt_only_benchmarks is always an empty set, making parse_benchmark_output_no_type permanently unreachable dead code.
bench/00_misc/fltflt_arithmetic.cu Adds iterative_sqrt_fast_kernel and iterative_norm3d_kernel benchmarks with proper ILP dependency chains. The norm3d kernel feeds results back into dx with a float offset subtraction (explicitly chosen over fltflt subtraction) to bound register drift while maintaining a real computation chain.
test/00_misc/FloatFloatTests.cu Adds SquareRootFast and Norm3d test cases with good coverage: zero, small, large, very large, mixed sign, one/two zero components, and all-zeros inputs. The redundant double cast noted in a prior review thread has been cleaned up. The SquareRoot test has also been expanded with the same coverage set.
test/00_transform/SarBp.cu Tolerance adjustments from 1.0e-10 to 1.05e-10 for the double and FloatFloat backprojectors reflect the changed speed-of-light constant and are consistent with the PR description's accuracy analysis.
include/matx/transforms/sar_bp.h Single-line change to pass dr_inv as fltflt to the FloatFloat kernel path, replacing the previous in-kernel conversion. Straightforward and correct.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["SarBp kernel (FloatFloat path)"] --> B["Shared memory load phase (per pulse block)"]
    B --> C["Convert ant_pos x/y/z → fltflt"]
    B --> D["Precompute: -r_to_mcp × dr_inv + bin_offset → sh_mem.ant_pos\[ip\]\[3\]"]
    D --> E["Inner loop (per pixel × per pulse)"]
    E --> F["ComputeRangeToPixelFloatFloat()\n→ fltflt_norm3d(dx,dy,dz)"]
    F --> F1["fltflt_two_prod_fma: exact hi² for x,y,z"]
    F1 --> F2["fltflt_two_sum accumulation of hi² terms"]
    F2 --> F3["Accumulate 8 low-order corrections into float lo"]
    F3 --> F4["fltflt_fast_two_sum(t.hi, lo)"]
    F4 --> F5["fltflt_sqrt_fast(sum_sq)"]
    F5 --> G["diffR = fltflt distance to pixel"]
    G --> H["bin = fltflt_fma(diffR, dr_inv, sh_mem.ant_pos\[ip\]\[3\])\n≡ (dist − mcp) × dr_inv + bin_offset"]
    H --> I["Extract bin_floor_int and w\nvia frac/adjust scheme"]
    I --> J{"bin_floor_int in [0, num_range_bins−2]?"}
    J -- yes --> K["Interpolate range profile sample"]
    K --> L["get_reference_phase (PhaseLUT)"]
    L --> M["Accumulate into pixel"]
    J -- no --> N["Skip pulse"]

Last reviewed commit: "Remove unused max_bi..."

@tbensonatl
Collaborator Author

/build


@tbensonatl tbensonatl merged commit 86deef3 into main Mar 20, 2026
1 check passed
@tbensonatl tbensonatl deleted the add-fltflt-fastsqrt-for-sarbp branch March 26, 2026 20:53