Add SarBp optimizations and accuracy improvements #1140
Merged
tbensonatl merged 4 commits into main on Mar 20, 2026
This PR includes several SarBp performance optimizations for the fltflt case:
- Adds fltflt_sqrt_fast(), which uses fewer operations at a very slight accuracy cost.
- Adds fltflt_norm3d(), which uses fewer normalizations and calls fltflt_sqrt_fast().
- Splits the calculation of bin so that more terms can be precomputed and stored in shared memory. This reduces the inner-loop bin calculation from ~24 FLOPs to ~18.

In addition, this PR adjusts the computation of the weight w to preserve more bits. Previously, the mixed and fltflt implementations computed bin as:

    bin = static_cast<loose_compute_t>(diffR * dr_inv) + bin_offset;
    w = bin - ::floor(bin)

However, with large bin counts, this can leave relatively few bits of precision for w. The fltflt and mixed variants have been adjusted to preserve more accuracy at the cost of performance. All told, the fltflt version is ~15% faster due to the optimizations, but the mixed-precision version is slower due to increased use of FP64. In the future, a new option may be added to reduce the precision of the bin calculation for scenarios/ranges where that makes sense.

Signed-off-by: Thomas Benson <tbenson@nvidia.com>
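The precision concern described above can be illustrated with a small host-side sketch. It assumes, purely for illustration, that loose_compute_t is a 32-bit float; the helper names (weight_fp32, weight_fp64, weight_resolution_fp32) are hypothetical and do not come from the PR. An FP32 value has a 24-bit significand, so a bin index near 2^14 leaves only about 9 fractional bits for w.

```cpp
#include <cmath>
#include <limits>

// Hypothetical helper: interpolation weight computed entirely in FP32,
// mirroring the "bin - floor(bin)" pattern quoted in the description.
// Assumption: loose_compute_t is float.
inline float weight_fp32(float diffR, float dr_inv, float bin_offset) {
    const float bin = diffR * dr_inv + bin_offset;
    return bin - std::floor(bin);
}

// Same computation in FP64, used here only as a reference.
inline double weight_fp64(double diffR, double dr_inv, double bin_offset) {
    const double bin = diffR * dr_inv + bin_offset;
    return bin - std::floor(bin);
}

// Spacing between adjacent FP32 values at a given bin index. This is the
// smallest step that an FP32 weight derived from that bin can resolve.
inline float weight_resolution_fp32(float bin) {
    return std::nextafter(bin, std::numeric_limits<float>::infinity()) - bin;
}
```

For example, weight_resolution_fp32(16384.0f) is 2^-9 (about 0.002), so a true fractional offset of 0.0004 at that bin index rounds away completely in FP32 while FP64 retains it.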
Collaborator (Author) commented: /build
Contributor
Greptile Summary
This PR delivers SAR back-projection (FloatFloat path) performance improvements (~18% faster) alongside accuracy improvements for all precision modes. It introduces
Confidence Score: 4/5
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["SarBp kernel (FloatFloat path)"] --> B["Shared memory load phase (per pulse block)"]
B --> C["Convert ant_pos x/y/z → fltflt"]
B --> D["Precompute: -r_to_mcp × dr_inv + bin_offset → sh_mem.ant_pos[ip][3]"]
D --> E["Inner loop (per pixel × per pulse)"]
E --> F["ComputeRangeToPixelFloatFloat()\n→ fltflt_norm3d(dx,dy,dz)"]
F --> F1["fltflt_two_prod_fma: exact hi² for x,y,z"]
F1 --> F2["fltflt_two_sum accumulation of hi² terms"]
F2 --> F3["Accumulate 8 low-order corrections into float lo"]
F3 --> F4["fltflt_fast_two_sum(t.hi, lo)"]
F4 --> F5["fltflt_sqrt_fast(sum_sq)"]
F5 --> G["diffR = fltflt distance to pixel"]
G --> H["bin = fltflt_fma(diffR, dr_inv, sh_mem.ant_pos[ip][3])\n≡ (dist − mcp) × dr_inv + bin_offset"]
H --> I["Extract bin_floor_int and w\nvia frac/adjust scheme"]
I --> J{"bin_floor_int in [0, num_range_bins−2]?"}
J -- yes --> K["Interpolate range profile sample"]
K --> L["get_reference_phase (PhaseLUT)"]
L --> M["Accumulate into pixel"]
J -- no --> N["Skip pulse"]
Last reviewed commit: "Remove unused max_bi..."
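The float-float steps named in the flowchart (two_prod_fma, two_sum, fast_two_sum, then a square root) follow well-known error-free transformations. The host-side sketch below shows one generic way to chain them into a 3D norm. The type ff32 and the ff_* functions are hypothetical stand-ins, not the PR's fltflt_* implementations. The choice of low-order terms and the single Newton-style correction in ff_sqrt are assumptions, and the code must be compiled without fast-math style optimizations.

```cpp
#include <cmath>

// Hypothetical float-float value: the represented number is hi + lo.
struct ff32 { float hi; float lo; };

// Knuth two-sum: hi + lo equals a + b exactly.
inline ff32 ff_two_sum(float a, float b) {
    const float s = a + b;
    const float bb = s - a;
    return {s, (a - (s - bb)) + (b - bb)};
}

// Dekker fast two-sum: exact only when |a| >= |b|.
inline ff32 ff_fast_two_sum(float a, float b) {
    const float s = a + b;
    return {s, b - (s - a)};
}

// Exact product via FMA: hi + lo equals a * b exactly.
inline ff32 ff_two_prod_fma(float a, float b) {
    const float p = a * b;
    return {p, std::fma(a, b, -p)};
}

// Generic float-float square root: FP32 sqrt plus one correction step.
// This is a textbook scheme, not necessarily what fltflt_sqrt_fast() does.
inline ff32 ff_sqrt(ff32 a) {
    const float s = std::sqrt(a.hi);
    if (s == 0.0f) return {0.0f, 0.0f};
    const float resid = std::fma(-s, s, a.hi) + a.lo;  // approximates a - s*s
    return ff_fast_two_sum(s, resid / (2.0f * s));
}

// sqrt(x^2 + y^2 + z^2), normalizing once before the square root.
inline ff32 ff_norm3d(ff32 x, ff32 y, ff32 z) {
    const ff32 xx = ff_two_prod_fma(x.hi, x.hi);
    const ff32 yy = ff_two_prod_fma(y.hi, y.hi);
    const ff32 zz = ff_two_prod_fma(z.hi, z.hi);
    const ff32 t1 = ff_two_sum(xx.hi, yy.hi);
    const ff32 t2 = ff_two_sum(t1.hi, zz.hi);
    // Low-order terms: product errors, summation errors, and 2*hi*lo cross
    // terms. The lo*lo terms are dropped as negligible in this sketch.
    float lo = xx.lo + yy.lo + zz.lo;
    lo += t1.lo + t2.lo;
    lo += 2.0f * (x.hi * x.lo + y.hi * y.lo + z.hi * z.lo);
    return ff_sqrt(ff_fast_two_sum(t2.hi, lo));
}
```

Deferring normalization to a single ff_fast_two_sum before the square root trades a small amount of accuracy for fewer operations, which matches the general trade-off the summary describes.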
cliffburdick approved these changes on Mar 19, 2026
This PR includes several SarBp performance optimizations for the fltflt case:
- Adds fltflt_sqrt_fast(), which uses fewer operations at a very slight accuracy cost.
- Adds fltflt_norm3d(), which uses fewer normalizations and calls fltflt_sqrt_fast().
- Splits the calculation of bin so that more terms can be precomputed and stored in shared memory. This reduces the inner-loop bin calculation from ~24 FLOPs to ~18.

In addition, this PR changes the speed of light constant to the SI speed of light in a vacuum. The previous value was slightly (~0.03%) lower due to slower propagation through the atmosphere. This PR also adjusts the computation of the weight w to preserve more bits. Previously, the mixed and fltflt implementations computed bin as:

    bin = static_cast<loose_compute_t>(diffR * dr_inv) + bin_offset;
    w = bin - ::floor(bin)

However, with large bin counts, this can leave relatively few bits of precision for w. The fltflt and mixed variants have been adjusted to preserve more accuracy at the cost of performance. All told, the fltflt version is ~18% faster due to the optimizations, but the mixed-precision version is slower due to increased use of FP64. In the future, a new option may be added to reduce the precision of the bin calculation for scenarios/ranges where that makes sense.
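The split bin calculation outlined in the review flowchart, with a per-pulse precomputed term followed by a single fused multiply-add per pixel, can be sketched on the host as follows. Plain double stands in for the fltflt type and the range profile is real-valued for brevity. BinResult, precompute_pulse_term, compute_bin, and interpolate are hypothetical names, and the floor-based extraction of w is a simplification of the frac/adjust scheme, whose details are not shown on this page.

```cpp
#include <cmath>
#include <vector>

// Hypothetical result type: integer bin, interpolation weight, range check.
struct BinResult {
    int bin_floor;
    double w;
    bool valid;
};

// Per-pulse constant: -r_to_mcp * dr_inv + bin_offset. In the kernel this
// would be computed once per pulse and kept in shared memory.
inline double precompute_pulse_term(double r_to_mcp, double dr_inv,
                                    double bin_offset) {
    return std::fma(-r_to_mcp, dr_inv, bin_offset);
}

// Per-pixel work: one FMA yields (dist - r_to_mcp) * dr_inv + bin_offset.
inline BinResult compute_bin(double dist, double dr_inv, double pulse_term,
                             int num_range_bins) {
    const double bin = std::fma(dist, dr_inv, pulse_term);
    const double fl = std::floor(bin);
    BinResult r;
    r.w = bin - fl;
    // Valid bins are [0, num_range_bins - 2] so that bin_floor + 1 is in range.
    r.valid = (fl >= 0.0) && (fl <= static_cast<double>(num_range_bins - 2));
    r.bin_floor = r.valid ? static_cast<int>(fl) : -1;
    return r;
}

// Linear interpolation between neighboring range-profile samples.
// Precondition: b.valid is true.
inline double interpolate(const std::vector<double>& profile,
                          const BinResult& b) {
    return (1.0 - b.w) * profile[b.bin_floor] + b.w * profile[b.bin_floor + 1];
}
```

Hoisting the pulse-dependent term out of the inner loop leaves one multiply-add per pixel and pulse, which is the kind of saving the description attributes to the split.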