
Conversation

@github-actions
Contributor

Summary

This PR optimizes QR decomposition using Householder reflections, achieving a 19-44% speedup for typical matrix sizes by replacing manual scalar operations with SIMD-accelerated dot products and row updates.

Performance Goal

Goal Selected: Optimize QR decomposition (Phase 3, Linear Algebra Optimizations)

Rationale: The research plan from Discussion #4 identified Phase 3 linear algebra optimizations as the next step after Phase 1 & 2 work. QR decomposition is fundamental to many operations (least squares, eigenvalues, matrix factorization) and the Householder implementation had clear opportunities for SIMD optimization in its inner loops.
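For context (standard Householder QR background, not taken from this PR's source): a Householder reflector for a unit vector v is

```latex
H = I - 2\,v v^{\top}, \qquad \lVert v \rVert = 1
```

so accumulating Q ← QH updates each row q of Q as q ← q − 2(q·v)v, which is exactly the dot-product-then-update loop that updateQ implements, with α = 2(q·v).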

Changes Made

Core Optimization

File Modified: src/FsMath/Algebra/LinearAlgebra.fs - updateQ and applyHouseholderLeft functions (lines 361-419)

Original updateQ Implementation:

for row = 0 to nQ - 1 do
    let mutable dot = 'T.Zero
    for k = 0 to v.Length - 1 do
        dot <- dot + Q.[row, i + k] * v.[k]  // Manual scalar dot
    let alpha = dot + dot
    for k = 0 to v.Length - 1 do
        Q.[row, i + k] <- Q.[row, i + k] - alpha * v.[k]  // Manual scalar update

Optimized updateQ Implementation:

for row = 0 to nQ - 1 do
    let rowOffset = row * mQ + i
    
    // SIMD-optimized dot product using SpanMath
    let dot = SpanMath.dotUnchecked (ReadOnlySpan(qData, rowOffset, vLen), ReadOnlySpan(v))
    let alpha = dot + dot
    
    // SIMD-optimized row update with vectorization
    if alpha <> 'T.Zero then  // Skip if alpha is zero
        let alphaVec = Numerics.Vector<'T>(alpha)
        let mutable k = 0
        let simdWidth = Numerics.Vector<'T>.Count
        
        while k + simdWidth <= vLen do
            let qVec = Numerics.Vector<'T>(qData, rowOffset + k)
            let vVec = Numerics.Vector<'T>(v, k)
            let result = qVec - alphaVec * vVec
            result.CopyTo(qData, rowOffset + k)
            k <- k + simdWidth
        
        // Scalar tail for remainder
        while k < vLen do
            qData.[rowOffset + k] <- qData.[rowOffset + k] - alpha * v.[k]
            k <- k + 1

applyHouseholderLeft Optimization

The applyHouseholderLeft function was also optimized with cleaner code structure, though the column-wise strided access pattern limits SIMD gains here. The main improvements come from:

  • Better code organization
  • Zero-check optimization to skip unnecessary work
  • Clear documentation of memory access patterns
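To illustrate the limitation, here is a minimal sketch (variable names data, n, i, vLen are assumed for illustration, not taken from the actual source) of why the left-application is stride-limited: in row-major storage, the elements of one column are n apart, so consecutive loads hit different cache lines and cannot be packed into a single Vector<'T> load.

```fsharp
// Sketch only: applying H = I - 2vv^T from the left reads column 'col'
// at rows i..i+vLen-1; in row-major storage those elements are n apart.
for col = i to n - 1 do
    let mutable dot = 'T.Zero
    for k = 0 to vLen - 1 do
        dot <- dot + data.[(i + k) * n + col] * v.[k]   // stride-n load
    let alpha = dot + dot
    if alpha <> 'T.Zero then
        for k = 0 to vLen - 1 do
            data.[(i + k) * n + col] <- data.[(i + k) * n + col] - alpha * v.[k]
```

A transpose-based approach (noted under Limitations below) would turn these strided accesses into contiguous ones at the cost of an extra copy.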

Approach

  1. ✅ Reviewed Phase 3 opportunities from the performance plan
  2. ✅ Selected QR decomposition as high-impact Phase 3 target
  3. ✅ Ran baseline benchmarks (5.108 μs, 72.857 μs, 302.220 μs for 10×10, 30×30, 50×50)
  4. ✅ Identified hot spots: dot products and row updates in Householder transformations
  5. ✅ Implemented SIMD optimizations using SpanMath.dotUnchecked and Vector operations
  6. ✅ Added zero-check optimization to skip unnecessary computation
  7. ✅ Built project and verified all 1381 tests pass
  8. ✅ Ran optimized benchmarks and measured 19-44% improvements

Performance Measurements

Test Environment

  • Platform: Linux Ubuntu 24.04.3 LTS (virtualized)
  • CPU: AMD EPYC 7763, 2 physical cores (4 logical) with AVX2
  • Runtime: .NET 8.0.20 with hardware SIMD acceleration
  • Job: ShortRun (3 warmup, 3 iterations, 1 launch)

Results Summary

| Matrix Size | Before (Baseline) | After (Optimized) | Improvement  | Speedup |
|-------------|-------------------|-------------------|--------------|---------|
| 10×10       | 5.108 μs          | 4.140 μs          | 19.0% faster | 1.23×   |
| 30×30       | 72.857 μs         | 44.644 μs         | 38.7% faster | 1.63×   |
| 50×50       | 302.220 μs        | 169.798 μs        | 43.8% faster | 1.78×   |

Detailed Benchmark Results

Before (Baseline):

| Method   | Mean       | Error      | StdDev    | Gen0   | Gen1   | Allocated |
|--------- |-----------:|-----------:|----------:|-------:|-------:|----------:|
| QR_10x10 |   5.108 us |  0.3796 us | 0.0208 us | 0.3586 |      - |   5.98 KB |
| QR_30x30 |  72.857 us |  4.3718 us | 0.2396 us | 2.4414 | 0.1221 |  41.29 KB |
| QR_50x50 | 302.220 us | 21.3500 us | 1.1703 us | 6.3477 | 0.4883 | 109.54 KB |

After (Optimized):

| Method   | Mean       | Error     | StdDev    | Gen0   | Gen1   | Allocated |
|--------- |-----------:|----------:|----------:|-------:|-------:|----------:|
| QR_10x10 |   4.140 us | 0.8750 us | 0.0480 us | 0.3586 |      - |   5.98 KB |
| QR_30x30 |  44.644 us | 2.7209 us | 0.1491 us | 2.5024 | 0.1221 |  41.29 KB |
| QR_50x50 | 169.798 us | 8.7992 us | 0.4823 us | 6.5918 | 0.7324 | 109.54 KB |

Key Observations

  1. Consistent Speedup: 19-44% improvement across all matrix sizes
  2. Better Scaling: Larger matrices see greater relative improvement (44% for 50×50)
  3. Memory Efficiency: Allocations unchanged - same memory footprint
  4. Low Variance: Standard deviations are small, indicating stable performance
  5. SIMD Effectiveness: Row-major access pattern in Q matrix enables full SIMD utilization

Why This Works

The optimization addresses key bottlenecks in the Householder transformations:

  1. SIMD Dot Products:

    • Before: Sequential scalar multiply-add operations
    • After: Hardware-accelerated SIMD dot product via SpanMath.dotUnchecked
    • Result: Parallel processing of multiple elements per clock cycle
  2. Vectorized Row Updates:

    • Before: Sequential scalar subtractions
    • After: SIMD vector operations processing multiple elements at once
    • Result: Better instruction-level parallelism and cache utilization
  3. Zero-Check Optimization:

    • Skip row updates when alpha is zero (saves unnecessary computation)
    • Improves performance for sparse or near-zero cases
  4. Contiguous Memory Access:

    • Row-major storage in Q matrix enables efficient SIMD operations
    • Better cache line utilization for sequential access patterns
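For reference, this is the general shape of a SIMD dot product like the one SpanMath.dotUnchecked provides (a sketch only; the actual FsMath implementation may differ): accumulate Vector<'T>.Count lanes per iteration, horizontally sum the accumulator, and finish the remainder in scalar code.

```fsharp
open System
open System.Numerics

// Illustrative SIMD dot product with scalar tail (not the FsMath source).
let dotSimd (a: ReadOnlySpan<float>) (b: ReadOnlySpan<float>) =
    let width = Vector<float>.Count
    let mutable acc = Vector<float>.Zero
    let mutable k = 0
    while k + width <= a.Length do
        // One vector multiply-accumulate covers 'width' elements
        acc <- acc + Vector<float>(a.Slice(k, width)) * Vector<float>(b.Slice(k, width))
        k <- k + width
    let mutable sum = Vector.Sum acc        // horizontal add of the lanes
    while k < a.Length do
        sum <- sum + a.[k] * b.[k]          // scalar tail
        k <- k + 1
    sum
```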

Replicating the Performance Measurements

To replicate these benchmarks:

# 1. Check out this branch
git checkout perf/optimize-qr-householder-simd-20251015-031117-d376753

# 2. Build the project
./build.sh

# 3. Run QR benchmarks with short job (~30 seconds)
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*QR*" --job short

# 4. For production-quality measurements (~2-3 minutes)
dotnet run -c Release -- --filter "*QR*"

# 5. Compare with baseline by checking out main first
git checkout main
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*QR*" --job short

Results are saved to BenchmarkDotNet.Artifacts/results/ in multiple formats.

Testing

✅ All 1381 tests pass (8 skipped)
✅ QR benchmarks execute successfully
✅ Memory allocations unchanged
✅ Performance improves 19-44% across all sizes
✅ Correctness verified across all test cases
✅ Build completes with only pre-existing warnings

Implementation Details

Optimization Techniques Applied

  1. SIMD Dot Product: Use SpanMath.dotUnchecked for hardware-accelerated dot products
  2. Vectorized Row Operations: Process multiple elements per iteration using Numerics.Vector<T>
  3. Zero-Check Optimization: Skip unnecessary computation when alpha is zero
  4. Direct Memory Access: Work with raw array data and offsets for better performance
  5. Tail Handling: Scalar fallback for non-SIMD-aligned remainders

Code Quality

  • Clear separation of SIMD and scalar code paths
  • Comprehensive documentation explaining optimizations
  • Preserves existing error handling and validation
  • Follows existing code style and patterns
  • Maintains backward compatibility
  • No breaking changes to API

Limitations and Future Work

While this optimization provides solid improvements, there are additional opportunities:

  1. Column Operations in applyHouseholderLeft: Strided access pattern limits SIMD gains - could explore transpose-based approaches
  2. Larger Matrices: Could add blocked/tiled approaches for matrices >100×100
  3. Parallel QR: Very large matrices (≥200×200) could benefit from parallelization
  4. Alternative Algorithms: Modified Gram-Schmidt (already implemented) might be faster for some cases
  5. FMA Instructions: Could leverage Fused Multiply-Add for even better performance
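As a sketch of the FMA idea (hypothetical helper, not part of this PR): the row update q ← q − αv can be rewritten as q + (−α)v so each lane becomes a single fused multiply-add via System.Runtime.Intrinsics.X86.Fma, with a scalar tail that also covers hardware without FMA support.

```fsharp
open System
open System.Runtime.Intrinsics
open System.Runtime.Intrinsics.X86

// Hypothetical sketch: q <- q - alpha*v as one fused multiply-add per lane.
let updateRowFma (q: float[]) (offset: int) (v: float[]) (alpha: float) =
    let width = Vector256<float>.Count          // 4 doubles with AVX
    let mutable k = 0
    if Fma.IsSupported then
        let negAlpha = Vector256.Create(-alpha)
        while k + width <= v.Length do
            let qv = Vector256.Create(ReadOnlySpan(q, offset + k, width))
            let vv = Vector256.Create(ReadOnlySpan(v, k, width))
            // (vv * negAlpha) + qv in a single instruction
            Fma.MultiplyAdd(vv, negAlpha, qv).CopyTo(Span(q, offset + k, width))
            k <- k + width
    // scalar tail (also the fallback when FMA is unavailable)
    while k < v.Length do
        q.[offset + k] <- q.[offset + k] - alpha * v.[k]
        k <- k + 1
```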

Next Steps

Based on the performance plan from Discussion #4, remaining Phase 3 work includes:

  1. ✅ QR decomposition optimization (this PR)
  2. ⚠️ Other linear algebra optimizations - LU, Cholesky, EVD/SVD
  3. ⚠️ Parallel implementations - For large matrices
  4. ⚠️ Specialized fast paths - Small matrix (2×2, 3×3, 4×4) optimizations

Related Issues/Discussions


Bash Commands Used

# Research and setup
cd /home/runner/work/FsMath/FsMath
git status
git checkout -b perf/optimize-qr-householder-simd-20251015-031117-d376753

# Baseline benchmarking
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*QR*" --job short > baseline.txt

# Development
# (edited LinearAlgebra.fs - updateQ and applyHouseholderLeft functions)

# Build and test
cd /home/runner/work/FsMath/FsMath
./build.sh
dotnet test tests/FsMath.Tests/FsMath.Tests.fsproj -c Release --no-build

# Performance verification
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*QR*" --job short > optimized.txt

# Commit and create PR
cd /home/runner/work/FsMath/FsMath
git add src/FsMath/Algebra/LinearAlgebra.fs
git commit -m "Optimize QR decomposition with SIMD operations..."

Web Searches Performed

None. The optimization was based on the research plan from Discussion #4 and analysis of the existing Householder implementation.


🤖 Generated with Claude Code

AI generated by Daily Perf Improver


…rmations

- Replace manual scalar dot products with SpanMath.dotUnchecked for SIMD acceleration
- Optimize row updates in updateQ with vectorized operations
- Add zero-check optimization to skip unnecessary work
- Achieve 19-44% speedup across matrix sizes (1.23-1.78× faster)

Performance improvements:
- 10×10: 5.108 μs → 4.140 μs (19.0% faster)
- 30×30: 72.857 μs → 44.644 μs (38.7% faster)
- 50×50: 302.220 μs → 169.798 μs (43.8% faster)

All 1381 tests pass with no change in allocations.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@dsyme
Member

dsyme commented Oct 22, 2025

@kMutagene @muehlhaus With these, the procedure should be

  1. Do a high-level scan to see if we even remotely want it. If not, close it out
  2. Carefully check that the AI took measurements. If it didn't, ignore the PR.
  3. Replicate the perf improvement measurements locally and assess whether they're credible and important
  4. Proceed to careful code and test review
  5. If it's "close", then complete the work (adding more tests, for example), possibly using a coding agent locally.

@dsyme
Member

dsyme commented Oct 22, 2025

One rule: don't believe any "estimates" the coding agents give!!!

@github-actions
Contributor Author

📊 Code Coverage Report

Summary

Code Coverage

| Package | Line Rate          | Branch Rate        | Complexity | Health |
|---------|--------------------|--------------------|------------|--------|
| FsMath  | 78%                | 51%                | 4399       |        |
| Summary | 78% (3174 / 4068)  | 51% (4436 / 8662)  | 8798       |        |

📈 Coverage Analysis

🟡 Good Coverage: your code coverage is above 60%. Consider adding more tests to reach 80%.

🎯 Coverage Goals

  • Target: 80% line coverage
  • Minimum: 60% line coverage
  • Current: 78% line coverage

📋 What These Numbers Mean

  • Line Rate: Percentage of code lines that were executed during tests
  • Branch Rate: Percentage of code branches (if/else, switch cases) that were tested
  • Health: Overall assessment combining line and branch coverage

🔗 Detailed Reports

📋 Download Full Coverage Report - Check the 'coverage-report' artifact for detailed HTML coverage report


Coverage report generated on 2025-10-28 at 12:09:31 UTC
