
Conversation

@github-actions
Contributor

Summary

This PR optimizes QR decomposition using Householder reflections, achieving a 19-44% speedup for typical matrix sizes by replacing manual scalar operations with SIMD-accelerated dot products and row updates.

Performance Goal

Goal Selected: Optimize QR decomposition (Phase 3, Linear Algebra Optimizations)

Rationale: The research plan from Discussion #4 identified Phase 3 linear algebra optimizations as the next step after Phase 1 & 2 work. QR decomposition is fundamental to many operations (least squares, eigenvalues, matrix factorization) and the Householder implementation had clear opportunities for SIMD optimization in its inner loops.
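For context (standard Householder QR background, not taken from this PR's source): a Householder reflector for a unit vector v is

```latex
H = I - 2\,v v^{\top}, \qquad \lVert v \rVert = 1
```

so accumulating Q ← QH updates each row q of Q as q ← q − 2(q·v)v, which is exactly the dot-product-then-update loop that updateQ implements, with α = 2(q·v).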

Changes Made

Core Optimization

File Modified: src/FsMath/Algebra/LinearAlgebra.fs - updateQ and applyHouseholderLeft functions (lines 361-419)

Original updateQ Implementation:

for row = 0 to nQ - 1 do
    let mutable dot = 'T.Zero
    for k = 0 to v.Length - 1 do
        dot <- dot + Q.[row, i + k] * v.[k]  // Manual scalar dot
    let alpha = dot + dot
    for k = 0 to v.Length - 1 do
        Q.[row, i + k] <- Q.[row, i + k] - alpha * v.[k]  // Manual scalar update

Optimized updateQ Implementation:

for row = 0 to nQ - 1 do
    let rowOffset = row * mQ + i
    
    // SIMD-optimized dot product using SpanMath
    let dot = SpanMath.dotUnchecked (ReadOnlySpan(qData, rowOffset, vLen), ReadOnlySpan(v))
    let alpha = dot + dot
    
    // SIMD-optimized row update with vectorization
    if alpha <> 'T.Zero then  // Skip if alpha is zero
        let alphaVec = Numerics.Vector<'T>(alpha)
        let mutable k = 0
        let simdWidth = Numerics.Vector<'T>.Count
        
        while k + simdWidth <= vLen do
            let qVec = Numerics.Vector<'T>(qData, rowOffset + k)
            let vVec = Numerics.Vector<'T>(v, k)
            let result = qVec - alphaVec * vVec
            result.CopyTo(qData, rowOffset + k)
            k <- k + simdWidth
        
        // Scalar tail for remainder
        while k < vLen do
            qData.[rowOffset + k] <- qData.[rowOffset + k] - alpha * v.[k]
            k <- k + 1

applyHouseholderLeft Optimization

The applyHouseholderLeft function was also optimized with cleaner code structure, though the column-wise strided access pattern limits SIMD gains here. The main improvements come from:

  • Better code organization
  • Zero-check optimization to skip unnecessary work
  • Clear documentation of memory access patterns
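To illustrate the limitation, here is a minimal sketch (variable names data, n, i, vLen are assumed for illustration, not taken from the actual source) of why the left-application is stride-limited: in row-major storage, the elements of one column are n apart, so consecutive loads hit different cache lines and cannot be packed into a single Vector<'T> load.

```fsharp
// Sketch only: applying H = I - 2vv^T from the left reads column 'col'
// at rows i..i+vLen-1; in row-major storage those elements are n apart.
for col = i to n - 1 do
    let mutable dot = 'T.Zero
    for k = 0 to vLen - 1 do
        dot <- dot + data.[(i + k) * n + col] * v.[k]   // stride-n load
    let alpha = dot + dot
    if alpha <> 'T.Zero then
        for k = 0 to vLen - 1 do
            data.[(i + k) * n + col] <- data.[(i + k) * n + col] - alpha * v.[k]
```

A transpose-based approach (noted under Limitations below) would turn these strided accesses into contiguous ones at the cost of an extra copy.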

Approach

  1. ✅ Reviewed Phase 3 opportunities from the performance plan
  2. ✅ Selected QR decomposition as high-impact Phase 3 target
  3. ✅ Ran baseline benchmarks (5.108 μs, 72.857 μs, 302.220 μs for 10×10, 30×30, 50×50)
  4. ✅ Identified hot spots: dot products and row updates in Householder transformations
  5. ✅ Implemented SIMD optimizations using SpanMath.dotUnchecked and Vector operations
  6. ✅ Added zero-check optimization to skip unnecessary computation
  7. ✅ Built project and verified all 1381 tests pass
  8. ✅ Ran optimized benchmarks and measured 19-44% improvements

Performance Measurements

Test Environment

  • Platform: Linux Ubuntu 24.04.3 LTS (virtualized)
  • CPU: AMD EPYC 7763, 2 physical cores (4 logical) with AVX2
  • Runtime: .NET 8.0.20 with hardware SIMD acceleration
  • Job: ShortRun (3 warmup, 3 iterations, 1 launch)

Results Summary

| Matrix Size | Before (Baseline) | After (Optimized) | Improvement  | Speedup |
|-------------|-------------------|-------------------|--------------|---------|
| 10×10       | 5.108 μs          | 4.140 μs          | 19.0% faster | 1.23×   |
| 30×30       | 72.857 μs         | 44.644 μs         | 38.7% faster | 1.63×   |
| 50×50       | 302.220 μs        | 169.798 μs        | 43.8% faster | 1.78×   |

Detailed Benchmark Results

Before (Baseline):

| Method   | Mean       | Error      | StdDev    | Gen0   | Gen1   | Allocated |
|--------- |-----------:|-----------:|----------:|-------:|-------:|----------:|
| QR_10x10 |   5.108 us |  0.3796 us | 0.0208 us | 0.3586 |      - |   5.98 KB |
| QR_30x30 |  72.857 us |  4.3718 us | 0.2396 us | 2.4414 | 0.1221 |  41.29 KB |
| QR_50x50 | 302.220 us | 21.3500 us | 1.1703 us | 6.3477 | 0.4883 | 109.54 KB |

After (Optimized):

| Method   | Mean       | Error     | StdDev    | Gen0   | Gen1   | Allocated |
|--------- |-----------:|----------:|----------:|-------:|-------:|----------:|
| QR_10x10 |   4.140 us | 0.8750 us | 0.0480 us | 0.3586 |      - |   5.98 KB |
| QR_30x30 |  44.644 us | 2.7209 us | 0.1491 us | 2.5024 | 0.1221 |  41.29 KB |
| QR_50x50 | 169.798 us | 8.7992 us | 0.4823 us | 6.5918 | 0.7324 | 109.54 KB |

Key Observations

  1. Consistent Speedup: 19-44% improvement across all matrix sizes
  2. Better Scaling: Larger matrices see greater relative improvement (44% for 50×50)
  3. Memory Efficiency: Allocations unchanged - same memory footprint
  4. Low Variance: Standard deviations are small, indicating stable performance
  5. SIMD Effectiveness: Row-major access pattern in Q matrix enables full SIMD utilization

Why This Works

The optimization addresses key bottlenecks in the Householder transformations:

  1. SIMD Dot Products:

    • Before: Sequential scalar multiply-add operations
    • After: Hardware-accelerated SIMD dot product via SpanMath.dotUnchecked
    • Result: Parallel processing of multiple elements per clock cycle
  2. Vectorized Row Updates:

    • Before: Sequential scalar subtractions
    • After: SIMD vector operations processing multiple elements at once
    • Result: Better instruction-level parallelism and cache utilization
  3. Zero-Check Optimization:

    • Skip row updates when alpha is zero (saves unnecessary computation)
    • Improves performance for sparse or near-zero cases
  4. Contiguous Memory Access:

    • Row-major storage in Q matrix enables efficient SIMD operations
    • Better cache line utilization for sequential access patterns
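For reference, this is the general shape of a SIMD dot product like the one SpanMath.dotUnchecked provides (a sketch only; the actual FsMath implementation may differ): accumulate Vector<'T>.Count lanes per iteration, horizontally sum the accumulator, and finish the remainder in scalar code.

```fsharp
open System
open System.Numerics

// Illustrative SIMD dot product with scalar tail (not the FsMath source).
let dotSimd (a: ReadOnlySpan<float>) (b: ReadOnlySpan<float>) =
    let width = Vector<float>.Count
    let mutable acc = Vector<float>.Zero
    let mutable k = 0
    while k + width <= a.Length do
        // One vector multiply-accumulate covers 'width' elements
        acc <- acc + Vector<float>(a.Slice(k, width)) * Vector<float>(b.Slice(k, width))
        k <- k + width
    let mutable sum = Vector.Sum acc        // horizontal add of the lanes
    while k < a.Length do
        sum <- sum + a.[k] * b.[k]          // scalar tail
        k <- k + 1
    sum
```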

Replicating the Performance Measurements

To replicate these benchmarks:

# 1. Check out this branch
git checkout perf/optimize-qr-householder-simd-20251015-031117-d376753

# 2. Build the project
./build.sh

# 3. Run QR benchmarks with short job (~30 seconds)
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*QR*" --job short

# 4. For production-quality measurements (~2-3 minutes)
dotnet run -c Release -- --filter "*QR*"

# 5. Compare with baseline by checking out main first
git checkout main
./build.sh
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*QR*" --job short

Results are saved to BenchmarkDotNet.Artifacts/results/ in multiple formats.

Testing

✅ All 1381 tests pass (8 skipped)
✅ QR benchmarks execute successfully
✅ Memory allocations unchanged
✅ Performance improves 19-44% across all sizes
✅ Correctness verified across all test cases
✅ Build completes with only pre-existing warnings

Implementation Details

Optimization Techniques Applied

  1. SIMD Dot Product: Use SpanMath.dotUnchecked for hardware-accelerated dot products
  2. Vectorized Row Operations: Process multiple elements per iteration using Numerics.Vector<T>
  3. Zero-Check Optimization: Skip unnecessary computation when alpha is zero
  4. Direct Memory Access: Work with raw array data and offsets for better performance
  5. Tail Handling: Scalar fallback for non-SIMD-aligned remainders

Code Quality

  • Clear separation of SIMD and scalar code paths
  • Comprehensive documentation explaining optimizations
  • Preserves existing error handling and validation
  • Follows existing code style and patterns
  • Maintains backward compatibility
  • No breaking changes to API

Limitations and Future Work

While this optimization provides solid improvements, there are additional opportunities:

  1. Column Operations in applyHouseholderLeft: Strided access pattern limits SIMD gains - could explore transpose-based approaches
  2. Larger Matrices: Could add blocked/tiled approaches for matrices >100×100
  3. Parallel QR: Very large matrices (≥200×200) could benefit from parallelization
  4. Alternative Algorithms: Modified Gram-Schmidt (already implemented) might be faster for some cases
  5. FMA Instructions: Could leverage Fused Multiply-Add for even better performance
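As a sketch of the FMA idea (hypothetical helper, not part of this PR): the row update q ← q − αv can be rewritten as q + (−α)v so each lane becomes a single fused multiply-add via System.Runtime.Intrinsics.X86.Fma, with a scalar tail that also covers hardware without FMA support.

```fsharp
open System
open System.Runtime.Intrinsics
open System.Runtime.Intrinsics.X86

// Hypothetical sketch: q <- q - alpha*v as one fused multiply-add per lane.
let updateRowFma (q: float[]) (offset: int) (v: float[]) (alpha: float) =
    let width = Vector256<float>.Count          // 4 doubles with AVX
    let mutable k = 0
    if Fma.IsSupported then
        let negAlpha = Vector256.Create(-alpha)
        while k + width <= v.Length do
            let qv = Vector256.Create(ReadOnlySpan(q, offset + k, width))
            let vv = Vector256.Create(ReadOnlySpan(v, k, width))
            // (vv * negAlpha) + qv in a single instruction
            Fma.MultiplyAdd(vv, negAlpha, qv).CopyTo(Span(q, offset + k, width))
            k <- k + width
    // scalar tail (also the fallback when FMA is unavailable)
    while k < v.Length do
        q.[offset + k] <- q.[offset + k] - alpha * v.[k]
        k <- k + 1
```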

Next Steps

Based on the performance plan from Discussion #4, remaining Phase 3 work includes:

  1. ✅ QR decomposition optimization (this PR)
  2. ⚠️ Other linear algebra optimizations - LU, Cholesky, EVD/SVD
  3. ⚠️ Parallel implementations - For large matrices
  4. ⚠️ Specialized fast paths - Small matrix (2×2, 3×3, 4×4) optimizations

Related Issues/Discussions


Bash Commands Used

# Research and setup
cd /home/runner/work/FsMath/FsMath
git status
git checkout -b perf/optimize-qr-householder-simd-20251015-031117-d376753

# Baseline benchmarking
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*QR*" --job short > baseline.txt

# Development
# (edited LinearAlgebra.fs - updateQ and applyHouseholderLeft functions)

# Build and test
cd /home/runner/work/FsMath/FsMath
./build.sh
dotnet test tests/FsMath.Tests/FsMath.Tests.fsproj -c Release --no-build

# Performance verification
cd benchmarks/FsMath.Benchmarks
dotnet run -c Release -- --filter "*QR*" --job short > optimized.txt

# Commit and create PR
cd /home/runner/work/FsMath/FsMath
git add src/FsMath/Algebra/LinearAlgebra.fs
git commit -m "Optimize QR decomposition with SIMD operations..."

Web Searches Performed

None. The optimization was based on the research plan from Discussion #4 and analysis of the existing Householder implementation.


🤖 Generated with Claude Code

AI generated by Daily Perf Improver


…rmations

- Replace manual scalar dot products with SpanMath.dotUnchecked for SIMD acceleration
- Optimize row updates in updateQ with vectorized operations
- Add zero-check optimization to skip unnecessary work
- Achieve 19-44% speedup across matrix sizes (1.23-1.78× faster)

Performance improvements:
- 10×10: 5.108 μs → 4.140 μs (19.0% faster)
- 30×30: 72.857 μs → 44.644 μs (38.7% faster)
- 50×50: 302.220 μs → 169.798 μs (43.8% faster)

All 1381 tests pass with no change in allocations.

🤖 Generated with Claude Code (https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@dsyme
Member

dsyme commented Oct 22, 2025

@kMutagene @muehlhaus With these, the procedure should be

  1. Do a high-level scan to see if we even remotely want it. If not, close it out
  2. Carefully check that the AI took measurements. If it didn't, ignore the PR.
  3. Replicate the perf improvement measurements locally and assess whether they're credible and important
  4. Proceed to careful code and test review
  5. If it's "close", then complete the work (adding more tests, for example), possibly using a coding agent locally.

@dsyme
Member

dsyme commented Oct 22, 2025

One rule: don't believe any "estimates" the coding agents give!!!

@github-actions
Contributor Author

📊 Code Coverage Report

Summary

Code Coverage

| Package | Line Rate          | Branch Rate        | Complexity | Health |
|---------|--------------------|--------------------|------------|--------|
| FsMath  | 78%                | 51%                | 4399       |        |
| Summary | 78% (3174 / 4068)  | 51% (4436 / 8662)  | 8798       |        |

📈 Coverage Analysis

🟡 Good Coverage: your code coverage is above 60%. Consider adding more tests to reach 80%.

🎯 Coverage Goals

  • Target: 80% line coverage
  • Minimum: 60% line coverage
  • Current: 78% line coverage

📋 What These Numbers Mean

  • Line Rate: Percentage of code lines that were executed during tests
  • Branch Rate: Percentage of code branches (if/else, switch cases) that were tested
  • Health: Overall assessment combining line and branch coverage

🔗 Detailed Reports

📋 Download Full Coverage Report - Check the 'coverage-report' artifact for detailed HTML coverage report


Coverage report generated on 2025-10-28 at 12:09:31 UTC
