Optimize extract_core_parallel to reduce memory usage through streaming BC calculations and worker pool management #59

@jibarozzo

Description

Problem Statement

The current extract_core_parallel function has significant memory issues that cause OOM (Out of Memory) kills on HPC systems:

  1. Each worker loads a full copy of the dataset into memory
  2. Bray-Curtis (BC) calculations create large intermediate matrices
  3. Total memory usage therefore scales linearly with the number of workers, since each worker holds its own copy

These issues prevent successful execution on large datasets, even with substantial memory allocation (e.g., 256 GB RAM with 32 cores).

Proposed Solutions

Streaming/Incremental BC Calculations

Current Memory-Heavy Approach:

# Calculate BC for each OTU addition by rebuilding the entire matrix
current_matrix <- rbind(start_matrix, t(otu[otu_ranked$otu[i], ]))  # matrix grows every iteration
current_bc <- calculate_bc(current_matrix, nReads)  # full recalculation from scratch

Proposed Incremental Approach:

  • Maintain running sums of the BC numerator components instead of storing growing matrices
  • For each new OTU, add only its incremental contribution to the existing totals
  • Calculate BC from the running totals as numerator_sum / (2 * nReads); the denominator stays constant because every sample sums to nReads reads (see the sketch below)
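
A minimal sketch of this idea, not the repository's implementation. It assumes `otu` is a numeric matrix (OTUs × samples), every sample sums to nReads, and `otu_ranked$otu` holds row identifiers in ranked order:

bc_incremental <- function(otu, otu_ranked, nReads) {
  pair_idx <- utils::combn(ncol(otu), 2)   # 2 x n_pairs matrix of sample-pair indices
  num_sum <- numeric(ncol(pair_idx))       # running sum(|x - y|) per pair
  bc_per_step <- numeric(nrow(otu_ranked))
  for (i in seq_len(nrow(otu_ranked))) {
    counts <- otu[otu_ranked$otu[i], ]     # this OTU's counts across samples
    # Add only this OTU's incremental contribution to each pair's numerator;
    # memory stays O(#pairs) no matter how many OTUs have been added
    num_sum <- num_sum +
      abs(counts[pair_idx[1, ]] - counts[pair_idx[2, ]])
    bc_per_step[i] <- mean(num_sum / (2 * nReads))  # BC from running totals
  }
  bc_per_step
}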

Benefits:

  • Memory per OTU addition stays constant instead of growing with the accumulated matrix
  • No matrix accumulation or redundant full recalculations
  • Preserves the mathematical accuracy of sequential OTU ranking

Worker Pool Management

Current Uncontrolled Parallelism:

parallel_results <- parallel::mclapply(
    2:nrow(otu_ranked),  # Could be thousands of OTUs
    bc_rank_task,
    mc.cores = ncores    # Could be 31+ cores = 31+ simultaneous workers
)

Proposed Controlled Approach:

  • Add a max_workers parameter to cap concurrent workers regardless of available cores
  • Use min(max_workers, ncores) to keep memory from scaling past the cap (see the sketch below)
  • Implement memory-aware worker scaling
  • Add periodic garbage collection during processing
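
A sketch of the capping logic under these proposals; the wrapper name run_bc_ranking and the GC interval of 500 iterations are illustrative assumptions, not existing code:

run_bc_ranking <- function(otu_ranked, bc_rank_task, ncores, max_workers = 8L) {
  n_workers <- min(max_workers, ncores)       # never exceed the configured cap
  parallel::mclapply(
    2:nrow(otu_ranked),
    function(i) {
      result <- bc_rank_task(i)
      if (i %% 500 == 0) gc(verbose = FALSE)  # periodic GC inside the worker
      result
    },
    mc.cores = n_workers
  )
}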

Benefits:

  • Controlled memory usage through limited active workers
  • Prevents OOM conditions while maintaining parallelization benefits
  • Scalable across different system configurations

Implementation Requirements

Sequential Dependency Preservation

  • Critical: The OTU ranking algorithm has sequential dependencies; each OTU's contribution depends on all previously added OTUs (made explicit below)
  • No chunking/batching: OTUs cannot be processed independently, as this breaks the ranking mathematics
  • Solution: Maintain sequential OTU addition while optimizing the BC calculations and worker management
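
To make the dependency explicit: for a single sample pair with counts x_j, y_j for the j-th ranked OTU, the BC value after the k-th addition is

    BC_k = ( |x_1 - y_1| + ... + |x_k - y_k| ) / (2 * nReads)

Every term j <= k appears in BC_k, so the k-th value cannot be computed without all earlier additions; only the per-step bookkeeping, not the addition order, can be optimized.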

Integration Points

  • Modify extract_core_parallel() in R/functions/extract_core_parallel.R
  • Maintain compatibility with existing calculate_bc() function signature
  • Preserve identical output format and mathematical results
  • Add new parameters: max_workers and optional memory monitoring (sketched below)
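
One way the optional memory monitoring could look, using base R's gc() statistics; the function name check_memory and the threshold handling are illustrative assumptions:

check_memory <- function(limit_gb) {
  # Sum the "(Mb)" used column over Ncells and Vcells, convert to GB
  used_gb <- sum(gc(verbose = FALSE)[, 2]) / 1024
  if (used_gb > limit_gb)
    warning(sprintf("R heap at %.1f GB exceeds %.1f GB limit", used_gb, limit_gb))
  invisible(used_gb)
}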

Expected Outcomes

  • Successful execution on large datasets without OOM kills
  • Reduced memory footprint for BC calculations
  • Maintained mathematical accuracy and algorithm correctness
  • Improved scalability across different hardware configurations

Testing Requirements

  • Verify identical results against the original implementation (see the test sketch below)
  • Test with various dataset sizes and worker configurations
  • Validate memory usage reduction through profiling
  • Ensure compatibility with existing downstream analyses
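
A sketch of the equivalence check using testthat; the reference function name, argument list, and fixtures (otu, otu_ranked, nReads) are placeholders for whatever the current implementation exposes:

testthat::test_that("optimized version reproduces original BC ranking", {
  reference <- extract_core_parallel_original(otu, otu_ranked, nReads)  # placeholder name
  optimized <- extract_core_parallel(otu, otu_ranked, nReads, max_workers = 4)
  testthat::expect_equal(optimized, reference, tolerance = 1e-12)
})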

This optimization is essential for processing the Inter-BRC core microbiome datasets that currently fail due to memory constraints.


Labels

Normal Priority · bug · enhancement
