
Performance improvements by claude #560

Merged
ypriverol merged 2 commits into main from dev
Jan 18, 2026

Conversation


@ypriverol ypriverol commented Jan 18, 2026

Pull Request

Description

Performance optimizations across the histogram, mzML reading, and MaxQuant utility modules: batch histogram updates and vectorized NumPy/pandas operations replace per-row Python loops.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Test addition/update
  • Dependency update

Summary by CodeRabbit

  • Refactor
    • Optimized mass spectrometry data processing pipelines to improve performance through batch operations and vectorized calculations.
    • Enhanced histogram aggregation and distribution computations for more efficient frequency calculations.
    • Streamlined data filtering and aggregation operations across multiple modules for better overall responsiveness.


claude and others added 2 commits January 18, 2026 17:11
Key optimizations:
- Add batch processing method (add_values_batch) to Histogram class using
  numpy vectorized operations (searchsorted for bin assignment)
- Remove redundant sorting in histogram.add_value() for range mode
- Combine multiple groupby calls into single aggregation in ms_io.get_ms_qc_info()
- Vectorize drop_empty_row() using pandas DataFrame operations
- Replace apply(lambda) with np.where in evidence_oversampling()
- Optimize evidence_peptide_intensity() with vectorized fillna/where
- Optimize evidence_calibrated_mass_error() to derive frequency from counts
- Fix duplicate rt_rl_compute() call in evidence_peak_width_rt()
- Batch histogram updates in MzMLReader instead of per-spectrum calls
- Use set for faster spectrum ID lookups in MzMLReader
- Optimize pg_intensity_distr() with numpy array operations

These changes improve performance by reducing Python-level iteration
and leveraging numpy/pandas vectorized operations.
…2W3NR

Performance improvements and test verification
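The `apply(lambda)` to `np.where` swap that the commit message describes can be sketched as follows. The DataFrame and column names here are invented for illustration; they are not pmultiqc's actual evidence columns.

```python
import numpy as np
import pandas as pd

# Toy data standing in for an evidence table
df = pd.DataFrame({"msms_count": [0, 1, 3, 2]})

# Per-row Python callable (slow): one Python function call per row
slow = df["msms_count"].apply(lambda n: ">=3" if n >= 3 else str(n))

# Vectorized equivalent (fast): a single boolean mask over the whole column
fast = np.where(df["msms_count"] >= 3, ">=3", df["msms_count"].astype(str))

assert (slow.to_numpy() == fast).all()
```

Both forms produce the same labels; the vectorized version avoids the per-row Python-level function-call overhead, which is where the speedup comes from.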

coderabbitai bot commented Jan 18, 2026

📝 Walkthrough

Walkthrough

This pull request optimizes histogram data processing and utility functions by introducing batch processing methods and replacing iterative operations with vectorized computations. Changes include adding a new add_values_batch method to the Histogram class, refactoring spectrum data collection in mzML processing to defer plotting until batch completion, consolidating DataFrame and groupby operations, and replacing per-row/per-element loops with NumPy-based vectorized approaches across multiple utility modules.
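The defer-then-flush pattern described in the walkthrough can be sketched with a stdlib stand-in. `Counter` here is a toy substitute for the PR's `Histogram` class, and the spectra are invented sample data.

```python
from collections import Counter

# Invented sample spectra, standing in for mzML parsing output
spectra = [{"charge": 2}, {"charge": 3}, {"charge": 2}, {"charge": 4}]

counts = Counter()

# Old shape: counts.update([s["charge"]]) inside the loop, one call per spectrum.
# New shape: collect values first, then flush once after the whole file is read.
charges = [s["charge"] for s in spectra]
counts.update(charges)   # single batched update per file

print(dict(counts))      # {2: 2, 3: 1, 4: 1}
```

The totals are identical either way; deferring the update just replaces N small calls with one batch call, which matters when the per-call work (bin lookup, plotting) is non-trivial.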

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Histogram batching infrastructure**<br>`pmultiqc/modules/common/histogram.py` | Added `add_values_batch()` method supporting batch insertion for both frequency and range plot categories; removed redundant sort operation; imported numpy dependency |
| **Batch histogram integration**<br>`pmultiqc/modules/common/ms/mzml.py`, `pmultiqc/modules/common/ms_io.py` | Refactored spectrum data collection to accumulate values in batches (charge states, base peak intensities, MS2 peaks) separated by DIA mode and identification status; deferred histogram updates until after processing all spectra in a file; consolidated per-element histogram additions into batch operations |
| **Vectorization optimizations**<br>`pmultiqc/modules/maxquant/maxquant_utils.py` | Replaced per-row apply operations with vectorized NumPy operations and fillna logic for intensity distribution, peptide intensity, peak width, and oversampling calculations |
| **Utility function optimizations**<br>`pmultiqc/modules/common/common_utils.py`, `pmultiqc/modules/common/file_utils.py` | Refactored frequency calculation to use precomputed counts; replaced iterative mask construction with single vectorized boolean operation on DataFrame subset |
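The "frequency from precomputed counts" refactor can be sketched in a few lines; the values below are toy data, not the actual `common_utils` code.

```python
import numpy as np

# Toy observations
values = np.array([0.1, 0.1, 0.2, 0.3, 0.3, 0.3])

# One pass over the data yields both unique values and their counts...
uniq, counts = np.unique(values, return_counts=True)

# ...so relative frequencies are derived from counts, with no recount per value
freq = counts / counts.sum()
```

Deriving frequencies from the already-computed counts avoids a second traversal of the data for every unique value.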

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Dev #524 — Modifies related frequency/count logic in evidence_calibrated_mass_error and stats.py, sharing optimization patterns with this PR's common_utils changes

Suggested labels

Review effort 3/5

Suggested reviewers

  • daichengxin

Poem

🐰 Batches hop and vectorize with glee,
No more loops—efficiency!
Histograms pile high, then plot with grace,
NumPy speeds up every race. ✨

🚥 Pre-merge checks: 1 passed, 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title is vague and generic, using the phrase 'Performance improvements by claude' without conveying specific details about what aspects of the codebase were optimized. | Provide a more descriptive title that highlights the main optimization targets (e.g., 'Optimize histogram batch processing and vectorize MS data aggregation') to help reviewers quickly understand the scope of changes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pmultiqc/modules/common/histogram.py`:
- Around line 148-213: The batch-mode threshold handling is inconsistent with
add_value: in histogram.py (symbols: plot_category, out_threshold,
below_threshold, above_count, bin_indices, self.breaks, self.bins) batch code
treats values equal to threshold as "above" (using arr < threshold and
~below_threshold) while add_value only flags strictly greater values; make
semantics consistent by treating equality as non-"above" in batch mode — change
the comparisons so below_threshold uses <= self.threshold (or arr_float <=
self.threshold in range mode) and compute above_count as np.sum(arr >
self.threshold) (or np.sum(arr_float > self.threshold)), keep setting
out_threshold only when above_count > 0, and adjust bin_indices generation to
use the new below_values (values <= threshold) so bin assignment remains
correct.

Comment on lines +148 to +213
```python
if self.plot_category == 1:
    # Frequency mode
    if self.breaks:
        below_threshold = arr < self.threshold
        below_values = arr[below_threshold]
        above_count = np.sum(~below_threshold)

        # Count values below threshold
        for val in below_values:
            val_str = str(int(val) if isinstance(val, (int, np.integer, float)) and float(val).is_integer() else val)
            if val_str in self.data:
                self.data[val_str][stack] += 1

        # Add above threshold count
        if above_count > 0:
            self.out_threshold = True
            self.data[self.bins[-1]][stack] += above_count
    else:
        # No breaks - use value_counts approach
        unique, counts = np.unique(arr, return_counts=True)
        for val, count in zip(unique, counts):
            val_str = str(val)
            if val_str in self.bins:
                self.data[val_str][stack] += count
            else:
                self.bins.append(val_str)
                if self.stacks:
                    self.data[val_str] = dict.fromkeys(self.stacks, 0)
                    self.data[val_str][stack] = count
                else:
                    self.data[val_str] = {stack: count}

        # Sort if needed (only for non-string values)
        if len(arr) > 0 and not isinstance(arr[0], str) and not self.stacks:
            try:
                data_keys = [type(arr[0])(i) for i in self.data.keys()]
                data_keys.sort()
                self.data = OrderedDict(
                    {str(i): {"total": self.data[str(i)]["total"]} for i in data_keys}
                )
            except (ValueError, TypeError):
                pass  # Keep original order if sorting fails

elif self.plot_category == 2:
    # Range mode - use numpy searchsorted for vectorized bin assignment
    arr_float = arr.astype(float)
    below_threshold = arr_float < self.threshold
    below_values = arr_float[below_threshold]
    above_count = np.sum(~below_threshold)

    if len(below_values) > 0:
        # Use searchsorted for vectorized binary search
        # searchsorted returns index where value would be inserted to maintain order
        # We use side='right' and subtract 1 to get the correct bin
        bin_indices = np.searchsorted(self.breaks, below_values, side='right') - 1
        bin_indices = np.clip(bin_indices, 0, len(self.bins) - 2)  # Ensure valid indices

        # Count occurrences in each bin
        unique_bins, bin_counts = np.unique(bin_indices, return_counts=True)
        for bin_idx, count in zip(unique_bins, bin_counts):
            self.data[self.bins[bin_idx]][stack] += count

    if above_count > 0:
        self.out_threshold = True
        self.data[self.bins[-1]][stack] += above_count
```
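As a standalone illustration of the `searchsorted` bin-assignment trick used in the snippet above (the breaks and values are toy data, not the `Histogram` class itself):

```python
import numpy as np

breaks = np.array([0.0, 1.0, 2.0, 3.0])   # edges defining 3 bins
values = np.array([0.5, 1.0, 2.7, 0.2])

# side="right" means a value equal to an edge falls into the higher bin
idx = np.searchsorted(breaks, values, side="right") - 1
idx = np.clip(idx, 0, len(breaks) - 2)    # clamp to valid bin indices

# Tally occurrences per bin in one vectorized pass
bins, counts = np.unique(idx, return_counts=True)
print(dict(zip(bins.tolist(), counts.tolist())))   # {0: 2, 1: 1, 2: 1}
```

One binary search per value replaces a Python-level scan over the break list, which is the source of the speedup in range mode.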


⚠️ Potential issue | 🟡 Minor

Keep out_threshold semantics consistent for equality cases.
In batch mode, any value equal to the threshold sets out_threshold=True, but add_value only sets it for strictly greater values. This can change the last-bin label in to_dict. Consider separating counts for >= vs > to preserve existing behavior.

🔧 Proposed fix
```diff
-                below_threshold = arr < self.threshold
-                below_values = arr[below_threshold]
-                above_count = np.sum(~below_threshold)
+                below_threshold = arr < self.threshold
+                below_values = arr[below_threshold]
+                ge_count = np.sum(~below_threshold)
+                gt_count = np.sum(arr > self.threshold)

                 # Count values below threshold
                 for val in below_values:
                     val_str = str(int(val) if isinstance(val, (int, np.integer, float)) and float(val).is_integer() else val)
                     if val_str in self.data:
                         self.data[val_str][stack] += 1

                 # Add above threshold count
-                if above_count > 0:
-                    self.out_threshold = True
-                    self.data[self.bins[-1]][stack] += above_count
+                if ge_count > 0:
+                    self.data[self.bins[-1]][stack] += ge_count
+                if gt_count > 0:
+                    self.out_threshold = True
-            below_threshold = arr_float < self.threshold
-            below_values = arr_float[below_threshold]
-            above_count = np.sum(~below_threshold)
+            below_threshold = arr_float < self.threshold
+            below_values = arr_float[below_threshold]
+            ge_count = np.sum(~below_threshold)
+            gt_count = np.sum(arr_float > self.threshold)

             if len(below_values) > 0:
                 ...
-            if above_count > 0:
-                self.out_threshold = True
-                self.data[self.bins[-1]][stack] += above_count
+            if ge_count > 0:
+                self.data[self.bins[-1]][stack] += ge_count
+            if gt_count > 0:
+                self.out_threshold = True
```
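The distinction the review draws between `>=` and `>` at the threshold can be checked in isolation (toy values, not the actual histogram state):

```python
import numpy as np

threshold = 10.0
arr = np.array([3.0, 10.0, 12.0])   # one value exactly at the threshold

# Values >= threshold all land in the last bin...
ge_count = int(np.sum(arr >= threshold))

# ...but only strictly-greater values should flip out_threshold,
# matching add_value's per-element semantics
gt_count = int(np.sum(arr > threshold))
```

With the sample above, `ge_count` is 2 but `gt_count` is 1: a value exactly at the threshold adds to the last bin without relabeling it, which is the behavior the proposed fix preserves.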
🧰 Tools
🪛 Ruff (0.14.13)

- Line 168: `zip()` without an explicit `strict=` parameter — add an explicit value for `strict=` (B905)
- Line 207: `zip()` without an explicit `strict=` parameter — add an explicit value for `strict=` (B905)


@ypriverol ypriverol merged commit c146be4 into main Jan 18, 2026
22 of 23 checks passed
