
Performance improvements by claude #560

Merged
ypriverol merged 2 commits into main from dev
Jan 18, 2026

Conversation


@ypriverol ypriverol commented Jan 18, 2026

Pull Request

Description

Performance optimizations across the histogram, mzML reading, and MaxQuant utility modules: batch histogram updates and vectorized NumPy/pandas operations replace per-row Python loops.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Test addition/update
  • Dependency update

Summary by CodeRabbit

  • Refactor
    • Optimized mass spectrometry data processing pipelines to improve performance through batch operations and vectorized calculations.
    • Enhanced histogram aggregation and distribution computations for more efficient frequency calculations.
    • Streamlined data filtering and aggregation operations across multiple modules for better overall responsiveness.


claude and others added 2 commits January 18, 2026 17:11
Key optimizations:
- Add batch processing method (add_values_batch) to Histogram class using
  numpy vectorized operations (searchsorted for bin assignment)
- Remove redundant sorting in histogram.add_value() for range mode
- Combine multiple groupby calls into single aggregation in ms_io.get_ms_qc_info()
- Vectorize drop_empty_row() using pandas DataFrame operations
- Replace apply(lambda) with np.where in evidence_oversampling()
- Optimize evidence_peptide_intensity() with vectorized fillna/where
- Optimize evidence_calibrated_mass_error() to derive frequency from counts
- Fix duplicate rt_rl_compute() call in evidence_peak_width_rt()
- Batch histogram updates in MzMLReader instead of per-spectrum calls
- Use set for faster spectrum ID lookups in MzMLReader
- Optimize pg_intensity_distr() with numpy array operations

These changes improve performance by reducing Python-level iteration
and leveraging numpy/pandas vectorized operations.
…2W3NR

Performance improvements and test verification
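The `apply(lambda)` to `np.where` swap that the commit message describes can be sketched as follows. The DataFrame and column names here are invented for illustration; they are not pmultiqc's actual evidence columns.

```python
import numpy as np
import pandas as pd

# Toy data standing in for an evidence table
df = pd.DataFrame({"msms_count": [0, 1, 3, 2]})

# Per-row Python callable (slow): one Python function call per row
slow = df["msms_count"].apply(lambda n: ">=3" if n >= 3 else str(n))

# Vectorized equivalent (fast): a single boolean mask over the whole column
fast = np.where(df["msms_count"] >= 3, ">=3", df["msms_count"].astype(str))

assert (slow.to_numpy() == fast).all()
```

Both forms produce the same labels; the vectorized version avoids the per-row Python-level function-call overhead, which is where the speedup comes from.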

coderabbitai bot commented Jan 18, 2026

📝 Walkthrough

Walkthrough

This pull request optimizes histogram data processing and utility functions by introducing batch processing methods and replacing iterative operations with vectorized computations. Changes include adding a new add_values_batch method to the Histogram class, refactoring spectrum data collection in mzML processing to defer plotting until batch completion, consolidating DataFrame and groupby operations, and replacing per-row/per-element loops with NumPy-based vectorized approaches across multiple utility modules.
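The defer-then-flush pattern described in the walkthrough can be sketched with a stdlib stand-in. `Counter` here is a toy substitute for the PR's `Histogram` class, and the spectra are invented sample data.

```python
from collections import Counter

# Invented sample spectra, standing in for mzML parsing output
spectra = [{"charge": 2}, {"charge": 3}, {"charge": 2}, {"charge": 4}]

counts = Counter()

# Old shape: counts.update([s["charge"]]) inside the loop, one call per spectrum.
# New shape: collect values first, then flush once after the whole file is read.
charges = [s["charge"] for s in spectra]
counts.update(charges)   # single batched update per file

print(dict(counts))      # {2: 2, 3: 1, 4: 1}
```

The totals are identical either way; deferring the update just replaces N small calls with one batch call, which matters when the per-call work (bin lookup, plotting) is non-trivial.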

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| **Histogram batching infrastructure**<br>`pmultiqc/modules/common/histogram.py` | Added `add_values_batch()` method supporting batch insertion for both frequency and range plot categories; removed redundant sort operation; imported numpy dependency |
| **Batch histogram integration**<br>`pmultiqc/modules/common/ms/mzml.py`, `pmultiqc/modules/common/ms_io.py` | Refactored spectrum data collection to accumulate values in batches (charge states, base peak intensities, MS2 peaks) separated by DIA mode and identification status; deferred histogram updates until after processing all spectra in a file; consolidated per-element histogram additions into batch operations |
| **Vectorization optimizations**<br>`pmultiqc/modules/maxquant/maxquant_utils.py` | Replaced per-row apply operations with vectorized NumPy operations and fillna logic for intensity distribution, peptide intensity, peak width, and oversampling calculations |
| **Utility function optimizations**<br>`pmultiqc/modules/common/common_utils.py`, `pmultiqc/modules/common/file_utils.py` | Refactored frequency calculation to use precomputed counts; replaced iterative mask construction with single vectorized boolean operation on DataFrame subset |
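The "frequency from precomputed counts" refactor can be sketched in a few lines; the values below are toy data, not the actual `common_utils` code.

```python
import numpy as np

# Toy observations
values = np.array([0.1, 0.1, 0.2, 0.3, 0.3, 0.3])

# One pass over the data yields both unique values and their counts...
uniq, counts = np.unique(values, return_counts=True)

# ...so relative frequencies are derived from counts, with no recount per value
freq = counts / counts.sum()
```

Deriving frequencies from the already-computed counts avoids a second traversal of the data for every unique value.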

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • Dev #524 — Modifies related frequency/count logic in evidence_calibrated_mass_error and stats.py, sharing optimization patterns with this PR's common_utils changes

Suggested labels

Review effort 3/5

Suggested reviewers

  • daichengxin

Poem

🐰 Batches hop and vectorize with glee,
No more loops—efficiency!
Histograms pile high, then plot with grace,
NumPy speeds up every race. ✨

🚥 Pre-merge checks: 1 passed, 2 failed

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 37.50%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title is vague and generic, using the phrase 'Performance improvements by claude' without conveying specific details about what aspects of the codebase were optimized. | Provide a more descriptive title that highlights the main optimization targets (e.g., 'Optimize histogram batch processing and vectorize MS data aggregation') to help reviewers quickly understand the scope of changes. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pmultiqc/modules/common/histogram.py`:
- Around line 148-213: The batch-mode threshold handling is inconsistent with
add_value: in histogram.py (symbols: plot_category, out_threshold,
below_threshold, above_count, bin_indices, self.breaks, self.bins) batch code
treats values equal to threshold as "above" (using arr < threshold and
~below_threshold) while add_value only flags strictly greater values; make
semantics consistent by treating equality as non-"above" in batch mode — change
the comparisons so below_threshold uses <= self.threshold (or arr_float <=
self.threshold in range mode) and compute above_count as np.sum(arr >
self.threshold) (or np.sum(arr_float > self.threshold)), keep setting
out_threshold only when above_count > 0, and adjust bin_indices generation to
use the new below_values (values <= threshold) so bin assignment remains
correct.

Comment on lines +148 to +213
```python
if self.plot_category == 1:
    # Frequency mode
    if self.breaks:
        below_threshold = arr < self.threshold
        below_values = arr[below_threshold]
        above_count = np.sum(~below_threshold)

        # Count values below threshold
        for val in below_values:
            val_str = str(int(val) if isinstance(val, (int, np.integer, float)) and float(val).is_integer() else val)
            if val_str in self.data:
                self.data[val_str][stack] += 1

        # Add above threshold count
        if above_count > 0:
            self.out_threshold = True
            self.data[self.bins[-1]][stack] += above_count
    else:
        # No breaks - use value_counts approach
        unique, counts = np.unique(arr, return_counts=True)
        for val, count in zip(unique, counts):
            val_str = str(val)
            if val_str in self.bins:
                self.data[val_str][stack] += count
            else:
                self.bins.append(val_str)
                if self.stacks:
                    self.data[val_str] = dict.fromkeys(self.stacks, 0)
                    self.data[val_str][stack] = count
                else:
                    self.data[val_str] = {stack: count}

        # Sort if needed (only for non-string values)
        if len(arr) > 0 and not isinstance(arr[0], str) and not self.stacks:
            try:
                data_keys = [type(arr[0])(i) for i in self.data.keys()]
                data_keys.sort()
                self.data = OrderedDict(
                    {str(i): {"total": self.data[str(i)]["total"]} for i in data_keys}
                )
            except (ValueError, TypeError):
                pass  # Keep original order if sorting fails

elif self.plot_category == 2:
    # Range mode - use numpy searchsorted for vectorized bin assignment
    arr_float = arr.astype(float)
    below_threshold = arr_float < self.threshold
    below_values = arr_float[below_threshold]
    above_count = np.sum(~below_threshold)

    if len(below_values) > 0:
        # Use searchsorted for vectorized binary search
        # searchsorted returns index where value would be inserted to maintain order
        # We use side='right' and subtract 1 to get the correct bin
        bin_indices = np.searchsorted(self.breaks, below_values, side='right') - 1
        bin_indices = np.clip(bin_indices, 0, len(self.bins) - 2)  # Ensure valid indices

        # Count occurrences in each bin
        unique_bins, bin_counts = np.unique(bin_indices, return_counts=True)
        for bin_idx, count in zip(unique_bins, bin_counts):
            self.data[self.bins[bin_idx]][stack] += count

    if above_count > 0:
        self.out_threshold = True
        self.data[self.bins[-1]][stack] += above_count
```
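As a standalone illustration of the `searchsorted` bin-assignment trick used in the snippet above (the breaks and values are toy data, not the `Histogram` class itself):

```python
import numpy as np

breaks = np.array([0.0, 1.0, 2.0, 3.0])   # edges defining 3 bins
values = np.array([0.5, 1.0, 2.7, 0.2])

# side="right" means a value equal to an edge falls into the higher bin
idx = np.searchsorted(breaks, values, side="right") - 1
idx = np.clip(idx, 0, len(breaks) - 2)    # clamp to valid bin indices

# Tally occurrences per bin in one vectorized pass
bins, counts = np.unique(idx, return_counts=True)
print(dict(zip(bins.tolist(), counts.tolist())))   # {0: 2, 1: 1, 2: 1}
```

One binary search per value replaces a Python-level scan over the break list, which is the source of the speedup in range mode.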


⚠️ Potential issue | 🟡 Minor

Keep out_threshold semantics consistent for equality cases.
In batch mode, any value equal to the threshold sets out_threshold=True, but add_value only sets it for strictly greater values. This can change the last-bin label in to_dict. Consider separating counts for >= vs > to preserve existing behavior.

🔧 Proposed fix
```diff
-                below_threshold = arr < self.threshold
-                below_values = arr[below_threshold]
-                above_count = np.sum(~below_threshold)
+                below_threshold = arr < self.threshold
+                below_values = arr[below_threshold]
+                ge_count = np.sum(~below_threshold)
+                gt_count = np.sum(arr > self.threshold)

                 # Count values below threshold
                 for val in below_values:
                     val_str = str(int(val) if isinstance(val, (int, np.integer, float)) and float(val).is_integer() else val)
                     if val_str in self.data:
                         self.data[val_str][stack] += 1

                 # Add above threshold count
-                if above_count > 0:
-                    self.out_threshold = True
-                    self.data[self.bins[-1]][stack] += above_count
+                if ge_count > 0:
+                    self.data[self.bins[-1]][stack] += ge_count
+                if gt_count > 0:
+                    self.out_threshold = True
-            below_threshold = arr_float < self.threshold
-            below_values = arr_float[below_threshold]
-            above_count = np.sum(~below_threshold)
+            below_threshold = arr_float < self.threshold
+            below_values = arr_float[below_threshold]
+            ge_count = np.sum(~below_threshold)
+            gt_count = np.sum(arr_float > self.threshold)

             if len(below_values) > 0:
                 ...
-            if above_count > 0:
-                self.out_threshold = True
-                self.data[self.bins[-1]][stack] += above_count
+            if ge_count > 0:
+                self.data[self.bins[-1]][stack] += ge_count
+            if gt_count > 0:
+                self.out_threshold = True
```
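The distinction the review draws between `>=` and `>` at the threshold can be checked in isolation (toy values, not the actual histogram state):

```python
import numpy as np

threshold = 10.0
arr = np.array([3.0, 10.0, 12.0])   # one value exactly at the threshold

# Values >= threshold all land in the last bin...
ge_count = int(np.sum(arr >= threshold))

# ...but only strictly-greater values should flip out_threshold,
# matching add_value's per-element semantics
gt_count = int(np.sum(arr > threshold))
```

With the sample above, `ge_count` is 2 but `gt_count` is 1: a value exactly at the threshold adds to the last bin without relabeling it, which is the behavior the proposed fix preserves.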
🧰 Tools
🪛 Ruff (0.14.13)

- Line 168: `zip()` without an explicit `strict=` parameter — add an explicit value for `strict=` (B905)
- Line 207: `zip()` without an explicit `strict=` parameter — add an explicit value for `strict=` (B905)


@ypriverol ypriverol merged commit c146be4 into main Jan 18, 2026
22 of 23 checks passed
