Conversation
Key optimizations: - Add batch processing method (add_values_batch) to Histogram class using numpy vectorized operations (searchsorted for bin assignment) - Remove redundant sorting in histogram.add_value() for range mode - Combine multiple groupby calls into single aggregation in ms_io.get_ms_qc_info() - Vectorize drop_empty_row() using pandas DataFrame operations - Replace apply(lambda) with np.where in evidence_oversampling() - Optimize evidence_peptide_intensity() with vectorized fillna/where - Optimize evidence_calibrated_mass_error() to derive frequency from counts - Fix duplicate rt_rl_compute() call in evidence_peak_width_rt() - Batch histogram updates in MzMLReader instead of per-spectrum calls - Use set for faster spectrum ID lookups in MzMLReader - Optimize pg_intensity_distr() with numpy array operations These changes improve performance by reducing Python-level iteration and leveraging numpy/pandas vectorized operations.
…2W3NR Performance improvements and test verification
📝 WalkthroughWalkthroughThis pull request optimizes histogram data processing and utility functions by introducing batch processing methods and replacing iterative operations with vectorized computations. Changes include adding a new Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/common/histogram.py`:
- Around line 148-213: The batch-mode threshold handling is inconsistent with
add_value: in histogram.py (symbols: plot_category, out_threshold,
below_threshold, above_count, bin_indices, self.breaks, self.bins) batch code
treats values equal to threshold as "above" (using arr < threshold and
~below_threshold) while add_value only flags strictly greater values; make
semantics consistent by treating equality as non-"above" in batch mode — change
the comparisons so below_threshold uses <= self.threshold (or arr_float <=
self.threshold in range mode) and compute above_count as np.sum(arr >
self.threshold) (or np.sum(arr_float > self.threshold)), keep setting
out_threshold only when above_count > 0, and adjust bin_indices generation to
use the new below_values (values <= threshold) so bin assignment remains
correct.
| if self.plot_category == 1: | ||
| # Frequency mode | ||
| if self.breaks: | ||
| below_threshold = arr < self.threshold | ||
| below_values = arr[below_threshold] | ||
| above_count = np.sum(~below_threshold) | ||
|
|
||
| # Count values below threshold | ||
| for val in below_values: | ||
| val_str = str(int(val) if isinstance(val, (int, np.integer, float)) and float(val).is_integer() else val) | ||
| if val_str in self.data: | ||
| self.data[val_str][stack] += 1 | ||
|
|
||
| # Add above threshold count | ||
| if above_count > 0: | ||
| self.out_threshold = True | ||
| self.data[self.bins[-1]][stack] += above_count | ||
| else: | ||
| # No breaks - use value_counts approach | ||
| unique, counts = np.unique(arr, return_counts=True) | ||
| for val, count in zip(unique, counts): | ||
| val_str = str(val) | ||
| if val_str in self.bins: | ||
| self.data[val_str][stack] += count | ||
| else: | ||
| self.bins.append(val_str) | ||
| if self.stacks: | ||
| self.data[val_str] = dict.fromkeys(self.stacks, 0) | ||
| self.data[val_str][stack] = count | ||
| else: | ||
| self.data[val_str] = {stack: count} | ||
|
|
||
| # Sort if needed (only for non-string values) | ||
| if len(arr) > 0 and not isinstance(arr[0], str) and not self.stacks: | ||
| try: | ||
| data_keys = [type(arr[0])(i) for i in self.data.keys()] | ||
| data_keys.sort() | ||
| self.data = OrderedDict( | ||
| {str(i): {"total": self.data[str(i)]["total"]} for i in data_keys} | ||
| ) | ||
| except (ValueError, TypeError): | ||
| pass # Keep original order if sorting fails | ||
|
|
||
| elif self.plot_category == 2: | ||
| # Range mode - use numpy searchsorted for vectorized bin assignment | ||
| arr_float = arr.astype(float) | ||
| below_threshold = arr_float < self.threshold | ||
| below_values = arr_float[below_threshold] | ||
| above_count = np.sum(~below_threshold) | ||
|
|
||
| if len(below_values) > 0: | ||
| # Use searchsorted for vectorized binary search | ||
| # searchsorted returns index where value would be inserted to maintain order | ||
| # We use side='right' and subtract 1 to get the correct bin | ||
| bin_indices = np.searchsorted(self.breaks, below_values, side='right') - 1 | ||
| bin_indices = np.clip(bin_indices, 0, len(self.bins) - 2) # Ensure valid indices | ||
|
|
||
| # Count occurrences in each bin | ||
| unique_bins, bin_counts = np.unique(bin_indices, return_counts=True) | ||
| for bin_idx, count in zip(unique_bins, bin_counts): | ||
| self.data[self.bins[bin_idx]][stack] += count | ||
|
|
||
| if above_count > 0: | ||
| self.out_threshold = True | ||
| self.data[self.bins[-1]][stack] += above_count | ||
|
|
There was a problem hiding this comment.
Keep out_threshold semantics consistent for equality cases.
In batch mode, any value equal to the threshold sets out_threshold=True, but add_value only sets it for strictly greater values. This can change the last-bin label in to_dict. Consider separating counts for >= vs > to preserve existing behavior.
🔧 Proposed fix
- below_threshold = arr < self.threshold
- below_values = arr[below_threshold]
- above_count = np.sum(~below_threshold)
+ below_threshold = arr < self.threshold
+ below_values = arr[below_threshold]
+ ge_count = np.sum(~below_threshold)
+ gt_count = np.sum(arr > self.threshold)
# Count values below threshold
for val in below_values:
val_str = str(int(val) if isinstance(val, (int, np.integer, float)) and float(val).is_integer() else val)
if val_str in self.data:
self.data[val_str][stack] += 1
# Add above threshold count
- if above_count > 0:
- self.out_threshold = True
- self.data[self.bins[-1]][stack] += above_count
+ if ge_count > 0:
+ self.data[self.bins[-1]][stack] += ge_count
+ if gt_count > 0:
+ self.out_threshold = True- below_threshold = arr_float < self.threshold
- below_values = arr_float[below_threshold]
- above_count = np.sum(~below_threshold)
+ below_threshold = arr_float < self.threshold
+ below_values = arr_float[below_threshold]
+ ge_count = np.sum(~below_threshold)
+ gt_count = np.sum(arr_float > self.threshold)
if len(below_values) > 0:
...
- if above_count > 0:
- self.out_threshold = True
- self.data[self.bins[-1]][stack] += above_count
+ if ge_count > 0:
+ self.data[self.bins[-1]][stack] += ge_count
+ if gt_count > 0:
+ self.out_threshold = True🧰 Tools
🪛 Ruff (0.14.13)
168-168: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
207-207: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
🤖 Prompt for AI Agents
In `@pmultiqc/modules/common/histogram.py` around lines 148 - 213, The batch-mode
threshold handling is inconsistent with add_value: in histogram.py (symbols:
plot_category, out_threshold, below_threshold, above_count, bin_indices,
self.breaks, self.bins) batch code treats values equal to threshold as "above"
(using arr < threshold and ~below_threshold) while add_value only flags strictly
greater values; make semantics consistent by treating equality as non-"above" in
batch mode — change the comparisons so below_threshold uses <= self.threshold
(or arr_float <= self.threshold in range mode) and compute above_count as
np.sum(arr > self.threshold) (or np.sum(arr_float > self.threshold)), keep
setting out_threshold only when above_count > 0, and adjust bin_indices
generation to use the new below_values (values <= threshold) so bin assignment
remains correct.
Pull Request
Description
Brief description of the changes made in this PR.
Type of Change
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.