
Improving FragPipe #544

Merged
ypriverol merged 19 commits into main from dev
Jan 18, 2026

Conversation


@ypriverol ypriverol commented Jan 16, 2026

Pull Request

Description

Adds comprehensive FragPipe QC support: parameter tables parsed from fragpipe.workflow and fragger.params, an experiment design table from the fp-manifest file, protein intensity distributions with contaminant tracking, MS/MS counts per peak, and experiment-aware grouping of FragPipe files in the service layer.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Test addition/update
  • Dependency updates

Summary by CodeRabbit

  • New Features

    • Significantly expanded FragPipe module with new visualizations: parameters table, experiment design table, protein intensity distribution with contaminant tracking, and MS/MS counts per peak analysis.
    • Automatic intelligent grouping of FragPipe output files by experiment for streamlined processing.
  • Documentation

    • Updated example reports with new FragPipe reference dataset.


claude and others added 3 commits January 16, 2026 07:18
This commit addresses issue #541 by adding comprehensive FragPipe QC support
similar to MaxQuant. New features include:

**Parameters Section:**
- Parse fragpipe.workflow file for configuration parameters
- Display key settings (enzyme, mass tolerances, modifications, MBR, etc.)
- Added search pattern for *.workflow files

**Experiment Design:**
- Parse fp-manifest file for experiment design information
- Display file names, experiments, bioreplicates, and data types
- Added search pattern for *.fp-manifest files

**Protein Intensity Distribution:**
- Parse combined_protein.tsv for protein-level quantification
- Generate boxplot of log2-transformed protein intensities
- Separate visualization for sample vs contaminant proteins

**Match-Between-Runs (MBR) Visualization:**
- Analyze combined_protein.tsv and combined_peptide.tsv for MBR contribution
- Display stacked bar chart showing MS/MS only, MBR only, and both categories
- Helps assess MBR quality and potential false positives

**MS/MS Counts Per Peak:**
- Parse combined_ion.tsv for MS/MS spectral counts
- Generate boxplot of spectral count distribution per sample
- Alternative bar chart view for identification status

All new features use graceful fallback - missing files don't prevent report
generation. This follows the same pattern used by MaxQuant and DIA-NN modules.

Resolves: #541
Add parsing of MSFragger's fragger.params file for search engine parameters.
This provides additional QC information and serves as a fallback when
fragpipe.workflow is not available.

Changes:
- Add fragger_params_reader() function to parse fragger.params files
- Update get_workflow_parameters_table() to merge parameters from both
  workflow and fragger.params files (workflow takes precedence)
- Add search pattern for *.params files
- Add additional MSFragger-specific parameters to the parameters table:
  - Number of Threads
  - Decoy Prefix
  - Isotope Error
  - Mass Offsets
  - Precursor True Tolerance/Units
  - Calibrate Mass
  - Clip N-term Met
  - Min/Max Peptide Length

The implementation follows the same graceful fallback pattern - if either
file is available, parameters will be displayed. If both are available,
workflow parameters take precedence with fragger.params providing additional
information.
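
A minimal sketch of that precedence rule, assuming both sources are parsed into plain dicts; merge_parameters and the example keys are illustrative, not the module's actual API:

def merge_parameters(workflow_params: dict | None, fragger_params: dict | None) -> dict:
    # Start from fragger.params, then overlay workflow values so the
    # workflow file wins on any shared key while fragger.params still
    # contributes keys the workflow lacks.
    merged = dict(fragger_params or {})
    merged.update(workflow_params or {})
    return merged

# The shared "enzyme" key comes from the workflow; "num_threads" survives
# from fragger.params.
print(merge_parameters({"enzyme": "trypsin"},
                       {"enzyme": "lysc", "num_threads": "8"}))
# {'enzyme': 'trypsin', 'num_threads': '8'}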
…ALIA

Improve FragPipe support and file handling
@coderabbitai coderabbitai bot (Contributor) commented Jan 16, 2026

📝 Walkthrough

This PR substantially expands FragPipe integration by introducing I/O handlers for workflow parameters, manifest files, and combined protein/peptide/ion outputs. It adds experiment-aware file grouping in the service layer, refactors shared plotting functions to accept data_type parameters, and updates test/documentation datasets from PXD066146 to PXD062399.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| FragPipe Module Expansion: pmultiqc/modules/fragpipe/fragpipe.py, pmultiqc/modules/fragpipe/fragpipe_io.py | Added 12+ new I/O reader functions (workflow_reader, fragger_params_reader, manifest_reader, combined_protein/peptide/ion_reader, etc.); new data processing helpers (get_workflow_parameters_table, get_experiment_design_table, get_protein_intensity_distribution, get_msms_counts_per_peak, cal_peptide_id_gain); 4 new static plotting methods (draw_parameters, draw_experiment_design, draw_protein_intensity_distribution, draw_mbr_contribution); extended FragPipeModule to load and render workflow params, manifest, and protein/peptide/ion data. |
| Configuration & Service Layer: pmultiqc/main.py, pmultiqc_service/app.py | Added default MultiQC config entries for FragPipe file patterns (*.workflow, *.fp-manifest, *.params); introduced the FRAGPIPE_FILE_SUFFIXES constant and group_fragpipe_files() function to enable experiment-aware file grouping and batch processing in PRIDE job handling. |
| Plotting Infrastructure Refactor: pmultiqc/modules/common/plots/id.py, pmultiqc/modules/maxquant/maxquant.py, pmultiqc/modules/maxquant/maxquant_plots.py, pmultiqc/modules/mzidentml/mzidentml.py, pmultiqc/modules/quantms/quantms.py | Refactored draw_identification to accept an optional draw_peptide_id_count flag; replaced the is_maxquant boolean with a data_type: str parameter in draw_oversampling; updated the draw_evidence_peptide_id_count signature to accept data_type and render conditional description text; propagated the data_type argument across the maxquant, mzidentml, and quantms modules. |
| Test Data & Documentation: .github/workflows/python-app.yml, docs/config.json, docs/README.md | Updated the FragPipe test workflow to download and extract PXD062399.zip instead of individual TSV files; updated example dataset references from PXD066146 to PXD062399 in config and README. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Service as pmultiqc_service.app
    participant Discovery as File Discovery
    participant Readers as fragpipe_io Readers
    participant Module as FragPipeModule
    participant Plots as Plotting Functions

    User->>Service: Submit PRIDE job (FragPipe files)
    Service->>Discovery: Scan downloaded files
    Discovery-->>Service: Return fragpipe files list
    Service->>Service: group_fragpipe_files() by experiment
    
    loop For each experiment group
        Service->>Readers: Load workflow, manifest, combined_*
        Readers->>Readers: Parse/validate files
        Readers-->>Module: Return structured data
        
        Module->>Readers: Process workflow + fragger params
        Readers-->>Module: parameters_table
        
        Module->>Readers: Extract manifest
        Readers-->>Module: experiment_design_table
        
        Module->>Readers: Load combined_protein/peptide/ion
        Readers-->>Module: protein_df, peptide_df, ion_df
        Module->>Module: Compute distributions & metrics
        
        Module->>Plots: Render parameters, experiment design
        Module->>Plots: Render protein intensity distribution
        Module->>Plots: Render MS/MS counts (via draw_oversampling)
        Plots-->>User: MultiQC report with FragPipe sections
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested labels

Review effort 4/5, feature, fragpipe, service-layer

Suggested reviewers

  • yueqixuan
  • daichengxin

Poem

🐰 FragPipe flows now through our pipes so clear,
With manifests and proteins dancing here,
Experiment designs in tables bright,
MS/MS counts plotting through the night,
A rabbit hops through peptides with delight! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.16%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Improving FragPipe' is vague and non-descriptive; it doesn't convey the main changes (comprehensive QC support, file parsing, visualization features). | Use a more specific title, such as 'Add comprehensive FragPipe QC visualizations and MSFragger parameter parsing' or 'Enhance FragPipe integration with combined output analysis'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 761-793: The code initializes peptide-level entries in mbr_stats
but never fills them; either remove the 'peptides' initialization or implement
peptide-level MBR calculation analogous to the protein block: when a peptide
dataframe (e.g., peptide_df or peptides_df) is available, locate the matching
spectral count column in that dataframe (similar logic used for protein_df), set
intensity_col = sample, compute has_spectral, has_intensity, then compute
msms_only, mbr_only, both and assign them into mbr_stats[sample]['peptides'] as
ints (mirroring the structure used for mbr_stats[sample]['proteins']).
🧹 Nitpick comments (6)
pmultiqc/modules/fragpipe/fragpipe_io.py (3)

66-77: Consider using exact match for psm.tsv for consistency.

Lines 68-75 use exact filename matching for ion.tsv, combined_ion.tsv, combined_protein.tsv, and combined_peptide.tsv. However, line 76 uses "psm" in filename which could unintentionally match files like mypsm.tsv or psm_backup.tsv.

♻️ Suggested fix for consistent matching
         elif filename == "combined_peptide.tsv":
             fragpipe_files["combined_peptide"].append(full_path)
-        elif "psm" in filename:
+        elif filename == "psm.tsv":
             fragpipe_files["psm"].append(full_path)

318-353: Consider extracting common key-value parsing logic.

fragger_params_reader and workflow_reader share nearly identical parsing logic. Consider extracting a shared helper function.

♻️ Optional refactor to reduce duplication
def _parse_key_value_file(file_path: str, file_type: str) -> dict | None:
    """Parse a key=value format file."""
    parameters = {}
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                if '=' in line:
                    key, value = line.split('=', 1)
                    parameters[key.strip()] = value.strip()
    except Exception as e:
        log.warning(f"Error reading {file_type} file {file_path}: {e}")
        return None
    log.info(f"Loaded {len(parameters)} parameters from {file_type} file")
    return parameters

def workflow_reader(file_path: str):
    return _parse_key_value_file(file_path, "workflow")

def fragger_params_reader(file_path: str):
    return _parse_key_value_file(file_path, "fragger.params")

356-361: Use explicit Optional or union type for optional parameter.

Per PEP 484, use explicit Optional[dict] or dict | None instead of dict = None.

♻️ Suggested fix
-def get_workflow_parameters_table(parameters: dict, fragger_params: dict = None):
+def get_workflow_parameters_table(parameters: dict, fragger_params: dict | None = None):
pmultiqc/modules/fragpipe/fragpipe.py (3)

202-225: Prefix unused unpacked variables with underscore.

mbr_cols (line 205) and mbr_info (line 222) are unpacked but never used. Prefix with underscore to indicate intentional non-use.

♻️ Suggested fix
-                self.protein_df, self.protein_intensity_cols, mbr_cols = combined_protein_reader(protein_path)
+                self.protein_df, self.protein_intensity_cols, _mbr_cols = combined_protein_reader(protein_path)
-                self.peptide_df, self.peptide_sample_cols, mbr_info = combined_peptide_reader(peptide_path)
+                self.peptide_df, self.peptide_sample_cols, _mbr_info = combined_peptide_reader(peptide_path)

1123-1127: Use explicit Optional or union type for optional parameter.

Per PEP 484, use explicit type annotation for optional parameters.

♻️ Suggested fix
     @staticmethod
-    def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict = None):
+    def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict | None = None):

1212-1222: Consider using any() for cleaner validation.

The loop variable sample is unused. Consider using any() with a generator expression.

♻️ Suggested refactor
-        # Check if we have any meaningful MBR data
-        has_data = False
-        for sample, stats in mbr_stats.items():
-            proteins = stats.get('proteins', {})
-            if proteins.get('mbr_only', 0) > 0 or proteins.get('both', 0) > 0:
-                has_data = True
-                break
-
-        if not has_data:
+        # Check if we have any meaningful MBR data
+        has_data = any(
+            stats.get('proteins', {}).get('mbr_only', 0) > 0 or 
+            stats.get('proteins', {}).get('both', 0) > 0
+            for stats in mbr_stats.values()
+        )
+        if not has_data:
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6053ac and 1819a98.

📒 Files selected for processing (3)
  • pmultiqc/main.py
  • pmultiqc/modules/fragpipe/fragpipe.py
  • pmultiqc/modules/fragpipe/fragpipe_io.py
🧰 Additional context used
🧬 Code graph analysis (1)
pmultiqc/modules/fragpipe/fragpipe.py (1)
pmultiqc/modules/fragpipe/fragpipe_io.py (11)
  • workflow_reader (282-315)
  • fragger_params_reader (318-353)
  • get_workflow_parameters_table (356-461)
  • manifest_reader (464-508)
  • get_experiment_design_table (511-544)
  • combined_protein_reader (547-611)
  • get_protein_intensity_distribution (614-671)
  • combined_peptide_reader (674-726)
  • get_mbr_stats (729-793)
  • combined_ion_reader (796-864)
  • get_msms_counts_per_peak (867-917)
🪛 Ruff (0.14.11)
pmultiqc/modules/fragpipe/fragpipe.py

171-171: Do not catch blind exception: Exception

(BLE001)


181-181: Do not catch blind exception: Exception

(BLE001)


198-198: Do not catch blind exception: Exception

(BLE001)


205-205: Unpacked variable mbr_cols is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


215-215: Do not catch blind exception: Exception

(BLE001)


222-222: Unpacked variable mbr_info is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


224-224: Do not catch blind exception: Exception

(BLE001)


234-234: Do not catch blind exception: Exception

(BLE001)


247-247: Do not catch blind exception: Exception

(BLE001)


1124-1124: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


1214-1214: Loop control variable sample not used within loop body

(B007)

pmultiqc/modules/fragpipe/fragpipe_io.py

310-310: Do not catch blind exception: Exception

(BLE001)


348-348: Do not catch blind exception: Exception

(BLE001)


356-356: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


504-504: Consider moving this statement to an else block

(TRY300)


506-506: Do not catch blind exception: Exception

(BLE001)


570-570: Do not catch blind exception: Exception

(BLE001)


696-696: Do not catch blind exception: Exception

(BLE001)


817-817: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: test_tmt
  • GitHub Check: test_fragpipe
  • GitHub Check: test_maxquant_dia
  • GitHub Check: test_maxquant
  • GitHub Check: test_dia
  • GitHub Check: test_diann
  • GitHub Check: test_lfq
  • GitHub Check: test_proteobench
  • GitHub Check: test_mzid_mgf
  • GitHub Check: test_mzid_mzML
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (18)
pmultiqc/modules/fragpipe/fragpipe_io.py (8)

21-34: LGTM!

The REQUIRED_COLS dictionary is properly extended with the new file types (combined_protein, combined_peptide, combined_ion). The column definitions align with FragPipe output file specifications.


282-315: LGTM!

The workflow_reader function correctly parses key=value format files with proper handling for comments and empty lines. The broad exception handling (flagged by static analysis) is acceptable here as graceful degradation is the intended behavior for optional file loading.


464-508: LGTM!

The manifest_reader handles variable-column manifest files correctly with dynamic column assignment. The error handling with graceful fallback to None is appropriate for optional file loading.


547-611: LGTM with note on MBR column parsing.

The function correctly loads protein data and identifies sample intensity columns. The MBR column parsing logic at lines 598-607 uses string replacement to extract sample base names, which may not handle all naming conventions, but the fallback behavior is safe.


614-671: LGTM!

The function correctly separates sample and contaminant proteins, applies log2 transformation with proper filtering of non-positive values, and handles edge cases gracefully.


796-864: LGTM!

The function correctly reads combined_ion.tsv with robust sample column identification using both anchor-based and pattern-matching fallback approaches. Consistent with the existing ion_reader implementation.


867-917: LGTM!

The function provides two modes of operation based on available data: detailed spectral count statistics or intensity-based identification counts. The fallback approach ensures useful output even when spectral count columns are unavailable.


530-543: Bug: Incorrect column existence check in row iteration.

The checks like if 'experiment' in row test whether the string 'experiment' is a value in the row, not whether the column exists. Since you're iterating with iterrows(), row is a pandas Series where in checks values, not index labels.

🐛 Proposed fix
     for idx, row in manifest_df.iterrows():
         file_name = row.get('file_name', row.get('file_path', f'File_{idx}'))
         entry = {
             "file_name": file_name,
         }
-        if 'experiment' in row:
+        if 'experiment' in manifest_df.columns:
             entry["experiment"] = row['experiment']
-        if 'bioreplicate' in row:
+        if 'bioreplicate' in manifest_df.columns:
             entry["bioreplicate"] = row['bioreplicate']
-        if 'data_type' in row:
+        if 'data_type' in manifest_df.columns:
             entry["data_type"] = row['data_type']

         table_data[idx + 1] = entry

Likely an incorrect or invalid review comment.
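
For reference, a membership test on the pandas Series yielded by iterrows() checks the index labels (here, the column names), not the values, so the original if 'experiment' in row check does behave as intended; a small demonstration:

import pandas as pd

row = pd.Series({"experiment": "A", "bioreplicate": 1})
print("experiment" in row)  # True: `in` consults the Series index labels
print("A" in row)           # False: values are not checked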

pmultiqc/modules/fragpipe/fragpipe.py (7)

100-127: LGTM!

New data containers are properly initialized for all the additional FragPipe file types. The attribute grouping with comments improves code organization.


256-269: LGTM!

Parameters and experiment design tables are properly integrated with appropriate order values (1 and 2) ensuring they appear at the top of the experiment section.


380-400: LGTM!

New visualizations for protein intensity distribution, MBR contribution, and MS/MS counts are properly guarded with data availability checks before drawing.


402-416: Good defensive coding with None filtering.

Using .get("experiment") and filtering out None values from section_group_dict prevents KeyError if the experiment section is not configured.


1021-1068: LGTM!

The draw_parameters method follows the established pattern for table rendering with appropriate configuration options.


1070-1121: LGTM!

The draw_experiment_design method is well-structured with descriptive headers and appropriate helptext.


1292-1413: LGTM!

The draw_msms_counts method handles two data formats appropriately with distinct visualizations. The data availability checks ensure safe dictionary access.

pmultiqc/main.py (3)

155-161: LGTM!

The workflow file pattern registration follows the established convention and enables proper file discovery for FragPipe configuration files.


162-167: LGTM!

The fp-manifest pattern registration is appropriately specific for FragPipe manifest files.


155-174: The *.params pattern is appropriate for this use case.

The broad pattern *.params at the discovery level is intentional and safe. The fragpipe module explicitly filters discovered files to only process the exact filename fragger.params (see fragpipe_io.py line 96: if filename == "fragger.params"), preventing any false positives from being processed.


Comment on lines +761 to +793
for sample in sample_cols:
    mbr_stats[sample] = {
        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
        'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
    }

    # Protein-level MBR stats
    if protein_df is not None:
        spectral_col = None
        intensity_col = sample

        # Find corresponding spectral count column
        for col in protein_df.columns:
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col:
                if 'spectral count' in col.lower():
                    spectral_col = col
                    break

        if spectral_col and intensity_col in protein_df.columns:
            has_spectral = protein_df[spectral_col] > 0
            has_intensity = protein_df[intensity_col] > 0

            msms_only = ((has_spectral) & (~has_intensity)).sum()
            mbr_only = ((~has_spectral) & (has_intensity)).sum()
            both = ((has_spectral) & (has_intensity)).sum()

            mbr_stats[sample]['proteins'] = {
                'msms_only': int(msms_only),
                'mbr_only': int(mbr_only),
                'both': int(both)
            }

return mbr_stats

⚠️ Potential issue | 🟡 Minor

Peptide-level MBR statistics are initialized but never calculated.

The function initializes mbr_stats[sample]['peptides'] with zeros at line 764, but there's no corresponding calculation for peptide-level MBR stats (only protein-level stats are computed at lines 768-791). Either implement the peptide calculation or remove the misleading initialization.

💡 Option 1: Remove unused peptide stats
         mbr_stats[sample] = {
-            'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
-            'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
+            'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
         }

Or implement peptide-level MBR calculations if this data is needed for future visualization.
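
If Option 2 is preferred, a sketch of the peptide-level calculation that mirrors the protein block, assuming combined_peptide.tsv follows the same "<sample> Spectral Count" / intensity column conventions as combined_protein.tsv (peptide_df is hypothetical here and would need to be passed into the function):

        # Peptide-level MBR stats, mirroring the protein block above
        if peptide_df is not None:
            base = sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '')
            pep_spectral_col = next(
                (c for c in peptide_df.columns
                 if base in c and 'spectral count' in c.lower()),
                None,
            )
            if pep_spectral_col and sample in peptide_df.columns:
                has_spectral = peptide_df[pep_spectral_col] > 0
                has_intensity = peptide_df[sample] > 0
                mbr_stats[sample]['peptides'] = {
                    'msms_only': int((has_spectral & ~has_intensity).sum()),
                    'mbr_only': int((~has_spectral & has_intensity).sum()),
                    'both': int((has_spectral & has_intensity).sum()),
                }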

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

for sample in sample_cols:
    mbr_stats[sample] = {
        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
        'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
    }
    # Protein-level MBR stats
    if protein_df is not None:
        spectral_col = None
        intensity_col = sample
        # Find corresponding spectral count column
        for col in protein_df.columns:
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col:
                if 'spectral count' in col.lower():
                    spectral_col = col
                    break
        if spectral_col and intensity_col in protein_df.columns:
            has_spectral = protein_df[spectral_col] > 0
            has_intensity = protein_df[intensity_col] > 0
            msms_only = ((has_spectral) & (~has_intensity)).sum()
            mbr_only = ((~has_spectral) & (has_intensity)).sum()
            both = ((has_spectral) & (has_intensity)).sum()
            mbr_stats[sample]['proteins'] = {
                'msms_only': int(msms_only),
                'mbr_only': int(mbr_only),
                'both': int(both)
            }
return mbr_stats

After:

for sample in sample_cols:
    mbr_stats[sample] = {
        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
    }
    # Protein-level MBR stats
    if protein_df is not None:
        spectral_col = None
        intensity_col = sample
        # Find corresponding spectral count column
        for col in protein_df.columns:
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col:
                if 'spectral count' in col.lower():
                    spectral_col = col
                    break
        if spectral_col and intensity_col in protein_df.columns:
            has_spectral = protein_df[spectral_col] > 0
            has_intensity = protein_df[intensity_col] > 0
            msms_only = ((has_spectral) & (~has_intensity)).sum()
            mbr_only = ((~has_spectral) & (has_intensity)).sum()
            both = ((has_spectral) & (has_intensity)).sum()
            mbr_stats[sample]['proteins'] = {
                'msms_only': int(msms_only),
                'mbr_only': int(mbr_only),
                'both': int(both)
            }
return mbr_stats
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 761 - 793, The code
initializes peptide-level entries in mbr_stats but never fills them; either
remove the 'peptides' initialization or implement peptide-level MBR calculation
analogous to the protein block: when a peptide dataframe (e.g., peptide_df or
peptides_df) is available, locate the matching spectral count column in that
dataframe (similar logic used for protein_df), set intensity_col = sample,
compute has_spectral, has_intensity, then compute msms_only, mbr_only, both and
assign them into mbr_stats[sample]['peptides'] as ints (mirroring the structure
used for mbr_stats[sample]['proteins']).

claude and others added 2 commits January 16, 2026 10:04
…eports

When processing PRIDE datasets containing multiple FragPipe files (e.g.,
psm.tsv and ion.tsv), the service was creating separate reports for each
file. This fix groups FragPipe files by their experiment prefix so that
related files are processed together as a single experiment.

Changes:
- Add FRAGPIPE_FILE_SUFFIXES constant listing all FragPipe file types
- Add group_fragpipe_files() function to group files by experiment prefix
- Modify PRIDE processing loop to process grouped files together
- Files without prefix (psm.tsv, ion.tsv) are grouped as "fragpipe_experiment"
- Files with prefix ({exp}_psm.tsv, {exp}_ion.tsv) are grouped by prefix
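
A minimal sketch of the grouping rule just described, using the FRAGPIPE_FILE_SUFFIXES constant shown elsewhere in this PR; the function body is illustrative, not the service's exact code, and assumes suffixes are checked longest-first so combined_ion.tsv is never mistaken for ion.tsv:

import os

FRAGPIPE_FILE_SUFFIXES = [
    "combined_modified_peptide.tsv",
    "combined_peptide.tsv",
    "combined_protein.tsv",
    "combined_ion.tsv",
    "peptide.tsv",
    "protein.tsv",
    "psm.tsv",
    "ion.tsv",
]

def group_fragpipe_files(paths):
    groups = {}
    for path in paths:
        name = os.path.basename(path)
        for suffix in FRAGPIPE_FILE_SUFFIXES:
            if name.endswith(suffix):
                # Strip the suffix; what remains (minus the joining "_")
                # is the experiment prefix, if any.
                prefix = name[: -len(suffix)].rstrip("_")
                key = prefix or "fragpipe_experiment"
                groups.setdefault(key, []).append(path)
                break
    return groups

print(group_fragpipe_files(["psm.tsv", "ion.tsv", "expA_psm.tsv", "expA_ion.tsv"]))
# {'fragpipe_experiment': ['psm.tsv', 'ion.tsv'], 'expA': ['expA_psm.tsv', 'expA_ion.tsv']}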
…s-Aq8jP

Fix duplicate reports for ion and psm files
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pmultiqc_service/app.py`:
- Around line 838-840: The current loop uses the filename stem (file_name =
os.path.splitext(os.path.basename(file_path))[0]) as the group key and assigns
groups[file_name] = [file_path], which overwrites prior entries with the same
stem; change the logic in the loop that processes non_fragpipe_files so it uses
the full filename (e.g., os.path.basename(file_path)) as the key or, if you must
keep the stem, check if groups already has that key and append file_path to the
existing list instead of replacing it (i.e., if key in groups:
groups[key].append(file_path) else: groups[key] = [file_path]); update the block
that sets file_name and assigns into groups accordingly.
🧹 Nitpick comments (1)
pmultiqc_service/app.py (1)

768-780: Group FragPipe config/manifest files with the experiment (if they’re downloaded).

Line 768–780 only covers TSV outputs. If fragpipe.workflow, fragger.params, or fp-manifest are part of the download set, they’ll be treated as standalone groups and won’t be present when the FragPipe group runs, so parameter parsing can be skipped. Consider adding these suffixes (and, if they’re global, copying them into each group).

♻️ Possible tweak
 FRAGPIPE_FILE_SUFFIXES = [
+    "fragpipe.workflow",
+    "fragger.params",
+    "fp-manifest",
     "combined_modified_peptide.tsv",
     "combined_peptide.tsv",
     "combined_protein.tsv",
     "combined_ion.tsv",
     "peptide.tsv",
     "protein.tsv",
     "psm.tsv",
     "ion.tsv",
 ]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1819a98 and 3f7be83.

📒 Files selected for processing (1)
  • pmultiqc_service/app.py
🧰 Additional context used
🪛 Ruff (0.14.11)
pmultiqc_service/app.py

1421-1421: Do not catch blind exception: Exception

(BLE001)


1461-1461: Do not catch blind exception: Exception

(BLE001)


1462-1462: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: test_maxquant_dia
  • GitHub Check: test_maxquant
  • GitHub Check: test_diann
  • GitHub Check: test_fragpipe
  • GitHub Check: test_dia
  • GitHub Check: test_mzid_mzML
  • GitHub Check: test_tmt
  • GitHub Check: test_mzid_mgf
  • GitHub Check: test_lfq
  • GitHub Check: test_proteobench
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (1)
pmultiqc_service/app.py (1)

1351-1457: Group-based processing flow looks solid.

The per-group extraction, input-type detection, and progress reporting in Lines 1351–1457 align well with the new grouping strategy.


Comment on lines +838 to +840
for file_path in non_fragpipe_files:
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    groups[file_name] = [file_path]

⚠️ Potential issue | 🟡 Minor

Avoid overwriting non‑FragPipe groups with the same stem.

Line 839–840 uses the filename stem as the group key and overwrites any prior entry; files like sample.mgf and sample.mzML would collapse into one group, dropping one file. Use the full filename and append when the key already exists.

🐛 Proposed fix
-    for file_path in non_fragpipe_files:
-        file_name = os.path.splitext(os.path.basename(file_path))[0]
-        groups[file_name] = [file_path]
+    for file_path in non_fragpipe_files:
+        file_name = os.path.basename(file_path)
+        groups.setdefault(file_name, []).append(file_path)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

for file_path in non_fragpipe_files:
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    groups[file_name] = [file_path]

After:

for file_path in non_fragpipe_files:
    file_name = os.path.basename(file_path)
    groups.setdefault(file_name, []).append(file_path)
🤖 Prompt for AI Agents
In `@pmultiqc_service/app.py` around lines 838 - 840, The current loop uses the
filename stem (file_name = os.path.splitext(os.path.basename(file_path))[0]) as
the group key and assigns groups[file_name] = [file_path], which overwrites
prior entries with the same stem; change the logic in the loop that processes
non_fragpipe_files so it uses the full filename (e.g.,
os.path.basename(file_path)) as the key or, if you must keep the stem, check if
groups already has that key and append file_path to the existing list instead of
replacing it (i.e., if key in groups: groups[key].append(file_path) else:
groups[key] = [file_path]); update the block that sets file_name and assigns
into groups accordingly.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
pmultiqc/modules/common/file_utils.py (2)

156-157: Missing exception handling for rarfile errors.

The exception handler doesn't catch rarfile.BadRarFile or rarfile.Error, which would be raised for corrupt or invalid RAR files.

Proposed fix
-    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, OSError, ValueError) as e:
+    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, rarfile.BadRarFile, rarfile.Error, OSError, ValueError) as e:

166-168: Missing .rar in archive types list.

The is_archive_file function doesn't include .rar in the recognized archive types, creating an inconsistency with extract_archive_file which now handles RAR files.

Proposed fix
 def is_archive_file(file_path: str) -> bool:
-    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2"]
+    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".rar"]
     return any(file_path.lower().endswith(arch_type) for arch_type in archive_types)
pmultiqc/modules/maxquant/maxquant_plots.py (1)

446-479: Guard unknown data_type to avoid UnboundLocalError.

description_text / help_text are only set in the two explicit branches, so any other value (including case mismatches) will crash at add_sub_section. Initialize defaults and normalize data_type defensively.

🛠️ Proposed fix
-    if data_type == "maxquant":
+    data_type_norm = (data_type or "").lower()
+    description_text = ""
+    help_text = ""
+    if data_type_norm == "maxquant":
         description_text = """
             [Excludes Contaminants] Number of unique (i.e. not counted twice) peptide sequences including modifications (after FDR) per Raw file.
             """
         help_text="""
             If MBR was enabled, three categories ('Genuine (Exclusive)', 'Genuine + Transferred', 'Transferred (Exclusive)'
             are shown, so the user can judge the gain that MBR provides.
             ...
             """
-    elif data_type == "fragpipe":
+    elif data_type_norm == "fragpipe":
         description_text = """
             combined_peptide.tsv
             """
         help_text = """
             combined_peptide.tsv
             """
+    else:
+        description_text = "[Excludes Contaminants] Number of unique peptide sequences per file."
+        help_text = "Counts after FDR filtering."
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/common/file_utils.py`:
- Around line 128-131: The extract_rar function currently calls
rarfile.RarFile(...).extractall without the same security checks as
extract_zip/extract_tar; update extract_rar to mirror those protections by
iterating through rf.infolist()/rf.namelist(), computing the resolved target
path for each member, and ensuring the resolved path is inside the provided
root_dir (preventing path traversal), disallowing/sanitizing absolute paths and
symlinks, and enforcing archive-bomb limits (e.g., max total uncompressed bytes
and max file count) before extraction; if any check fails, raise an exception
and do not extract. Use the same helper logic/function names used by
extract_zip/extract_tar (reuse or extract shared helpers) to keep behavior
consistent.

In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 846-870: validate_columns_existence currently mutates the shared
REQUIRED_KEYWORDS entry for data_name which causes stale state across calls; to
fix, create a fresh per-call copy of the required keywords at the start of
validate_columns_existence (e.g., local_required =
REQUIRED_KEYWORDS[data_name].copy() or use dict(...) / deepcopy as appropriate)
and then use local_required for all mutations and the final
all(local_required.values()) and reporting, leaving the global REQUIRED_KEYWORDS
unchanged.
- Around line 930-969: In cal_peptide_id_gain, guard against division by zero
when computing count_df["MBRgain"]: replace the current raw division using
count_df["mbr"] / count_df["ms/ms_count"] with a safe computation that avoids
inf/NaN (e.g., compute per-row MBRgain only when "ms/ms_count" > 0, otherwise
set NaN or 0), then sanitize infinite values (np.inf) to NaN and compute
mbr_gain using .dropna().mean() before rounding; finally build title_value only
when mbr_gain is a finite number. Reference: function cal_peptide_id_gain,
DataFrame count_df, column "ms/ms_count", "mbr", and the computed "MBRgain".
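
A sketch of the guarded computation that prompt asks for, assuming count_df carries the "mbr" and "ms/ms_count" columns named there; the title_value format is illustrative:

import numpy as np

# Replace zero denominators with NaN so the division yields NaN, not inf,
# and no divide-by-zero warning is raised.
count_df["MBRgain"] = count_df["mbr"] / count_df["ms/ms_count"].replace(0, np.nan)
# Sanitize any residual infinities, then average over valid rows only.
count_df["MBRgain"] = count_df["MBRgain"].replace([np.inf, -np.inf], np.nan)
mbr_gain = count_df["MBRgain"].dropna().mean()

# Build the display value only when the mean is a finite number
# (mean() of an empty series is NaN).
title_value = f"{round(mbr_gain, 2)}" if np.isfinite(mbr_gain) else None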
♻️ Duplicate comments (1)
pmultiqc/modules/fragpipe/fragpipe_io.py (1)

769-799: Peptide-level MBR stats are initialized but never populated.

mbr_stats[sample]["peptides"] stays zeroed because no peptide-level calculation exists. Either compute it or drop the field to avoid misleading output.

🧹 Minimal fix (remove unused peptides key)
-        mbr_stats[sample] = {
-            'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
-            'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
-        }
+        mbr_stats[sample] = {
+            'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
+        }
🧹 Nitpick comments (1)
.github/workflows/python-app.yml (1)

272-281: Code changes are correct; consider documenting the unrar dependency for local users.

The workflow correctly installs unrar and the code uses it properly. However, rarfile requires an external decompressor (unrar, unar, 7z, or bsdtar) to be installed on the system. While the CI handles this, users installing locally without this dependency will encounter errors. The current setup instructions in CONTRIBUTING.md don't mention this external tool requirement—adding a note would improve the developer experience.

Comment on lines 846 to 870
def validate_columns_existence(df_columns, data_name: str):

    col_list = list(df_columns)

    required_keywords = REQUIRED_KEYWORDS[data_name]

    for col in col_list:
        if "Spectral Count" in col:
            required_keywords["Spectral Count"] = True
        if "Match Type" in col:
            required_keywords["Match Type"] = True
        if "Intensity" in col:
            if "MaxLFQ" not in col:
                required_keywords["Intensity"] = True

    all_passed = all(required_keywords.values())

    print(f"Check whether the data {data_name} meets the extraction requirements.")

    for key, found in required_keywords.items():
        status = "exists" if found else "is missing"
        print(f"{key}: {status}")

    return all_passed

⚠️ Potential issue | 🟠 Major

Reset REQUIRED_KEYWORDS per call to avoid stale validation.

validate_columns_existence mutates the shared REQUIRED_KEYWORDS dict, so later calls can incorrectly pass even if required columns are missing. Use a fresh dict per call.

🛠️ Proposed fix
-    required_keywords = REQUIRED_KEYWORDS[data_name]
+    required_keywords = {k: False for k in REQUIRED_KEYWORDS[data_name].keys()}
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 846 - 870,
validate_columns_existence currently mutates the shared REQUIRED_KEYWORDS entry
for data_name which causes stale state across calls; to fix, create a fresh
per-call copy of the required keywords at the start of
validate_columns_existence (e.g., local_required =
REQUIRED_KEYWORDS[data_name].copy() or use dict(...) / deepcopy as appropriate)
and then use local_required for all mutations and the final
all(local_required.values()) and reporting, leaving the global REQUIRED_KEYWORDS
unchanged.

claude and others added 2 commits January 18, 2026 07:53
The extract_rar function was missing security validations present in
extract_zip and extract_tar. This adds:
- Path traversal detection (../ patterns, absolute paths)
- Archive bomb detection (compression ratio > 100:1 or > 10GB uncompressed)
- Proper type hints and logging
- rarfile.Error to the exception handling in extract_archive_file
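
A minimal sketch of those checks using the rarfile package; the limits and messages below are illustrative and may differ from the repo's actual implementation:

import os
import rarfile

MAX_UNCOMPRESSED = 10 * 1024 ** 3  # 10 GB cap on total uncompressed size
MAX_RATIO = 100                    # reject > 100:1 compression ratios

def extract_rar(file_path: str, dest_dir: str) -> None:
    with rarfile.RarFile(file_path) as rf:
        total = 0
        for info in rf.infolist():
            name = info.filename
            # Path traversal: reject absolute paths and ../ components
            if os.path.isabs(name) or ".." in name.replace("\\", "/").split("/"):
                raise ValueError(f"Unsafe path in archive: {name}")
            total += info.file_size or 0
            # Archive bomb: implausible per-file compression ratio
            if info.compress_size and (info.file_size or 0) / info.compress_size > MAX_RATIO:
                raise ValueError(f"Suspicious compression ratio: {name}")
        if total > MAX_UNCOMPRESSED:
            raise ValueError("Archive too large when uncompressed")
        rf.extractall(dest_dir)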
…y-BCoTR

Fix security issue in RAR file extraction
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pmultiqc/modules/common/file_utils.py (1)

201-203: Missing .rar in archive type detection.

The is_archive_file function doesn't include .rar in archive_types, but extract_archive_file supports RAR extraction. When is_archive_file is used in main.py:66 to determine whether to extract submitted files, RAR archives will not be detected and won't be extracted.

🔧 Proposed fix
 def is_archive_file(file_path: str) -> bool:
-    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2"]
+    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".rar"]
     return any(file_path.lower().endswith(arch_type) for arch_type in archive_types)

@codecov codecov bot commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 5.62771% with 436 lines in your changes missing coverage. Please review.
✅ Project coverage is 8.39%. Comparing base (a5ab34a) to head (11bd7ae).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pmultiqc/modules/fragpipe/fragpipe_io.py | 4.82% | 276 Missing ⚠️ |
| pmultiqc/modules/fragpipe/fragpipe.py | 6.80% | 137 Missing ⚠️ |
| pmultiqc/modules/common/plots/id.py | 10.00% | 9 Missing ⚠️ |
| pmultiqc/modules/maxquant/maxquant_plots.py | 12.50% | 7 Missing ⚠️ |
| pmultiqc/main.py | 0.00% | 6 Missing ⚠️ |
| pmultiqc/modules/common/file_utils.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #544      +/-   ##
========================================
- Coverage   8.63%   8.39%   -0.25%     
========================================
  Files         44      44              
  Lines       5793    6233     +440     
========================================
+ Hits         500     523      +23     
- Misses      5293    5710     +417     


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pmultiqc/modules/common/file_utils.py (1)

127-146: Remove undefined rarfile.Error from exception handling.

Line 145 references rarfile.Error in the except clause, but rarfile is not imported and is not declared as a dependency anywhere in the repository. This causes an undefined name error (F821) and will raise a NameError at runtime. Remove rarfile.Error from the exception tuple unless RAR support is being actively implemented with the dependency properly declared and imported.

-    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, rarfile.Error, OSError, ValueError) as e:
+    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, OSError, ValueError) as e:
         raise ValueError(f"Failed to extract {file_path}: {e}") from e
pmultiqc/modules/fragpipe/fragpipe_io.py (1)

174-189: Convert object-dtype sample intensity columns to numeric in-place.

Line 184 validates that object columns can be converted to numeric but does not apply the conversion, leaving them as strings in the returned dataframe. Downstream at line 231, get_ion_intensity_data() performs intensities[intensities > 0] comparisons, which will raise TypeError or behave unexpectedly on string-typed columns.

Assign the result of pd.to_numeric() back to the dataframe column and verify the conversion produced valid numeric values:

🔧 Proposed fix
-        elif ion_df[col].dtype == object:
-            # Try to convert to numeric
-            try:
-                pd.to_numeric(ion_df[col], errors='raise')
-                valid_sample_cols.append(col)
-            except (ValueError, TypeError):
-                pass
+        elif ion_df[col].dtype == object:
+            # Try to convert to numeric
+            try:
+                ion_df[col] = pd.to_numeric(ion_df[col], errors='coerce')
+                if ion_df[col].notna().any():
+                    valid_sample_cols.append(col)
+            except (ValueError, TypeError):
+                pass
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 675-689: The current check uses pd.to_numeric(sample_df[col],
errors='coerce').notna().all() which skips any column that has a single
non-numeric/missing value; instead coerce to numeric and skip only if the entire
column becomes empty. Replace the all() check with logic that converts the
column to numeric, drops NA (e.g., series_numeric =
pd.to_numeric(sample_df[col], errors='coerce').dropna()), continue only when
series_numeric is empty, and then use series_numeric (not the original
sample_df[col]) to compute intensities and contaminant handling for
sample_distribution and contaminant_distribution so partial missingness does not
remove valid data.
- Around line 576-583: Docstrings for combined_protein_reader,
combined_peptide_reader, and combined_ion_reader are out of sync with their
actual return values: each function returns two items (e.g., (protein_df,
sample_intensity_cols) or (peptide_df, sample_intensity_cols) or (ion_df,
validate_columns)) but the docstrings still describe a third item (mbr_cols /
mbr_info / sample_cols). Update each function's Returns section to list only the
two actual returned items and their meanings (remove
mbr_cols/mbr_info/sample_cols descriptions), explicitly name the returned
variables as used in the code (e.g., protein_df and sample_intensity_cols for
combined_protein_reader; peptide_df and sample_intensity_cols for
combined_peptide_reader; ion_df and validate_columns for combined_ion_reader) so
the documentation matches the implementation and call sites.
♻️ Duplicate comments (2)
pmultiqc/modules/fragpipe/fragpipe_io.py (2)

769-799: Peptide MBR stats are initialized but never computed.

Line 771–798 initializes mbr_stats[sample]['peptides'] but no peptide-level calculation is performed. Either compute peptide MBR stats or remove the placeholder to avoid misleading output.


846-861: Avoid mutating global REQUIRED_KEYWORDS.

Line 850: required_keywords = REQUIRED_KEYWORDS[data_name] mutates the shared dict and can cause stale state across calls (later validations can pass incorrectly). Use a per-call copy.

🔧 Proposed fix
-    required_keywords = REQUIRED_KEYWORDS[data_name]
+    required_keywords = {k: False for k in REQUIRED_KEYWORDS[data_name].keys()}

Comment on lines +576 to +583
    Returns
    -------
    tuple
        (protein_df, sample_intensity_cols, mbr_cols) where:
        - protein_df: DataFrame with protein-level data
        - sample_intensity_cols: List of sample intensity column names
        - mbr_cols: Dict mapping sample to MBR-related columns
    """

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, inspect the file to understand structure
wc -l pmultiqc/modules/fragpipe/fragpipe_io.py

# Find the combined_*_reader functions and their return statements
rg -nP '\bcombined_(protein|peptide|ion)_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -A 5 -B 2

Repository: bigbio/pmultiqc

Length of output: 174


🏁 Script executed:

#!/bin/bash
# Inspect specific lines for docstrings and returns
echo "=== Lines 576-583 (combined_protein_reader docstring) ==="
sed -n '576,583p' pmultiqc/modules/fragpipe/fragpipe_io.py

echo "=== Lines 625-635 (combined_protein_reader return) ==="
sed -n '625,635p' pmultiqc/modules/fragpipe/fragpipe_io.py

echo "=== Lines 706-735 (combined_peptide_reader docstring) ==="
sed -n '706,735p' pmultiqc/modules/fragpipe/fragpipe_io.py

echo "=== Lines 818-844 (combined_ion_reader docstring) ==="
sed -n '818,844p' pmultiqc/modules/fragpipe/fragpipe_io.py

Repository: bigbio/pmultiqc

Length of output: 2513


🏁 Script executed:

#!/bin/bash
# Find all call sites of combined_*_reader functions
echo "=== Call sites ==="
rg -nP '\bcombined_(protein|peptide|ion)_reader\s*\(' pmultiqc/modules/fragpipe/fragpipe_io.py -C 3

Repository: bigbio/pmultiqc

Length of output: 598


🏁 Script executed:

#!/bin/bash
# Search for function calls (not definitions)
echo "=== Calls to combined_protein_reader ==="
rg 'combined_protein_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_protein_reader'

echo "=== Calls to combined_peptide_reader ==="
rg 'combined_peptide_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_peptide_reader'

echo "=== Calls to combined_ion_reader ==="
rg 'combined_ion_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_ion_reader'

Repository: bigbio/pmultiqc

Length of output: 381


🏁 Script executed:

#!/bin/bash
# Search across entire fragpipe module for calls
echo "=== All calls to combined_*_reader ==="
rg 'combined_(protein|peptide|ion)_reader' pmultiqc/modules/fragpipe/ -n | grep -v 'def combined_'

Repository: bigbio/pmultiqc

Length of output: 1096


🏁 Script executed:

#!/bin/bash
# Get the exact return statements for each function
echo "=== combined_protein_reader function definition and return ==="
sed -n '563,630p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20

echo "=== combined_peptide_reader function definition and return ==="
sed -n '694,730p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20

echo "=== combined_ion_reader function definition and return ==="
sed -n '804,845p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20

Repository: bigbio/pmultiqc

Length of output: 2242


🏁 Script executed:

#!/bin/bash
# Check the fragpipe.py file to see the exact unpacking at call sites
echo "=== fragpipe.py line 202 context ==="
sed -n '200,205p' pmultiqc/modules/fragpipe/fragpipe.py

echo "=== fragpipe.py line 219 context ==="
sed -n '217,222p' pmultiqc/modules/fragpipe/fragpipe.py

echo "=== fragpipe.py line 233 context ==="
sed -n '231,236p' pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 1209


Fix docstring return contracts for combined_*_reader functions.

Docstrings incorrectly document return values:

  • combined_protein_reader docstring claims 3 returns (including mbr_cols) but only returns 2
  • combined_peptide_reader docstring claims 3 returns (including mbr_info) but only returns 2
  • combined_ion_reader docstring documents sample_cols but actually returns validate_columns

Call sites in fragpipe.py correctly unpack 2 values, so no runtime errors occur. Update docstrings to match actual returns.
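
For example, the corrected Returns section for combined_protein_reader would read (illustrative docstring text only, matching the two-value return described above):

    Returns
    -------
    tuple
        (protein_df, sample_intensity_cols) where:
        - protein_df: DataFrame with protein-level data
        - sample_intensity_cols: List of sample intensity column names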

🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 576 - 583, Docstrings
for combined_protein_reader, combined_peptide_reader, and combined_ion_reader
are out of sync with their actual return values: each function returns two items
(e.g., (protein_df, sample_intensity_cols) or (peptide_df,
sample_intensity_cols) or (ion_df, validate_columns)) but the docstrings still
describe a third item (mbr_cols / mbr_info / sample_cols). Update each
function's Returns section to list only the two actual returned items and their
meanings (remove mbr_cols/mbr_info/sample_cols descriptions), explicitly name
the returned variables as used in the code (e.g., protein_df and
sample_intensity_cols for combined_protein_reader; peptide_df and
sample_intensity_cols for combined_peptide_reader; ion_df and validate_columns
for combined_ion_reader) so the documentation matches the implementation and
call sites.

Comment on lines +675 to +689
        if not pd.to_numeric(sample_df[col], errors='coerce').notna().all():
            continue

        # Sample intensities
        intensities = sample_df[col].dropna()
        valid_intensities = intensities[intensities > 0]
        if len(valid_intensities) > 0:
            sample_distribution[col] = np.log2(valid_intensities).tolist()

        # Contaminant intensities
        if not contam_df.empty and col in contam_df.columns:
            cont_intensities = contam_df[col].dropna()
            valid_cont = cont_intensities[cont_intensities > 0]
            if len(valid_cont) > 0:
                contaminant_distribution[col] = np.log2(valid_cont).tolist()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if the file exists and read the relevant lines
if [ -f "pmultiqc/modules/fragpipe/fragpipe_io.py" ]; then
    echo "=== File found. Reading lines 670-695 for context ==="
    sed -n '670,695p' pmultiqc/modules/fragpipe/fragpipe_io.py | cat -n
else
    echo "File not found at expected location. Searching for it..."
    find . -name "fragpipe_io.py" -type f
fi

Repository: bigbio/pmultiqc

Length of output: 1172


Avoid dropping entire columns when any single intensity value is missing.

Line 675–676: The check notna().all() skips columns if any value is missing or non-numeric. This is overly strict and silently removes valid samples from QC plots—common in real proteomics datasets with partial missingness. Only skip columns that are completely empty after numeric coercion.

🔧 Proposed fix
-        if not pd.to_numeric(sample_df[col], errors='coerce').notna().all():
-            continue
-
-        # Sample intensities
-        intensities = sample_df[col].dropna()
+        values = pd.to_numeric(sample_df[col], errors='coerce')
+        if values.notna().sum() == 0:
+            continue
+
+        # Sample intensities
+        intensities = values.dropna()
         valid_intensities = intensities[intensities > 0]
         if len(valid_intensities) > 0:
             sample_distribution[col] = np.log2(valid_intensities).tolist()
@@
-            cont_intensities = contam_df[col].dropna()
+            cont_intensities = pd.to_numeric(contam_df[col], errors='coerce').dropna()
             valid_cont = cont_intensities[cont_intensities > 0]
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 675 - 689, The current
check uses pd.to_numeric(sample_df[col], errors='coerce').notna().all() which
skips any column that has a single non-numeric/missing value; instead coerce to
numeric and skip only if the entire column becomes empty. Replace the all()
check with logic that converts the column to numeric, drops NA (e.g.,
series_numeric = pd.to_numeric(sample_df[col], errors='coerce').dropna()),
continue only when series_numeric is empty, and then use series_numeric (not the
original sample_df[col]) to compute intensities and contaminant handling for
sample_distribution and contaminant_distribution so partial missingness does not
remove valid data.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
pmultiqc/modules/fragpipe/fragpipe.py (4)

161-185: Consider catching more specific exceptions.

The generic Exception catches work for graceful fallback, but catching specific exceptions (e.g., FileNotFoundError, ValueError, KeyError) would improve debuggability and avoid masking unexpected errors.

♻️ Example for workflow parsing
         if self.fragpipe_files.get("workflow"):
             try:
                 workflow_path = self.fragpipe_files["workflow"][0]
                 self.parameters = workflow_reader(workflow_path)
                 if self.parameters:
                     log.info("Workflow parameters loaded successfully.")
-            except Exception as e:
+            except (FileNotFoundError, IOError, ValueError) as e:
                 log.warning(f"Error parsing workflow file: {e}")

198-213: Use defensive access for contaminant_affix config.

Direct dictionary access with config.kwargs["contaminant_affix"] raises KeyError when the key is absent. Consider using .get() with a default value for robustness.

♻️ Proposed fix
                 if self.protein_df is not None and self.protein_intensity_cols:
-                    contam_affix = config.kwargs["contaminant_affix"]
+                    contam_affix = config.kwargs.get("contaminant_affix", "CONT")
                     (
                         self.protein_intensity_distribution,
                         self.protein_contam_distribution
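Note that "CONT" in the suggested default is a placeholder for illustration; the project's actual contaminant affix convention should be confirmed before adopting it.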

1152-1165: Fix implicit Optional type hint.

Per PEP 484, use explicit Optional[dict] or dict | None instead of dict = None.

♻️ Proposed fix
     @staticmethod
-    def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict = None):
+    def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict | None = None):
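The X | Y union syntax requires Python 3.10+; on older interpreters, Optional[dict] from the typing module expresses the same annotation.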

1241-1251: Use _ for unused loop variable.

The loop variable sample at line 1243 is not used in the loop body.

♻️ Proposed fix
         has_data = False
-        for sample, stats in mbr_stats.items():
+        for _, stats in mbr_stats.items():
             proteins = stats.get('proteins', {})
             if proteins.get('mbr_only', 0) > 0 or proteins.get('both', 0) > 0:
                 has_data = True
                 break
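Alternatively, iterating with for stats in mbr_stats.values(): avoids unpacking the unused key altogether.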

Comment on lines +215 to +227
        # Parse combined_peptide.tsv
        if self.fragpipe_files.get("combined_peptide"):
            try:
                peptide_path = self.fragpipe_files["combined_peptide"][0]
                peptide_df, combined_peptide_column_valid = combined_peptide_reader(peptide_path)
                log.info("Combined peptide data loaded successfully.")

                if peptide_df is not None and combined_peptide_column_valid:
                    self.peptide_id_gain = cal_peptide_id_gain(peptide_df)
                    log.info("peptide statics loaded successfully.")

            except Exception as e:
                log.warning(f"Error parsing combined_peptide.tsv: {e}")

⚠️ Potential issue | 🟡 Minor

Fix log message placement and typo.

The success log at line 220 fires before validation, and line 224 has a typo ("statics" → "statistics").

🐛 Proposed fix
         if self.fragpipe_files.get("combined_peptide"):
             try:
                 peptide_path = self.fragpipe_files["combined_peptide"][0]
                 peptide_df, combined_peptide_column_valid = combined_peptide_reader(peptide_path)
-                log.info("Combined peptide data loaded successfully.")

                 if peptide_df is not None and combined_peptide_column_valid:
+                    log.info("Combined peptide data loaded successfully.")
                     self.peptide_id_gain = cal_peptide_id_gain(peptide_df)
-                    log.info("peptide statics loaded successfully.")
+                    log.info("Peptide statistics loaded successfully.")

             except Exception as e:
                 log.warning(f"Error parsing combined_peptide.tsv: {e}")
🧰 Tools
🪛 Ruff (0.14.13)

226-226: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 215 - 227, the success
log is printed before validating the parsed DataFrame and contains a typo; move
the "Combined peptide data loaded successfully." log so it runs only after
confirming peptide_df is not None and combined_peptide_column_valid, change the
other log message in the same block from "peptide statics loaded successfully."
to "peptide statistics loaded successfully.", and ensure these messages are
emitted around the combined_peptide_reader call and the cal_peptide_id_gain
assignment (referencing combined_peptide_reader, peptide_df,
combined_peptide_column_valid, cal_peptide_id_gain, and self.peptide_id_gain)
while keeping the existing exception log (log.warning) intact.

Comment on lines +229 to +240
        # Parse combined_ion.tsv for MS/MS counts
        if self.fragpipe_files.get("combined_ion"):
            try:
                combined_ion_path = self.fragpipe_files["combined_ion"][0]
                combined_ion_df, combined_ion_column_valid = combined_ion_reader(combined_ion_path)
                log.info("Combined ion data loaded successfully.")

                if combined_ion_df is not None and combined_ion_column_valid:
                    self.msms_counts = get_msms_counts_per_peak(combined_ion_df)
                    log.info("MS/MS counts per peak loaded successfully.")
            except Exception as e:
                log.warning(f"Error parsing combined_ion.tsv: {e}")

⚠️ Potential issue | 🟡 Minor

Log message fires before validation check.

Similar to the peptide section, the success log at line 234 fires before the validity check at line 236.

🐛 Proposed fix
         if self.fragpipe_files.get("combined_ion"):
             try:
                 combined_ion_path = self.fragpipe_files["combined_ion"][0]
                 combined_ion_df, combined_ion_column_valid = combined_ion_reader(combined_ion_path)
-                log.info("Combined ion data loaded successfully.")

                 if combined_ion_df is not None and combined_ion_column_valid:
+                    log.info("Combined ion data loaded successfully.")
                     self.msms_counts = get_msms_counts_per_peak(combined_ion_df)
                     log.info("MS/MS counts per peak loaded successfully.")
🧰 Tools
🪛 Ruff (0.14.13)

239-239: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 229 - 240, the success
log "Combined ion data loaded successfully." is emitted before you verify
combined_ion_df and combined_ion_column_valid; move or change the log so it only
runs after the validation check. In the try block call
combined_ion_reader(combined_ion_path) as you do, then if combined_ion_df is not
None and combined_ion_column_valid set self.msms_counts via
get_msms_counts_per_peak and call log.info("Combined ion data loaded
successfully.") there; otherwise emit a warning or skip logging. Update
references to combined_ion_reader, combined_ion_df, combined_ion_column_valid,
get_msms_counts_per_peak, self.msms_counts, and log.info accordingly.

Comment on lines +390 to +404
        # MBR (Match-Between-Runs) visualization
        # if self.mbr_stats:
        #     self.draw_mbr_contribution(
        #         sub_section=self.sub_sections["identification"],
        #         mbr_stats=self.mbr_stats
        #     )

        # MS/MS counts per peak from combined_ion.tsv
        if self.msms_counts:
            draw_oversampling(
                self.sub_sections["ms2"],
                self.msms_counts,
                "",
                "fragpipe",
            )

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "mbr_stats" --type py

Repository: bigbio/pmultiqc

Length of output: 1291


🏁 Script executed:

# Check the fragpipe.py file for imports and get_mbr_stats usage
rg -n "get_mbr_stats" --type py

Repository: bigbio/pmultiqc

Length of output: 219


🏁 Script executed:

# Get the fragpipe.py file to check imports at the top
head -50 pmultiqc/modules/fragpipe/fragpipe.py | cat -n

Repository: bigbio/pmultiqc

Length of output: 1679


🏁 Script executed:

# Check the get_data method in fragpipe.py
rg -A 100 "def get_data" pmultiqc/modules/fragpipe/fragpipe.py | head -150

Repository: bigbio/pmultiqc

Length of output: 4553


🏁 Script executed:

# Check if get_mbr_stats is called anywhere (not just imported)
rg -B 2 -A 5 "get_mbr_stats\(" pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 41


🏁 Script executed:

# Search for self.mbr_stats assignment
rg -n "self\.mbr_stats\s*=" pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 41


🏁 Script executed:

# Get more of the get_data method to see if mbr_stats is populated
sed -n '130,250p' pmultiqc/modules/fragpipe/fragpipe.py | cat -n

Repository: bigbio/pmultiqc

Length of output: 6305


Remove unused MBR import or implement missing data loading.

The get_mbr_stats function is imported but never called, and self.mbr_stats is never populated in get_data(). The commented code at lines 391–395 references this undefined attribute. Either remove the dead import and commented visualization code, or complete the implementation by calling get_mbr_stats() with the protein and peptide dataframes to populate self.mbr_stats.
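
If the "implement" route is chosen, a minimal sketch of the wiring in get_data(), assuming get_mbr_stats accepts the protein and peptide DataFrames parsed earlier (variable names are illustrative and the exact signature is not confirmed here):

        # Hypothetical: mirrors the graceful-fallback pattern used by the other parsers
        if self.protein_df is not None and peptide_df is not None:
            try:
                self.mbr_stats = get_mbr_stats(self.protein_df, peptide_df)
                if self.mbr_stats:
                    log.info("MBR statistics computed successfully.")
            except Exception as e:
                log.warning(f"Error computing MBR statistics: {e}")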

🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 390 - 404, either remove
the unused get_mbr_stats import and the commented draw_mbr_contribution block,
or populate self.mbr_stats in get_data() by calling get_mbr_stats(...) with the
appropriate protein/peptide DataFrames and then enable the visualization;
specifically, locate the get_data() method in fragpipe.py and set self.mbr_stats
= get_mbr_stats(protein_df, peptide_df) (or similar variable names used there),
ensure get_mbr_stats is imported, and then uncomment/restore the
draw_mbr_contribution(...) call that references self.mbr_stats and
draw_mbr_contribution/sub_sections["identification"] so the MBR visualization
has valid data.

@ypriverol ypriverol merged commit 5fd836f into main Jan 18, 2026
22 of 23 checks passed