
Improving FragPipe #544

Merged
ypriverol merged 19 commits into main from dev
Jan 18, 2026

Conversation


@ypriverol ypriverol commented Jan 16, 2026

Pull Request

Description

Adds comprehensive FragPipe QC support: parameter tables parsed from fragpipe.workflow and fragger.params, an experiment design table from the fp-manifest file, protein intensity distributions with contaminant tracking, MS/MS counts per peak, and experiment-aware grouping of FragPipe files in the service layer.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Test addition/update
  • Dependency updates

Summary by CodeRabbit

  • New Features

    • Significantly expanded FragPipe module with new visualizations: parameters table, experiment design table, protein intensity distribution with contaminant tracking, and MS/MS counts per peak analysis.
    • Automatic intelligent grouping of FragPipe output files by experiment for streamlined processing.
  • Documentation

    • Updated example reports with new FragPipe reference dataset.


claude and others added 3 commits January 16, 2026 07:18
This commit addresses issue #541 by adding comprehensive FragPipe QC support
similar to MaxQuant. New features include:

**Parameters Section:**
- Parse fragpipe.workflow file for configuration parameters
- Display key settings (enzyme, mass tolerances, modifications, MBR, etc.)
- Added search pattern for *.workflow files

**Experiment Design:**
- Parse fp-manifest file for experiment design information
- Display file names, experiments, bioreplicates, and data types
- Added search pattern for *.fp-manifest files

**Protein Intensity Distribution:**
- Parse combined_protein.tsv for protein-level quantification
- Generate boxplot of log2-transformed protein intensities
- Separate visualization for sample vs contaminant proteins

**Match-Between-Runs (MBR) Visualization:**
- Analyze combined_protein.tsv and combined_peptide.tsv for MBR contribution
- Display stacked bar chart showing MS/MS only, MBR only, and both categories
- Helps assess MBR quality and potential false positives

**MS/MS Counts Per Peak:**
- Parse combined_ion.tsv for MS/MS spectral counts
- Generate boxplot of spectral count distribution per sample
- Alternative bar chart view for identification status

All new features use graceful fallback - missing files don't prevent report
generation. This follows the same pattern used by MaxQuant and DIA-NN modules.

Resolves: #541
Add parsing of MSFragger's fragger.params file for search engine parameters.
This provides additional QC information and serves as a fallback when
fragpipe.workflow is not available.

Changes:
- Add fragger_params_reader() function to parse fragger.params files
- Update get_workflow_parameters_table() to merge parameters from both
  workflow and fragger.params files (workflow takes precedence)
- Add search pattern for *.params files
- Add additional MSFragger-specific parameters to the parameters table:
  - Number of Threads
  - Decoy Prefix
  - Isotope Error
  - Mass Offsets
  - Precursor True Tolerance/Units
  - Calibrate Mass
  - Clip N-term Met
  - Min/Max Peptide Length

The implementation follows the same graceful fallback pattern - if either
file is available, parameters will be displayed. If both are available,
workflow parameters take precedence with fragger.params providing additional
information.
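
A minimal sketch of that precedence rule, assuming both sources are parsed into plain dicts; merge_parameters and the example keys are illustrative, not the module's actual API:

def merge_parameters(workflow_params: dict | None, fragger_params: dict | None) -> dict:
    # Start from fragger.params, then overlay workflow values so the
    # workflow file wins on any shared key while fragger.params still
    # contributes keys the workflow lacks.
    merged = dict(fragger_params or {})
    merged.update(workflow_params or {})
    return merged

# The shared "enzyme" key comes from the workflow; "num_threads" survives
# from fragger.params.
print(merge_parameters({"enzyme": "trypsin"},
                       {"enzyme": "lysc", "num_threads": "8"}))
# {'enzyme': 'trypsin', 'num_threads': '8'}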
…ALIA

Improve FragPipe support and file handling
@coderabbitai coderabbitai bot (Contributor) commented Jan 16, 2026

📝 Walkthrough

This PR substantially expands FragPipe integration by introducing I/O handlers for workflow parameters, manifest files, and combined protein/peptide/ion outputs. It adds experiment-aware file grouping in the service layer, refactors shared plotting functions to accept data_type parameters, and updates test/documentation datasets from PXD066146 to PXD062399.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| FragPipe Module Expansion: pmultiqc/modules/fragpipe/fragpipe.py, pmultiqc/modules/fragpipe/fragpipe_io.py | Added 12+ new I/O reader functions (workflow_reader, fragger_params_reader, manifest_reader, combined_protein/peptide/ion_reader, etc.); new data processing helpers (get_workflow_parameters_table, get_experiment_design_table, get_protein_intensity_distribution, get_msms_counts_per_peak, cal_peptide_id_gain); 4 new static plotting methods (draw_parameters, draw_experiment_design, draw_protein_intensity_distribution, draw_mbr_contribution); extended FragPipeModule to load and render workflow params, manifest, and protein/peptide/ion data. |
| Configuration & Service Layer: pmultiqc/main.py, pmultiqc_service/app.py | Added default MultiQC config entries for FragPipe file patterns (*.workflow, *.fp-manifest, *.params); introduced the FRAGPIPE_FILE_SUFFIXES constant and group_fragpipe_files() function to enable experiment-aware file grouping and batch processing in PRIDE job handling. |
| Plotting Infrastructure Refactor: pmultiqc/modules/common/plots/id.py, pmultiqc/modules/maxquant/maxquant.py, pmultiqc/modules/maxquant/maxquant_plots.py, pmultiqc/modules/mzidentml/mzidentml.py, pmultiqc/modules/quantms/quantms.py | Refactored draw_identification to accept an optional draw_peptide_id_count flag; replaced the is_maxquant boolean with a data_type: str parameter in draw_oversampling; updated the draw_evidence_peptide_id_count signature to accept data_type and render conditional description text; propagated the data_type argument across the maxquant, mzidentml, and quantms modules. |
| Test Data & Documentation: .github/workflows/python-app.yml, docs/config.json, docs/README.md | Updated the FragPipe test workflow to download and extract PXD062399.zip instead of individual TSV files; updated example dataset references from PXD066146 to PXD062399 in config and README. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Service as pmultiqc_service.app
    participant Discovery as File Discovery
    participant Readers as fragpipe_io Readers
    participant Module as FragPipeModule
    participant Plots as Plotting Functions

    User->>Service: Submit PRIDE job (FragPipe files)
    Service->>Discovery: Scan downloaded files
    Discovery-->>Service: Return fragpipe files list
    Service->>Service: group_fragpipe_files() by experiment
    
    loop For each experiment group
        Service->>Readers: Load workflow, manifest, combined_*
        Readers->>Readers: Parse/validate files
        Readers-->>Module: Return structured data
        
        Module->>Readers: Process workflow + fragger params
        Readers-->>Module: parameters_table
        
        Module->>Readers: Extract manifest
        Readers-->>Module: experiment_design_table
        
        Module->>Readers: Load combined_protein/peptide/ion
        Readers-->>Module: protein_df, peptide_df, ion_df
        Module->>Module: Compute distributions & metrics
        
        Module->>Plots: Render parameters, experiment design
        Module->>Plots: Render protein intensity distribution
        Module->>Plots: Render MS/MS counts (via draw_oversampling)
        Plots-->>User: MultiQC report with FragPipe sections
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Suggested labels

Review effort 4/5, feature, fragpipe, service-layer

Suggested reviewers

  • yueqixuan
  • daichengxin

Poem

🐰 FragPipe flows now through our pipes so clear,
With manifests and proteins dancing here,
Experiment designs in tables bright,
MS/MS counts plotting through the night,
A rabbit hops through peptides with delight! ✨

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 62.16%, below the required threshold of 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Title check | ❓ Inconclusive | The title 'Improving FragPipe' is vague and non-descriptive; it doesn't convey the main changes (comprehensive QC support, file parsing, visualization features). | Use a more specific title, such as 'Add comprehensive FragPipe QC visualizations and MSFragger parameter parsing' or 'Enhance FragPipe integration with combined output analysis'. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped: CodeRabbit's high-level summary is enabled. |


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 761-793: The code initializes peptide-level entries in mbr_stats
but never fills them; either remove the 'peptides' initialization or implement
peptide-level MBR calculation analogous to the protein block: when a peptide
dataframe (e.g., peptide_df or peptides_df) is available, locate the matching
spectral count column in that dataframe (similar logic used for protein_df), set
intensity_col = sample, compute has_spectral, has_intensity, then compute
msms_only, mbr_only, both and assign them into mbr_stats[sample]['peptides'] as
ints (mirroring the structure used for mbr_stats[sample]['proteins']).
🧹 Nitpick comments (6)
pmultiqc/modules/fragpipe/fragpipe_io.py (3)

66-77: Consider using exact match for psm.tsv for consistency.

Lines 68-75 use exact filename matching for ion.tsv, combined_ion.tsv, combined_protein.tsv, and combined_peptide.tsv. However, line 76 uses "psm" in filename which could unintentionally match files like mypsm.tsv or psm_backup.tsv.

♻️ Suggested fix for consistent matching
         elif filename == "combined_peptide.tsv":
             fragpipe_files["combined_peptide"].append(full_path)
-        elif "psm" in filename:
+        elif filename == "psm.tsv":
             fragpipe_files["psm"].append(full_path)

318-353: Consider extracting common key-value parsing logic.

fragger_params_reader and workflow_reader share nearly identical parsing logic. Consider extracting a shared helper function.

♻️ Optional refactor to reduce duplication
def _parse_key_value_file(file_path: str, file_type: str) -> dict | None:
    """Parse a key=value format file."""
    parameters = {}
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                if '=' in line:
                    key, value = line.split('=', 1)
                    parameters[key.strip()] = value.strip()
    except Exception as e:
        log.warning(f"Error reading {file_type} file {file_path}: {e}")
        return None
    log.info(f"Loaded {len(parameters)} parameters from {file_type} file")
    return parameters

def workflow_reader(file_path: str):
    return _parse_key_value_file(file_path, "workflow")

def fragger_params_reader(file_path: str):
    return _parse_key_value_file(file_path, "fragger.params")

356-361: Use explicit Optional or union type for optional parameter.

Per PEP 484, use explicit Optional[dict] or dict | None instead of dict = None.

♻️ Suggested fix
-def get_workflow_parameters_table(parameters: dict, fragger_params: dict = None):
+def get_workflow_parameters_table(parameters: dict, fragger_params: dict | None = None):
pmultiqc/modules/fragpipe/fragpipe.py (3)

202-225: Prefix unused unpacked variables with underscore.

mbr_cols (line 205) and mbr_info (line 222) are unpacked but never used. Prefix with underscore to indicate intentional non-use.

♻️ Suggested fix
-                self.protein_df, self.protein_intensity_cols, mbr_cols = combined_protein_reader(protein_path)
+                self.protein_df, self.protein_intensity_cols, _mbr_cols = combined_protein_reader(protein_path)
-                self.peptide_df, self.peptide_sample_cols, mbr_info = combined_peptide_reader(peptide_path)
+                self.peptide_df, self.peptide_sample_cols, _mbr_info = combined_peptide_reader(peptide_path)

1123-1127: Use explicit Optional or union type for optional parameter.

Per PEP 484, use explicit type annotation for optional parameters.

♻️ Suggested fix
     @staticmethod
-    def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict = None):
+    def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict | None = None):

1212-1222: Consider using any() for cleaner validation.

The loop variable sample is unused. Consider using any() with a generator expression.

♻️ Suggested refactor
-        # Check if we have any meaningful MBR data
-        has_data = False
-        for sample, stats in mbr_stats.items():
-            proteins = stats.get('proteins', {})
-            if proteins.get('mbr_only', 0) > 0 or proteins.get('both', 0) > 0:
-                has_data = True
-                break
-
-        if not has_data:
+        # Check if we have any meaningful MBR data
+        has_data = any(
+            stats.get('proteins', {}).get('mbr_only', 0) > 0 or 
+            stats.get('proteins', {}).get('both', 0) > 0
+            for stats in mbr_stats.values()
+        )
+        if not has_data:
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6053ac and 1819a98.

📒 Files selected for processing (3)
  • pmultiqc/main.py
  • pmultiqc/modules/fragpipe/fragpipe.py
  • pmultiqc/modules/fragpipe/fragpipe_io.py
🧰 Additional context used
🧬 Code graph analysis (1)
pmultiqc/modules/fragpipe/fragpipe.py (1)
pmultiqc/modules/fragpipe/fragpipe_io.py (11)
  • workflow_reader (282-315)
  • fragger_params_reader (318-353)
  • get_workflow_parameters_table (356-461)
  • manifest_reader (464-508)
  • get_experiment_design_table (511-544)
  • combined_protein_reader (547-611)
  • get_protein_intensity_distribution (614-671)
  • combined_peptide_reader (674-726)
  • get_mbr_stats (729-793)
  • combined_ion_reader (796-864)
  • get_msms_counts_per_peak (867-917)
🪛 Ruff (0.14.11)
pmultiqc/modules/fragpipe/fragpipe.py

171-171: Do not catch blind exception: Exception

(BLE001)


181-181: Do not catch blind exception: Exception

(BLE001)


198-198: Do not catch blind exception: Exception

(BLE001)


205-205: Unpacked variable mbr_cols is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


215-215: Do not catch blind exception: Exception

(BLE001)


222-222: Unpacked variable mbr_info is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


224-224: Do not catch blind exception: Exception

(BLE001)


234-234: Do not catch blind exception: Exception

(BLE001)


247-247: Do not catch blind exception: Exception

(BLE001)


1124-1124: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


1214-1214: Loop control variable sample not used within loop body

(B007)

pmultiqc/modules/fragpipe/fragpipe_io.py

310-310: Do not catch blind exception: Exception

(BLE001)


348-348: Do not catch blind exception: Exception

(BLE001)


356-356: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)


504-504: Consider moving this statement to an else block

(TRY300)


506-506: Do not catch blind exception: Exception

(BLE001)


570-570: Do not catch blind exception: Exception

(BLE001)


696-696: Do not catch blind exception: Exception

(BLE001)


817-817: Do not catch blind exception: Exception

(BLE001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: test_tmt
  • GitHub Check: test_fragpipe
  • GitHub Check: test_maxquant_dia
  • GitHub Check: test_maxquant
  • GitHub Check: test_dia
  • GitHub Check: test_diann
  • GitHub Check: test_lfq
  • GitHub Check: test_proteobench
  • GitHub Check: test_mzid_mgf
  • GitHub Check: test_mzid_mzML
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (18)
pmultiqc/modules/fragpipe/fragpipe_io.py (8)

21-34: LGTM!

The REQUIRED_COLS dictionary is properly extended with the new file types (combined_protein, combined_peptide, combined_ion). The column definitions align with FragPipe output file specifications.


282-315: LGTM!

The workflow_reader function correctly parses key=value format files with proper handling for comments and empty lines. The broad exception handling (flagged by static analysis) is acceptable here as graceful degradation is the intended behavior for optional file loading.


464-508: LGTM!

The manifest_reader handles variable-column manifest files correctly with dynamic column assignment. The error handling with graceful fallback to None is appropriate for optional file loading.


547-611: LGTM with note on MBR column parsing.

The function correctly loads protein data and identifies sample intensity columns. The MBR column parsing logic at lines 598-607 uses string replacement to extract sample base names, which may not handle all naming conventions, but the fallback behavior is safe.


614-671: LGTM!

The function correctly separates sample and contaminant proteins, applies log2 transformation with proper filtering of non-positive values, and handles edge cases gracefully.


796-864: LGTM!

The function correctly reads combined_ion.tsv with robust sample column identification using both anchor-based and pattern-matching fallback approaches. Consistent with the existing ion_reader implementation.


867-917: LGTM!

The function provides two modes of operation based on available data: detailed spectral count statistics or intensity-based identification counts. The fallback approach ensures useful output even when spectral count columns are unavailable.


530-543: Bug: Incorrect column existence check in row iteration.

The checks like if 'experiment' in row test whether the string 'experiment' is a value in the row, not whether the column exists. Since you're iterating with iterrows(), row is a pandas Series where in checks values, not index labels.

🐛 Proposed fix
     for idx, row in manifest_df.iterrows():
         file_name = row.get('file_name', row.get('file_path', f'File_{idx}'))
         entry = {
             "file_name": file_name,
         }
-        if 'experiment' in row:
+        if 'experiment' in manifest_df.columns:
             entry["experiment"] = row['experiment']
-        if 'bioreplicate' in row:
+        if 'bioreplicate' in manifest_df.columns:
             entry["bioreplicate"] = row['bioreplicate']
-        if 'data_type' in row:
+        if 'data_type' in manifest_df.columns:
             entry["data_type"] = row['data_type']

         table_data[idx + 1] = entry

Likely an incorrect or invalid review comment.
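
For reference, a membership test on the pandas Series yielded by iterrows() checks the index labels (here, the column names), not the values, so the original if 'experiment' in row check does behave as intended; a small demonstration:

import pandas as pd

row = pd.Series({"experiment": "A", "bioreplicate": 1})
print("experiment" in row)  # True: `in` consults the Series index labels
print("A" in row)           # False: values are not checked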

pmultiqc/modules/fragpipe/fragpipe.py (7)

100-127: LGTM!

New data containers are properly initialized for all the additional FragPipe file types. The attribute grouping with comments improves code organization.


256-269: LGTM!

Parameters and experiment design tables are properly integrated with appropriate order values (1 and 2) ensuring they appear at the top of the experiment section.


380-400: LGTM!

New visualizations for protein intensity distribution, MBR contribution, and MS/MS counts are properly guarded with data availability checks before drawing.


402-416: Good defensive coding with None filtering.

Using .get("experiment") and filtering out None values from section_group_dict prevents KeyError if the experiment section is not configured.


1021-1068: LGTM!

The draw_parameters method follows the established pattern for table rendering with appropriate configuration options.


1070-1121: LGTM!

The draw_experiment_design method is well-structured with descriptive headers and appropriate helptext.


1292-1413: LGTM!

The draw_msms_counts method handles two data formats appropriately with distinct visualizations. The data availability checks ensure safe dictionary access.

pmultiqc/main.py (3)

155-161: LGTM!

The workflow file pattern registration follows the established convention and enables proper file discovery for FragPipe configuration files.


162-167: LGTM!

The fp-manifest pattern registration is appropriately specific for FragPipe manifest files.


155-174: The *.params pattern is appropriate for this use case.

The broad pattern *.params at the discovery level is intentional and safe. The fragpipe module explicitly filters discovered files to only process the exact filename fragger.params (see fragpipe_io.py line 96: if filename == "fragger.params"), preventing any false positives from being processed.


Comment on lines +761 to +793
for sample in sample_cols:
    mbr_stats[sample] = {
        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
        'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
    }

    # Protein-level MBR stats
    if protein_df is not None:
        spectral_col = None
        intensity_col = sample

        # Find corresponding spectral count column
        for col in protein_df.columns:
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col:
                if 'spectral count' in col.lower():
                    spectral_col = col
                    break

        if spectral_col and intensity_col in protein_df.columns:
            has_spectral = protein_df[spectral_col] > 0
            has_intensity = protein_df[intensity_col] > 0

            msms_only = ((has_spectral) & (~has_intensity)).sum()
            mbr_only = ((~has_spectral) & (has_intensity)).sum()
            both = ((has_spectral) & (has_intensity)).sum()

            mbr_stats[sample]['proteins'] = {
                'msms_only': int(msms_only),
                'mbr_only': int(mbr_only),
                'both': int(both)
            }

return mbr_stats

⚠️ Potential issue | 🟡 Minor

Peptide-level MBR statistics are initialized but never calculated.

The function initializes mbr_stats[sample]['peptides'] with zeros at line 764, but there's no corresponding calculation for peptide-level MBR stats (only protein-level stats are computed at lines 768-791). Either implement the peptide calculation or remove the misleading initialization.

💡 Option 1: Remove unused peptide stats
         mbr_stats[sample] = {
-            'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
-            'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
+            'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
         }

Or implement peptide-level MBR calculations if this data is needed for future visualization.
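
If Option 2 is preferred, a sketch of the peptide-level calculation that mirrors the protein block, assuming combined_peptide.tsv follows the same "<sample> Spectral Count" / intensity column conventions as combined_protein.tsv (peptide_df is hypothetical here and would need to be passed into the function):

        # Peptide-level MBR stats, mirroring the protein block above
        if peptide_df is not None:
            base = sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '')
            pep_spectral_col = next(
                (c for c in peptide_df.columns
                 if base in c and 'spectral count' in c.lower()),
                None,
            )
            if pep_spectral_col and sample in peptide_df.columns:
                has_spectral = peptide_df[pep_spectral_col] > 0
                has_intensity = peptide_df[sample] > 0
                mbr_stats[sample]['peptides'] = {
                    'msms_only': int((has_spectral & ~has_intensity).sum()),
                    'mbr_only': int((~has_spectral & has_intensity).sum()),
                    'both': int((has_spectral & has_intensity).sum()),
                }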

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

for sample in sample_cols:
    mbr_stats[sample] = {
        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
        'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
    }
    # Protein-level MBR stats
    if protein_df is not None:
        spectral_col = None
        intensity_col = sample
        # Find corresponding spectral count column
        for col in protein_df.columns:
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col:
                if 'spectral count' in col.lower():
                    spectral_col = col
                    break
        if spectral_col and intensity_col in protein_df.columns:
            has_spectral = protein_df[spectral_col] > 0
            has_intensity = protein_df[intensity_col] > 0
            msms_only = ((has_spectral) & (~has_intensity)).sum()
            mbr_only = ((~has_spectral) & (has_intensity)).sum()
            both = ((has_spectral) & (has_intensity)).sum()
            mbr_stats[sample]['proteins'] = {
                'msms_only': int(msms_only),
                'mbr_only': int(mbr_only),
                'both': int(both)
            }
return mbr_stats

After:

for sample in sample_cols:
    mbr_stats[sample] = {
        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
    }
    # Protein-level MBR stats
    if protein_df is not None:
        spectral_col = None
        intensity_col = sample
        # Find corresponding spectral count column
        for col in protein_df.columns:
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col:
                if 'spectral count' in col.lower():
                    spectral_col = col
                    break
        if spectral_col and intensity_col in protein_df.columns:
            has_spectral = protein_df[spectral_col] > 0
            has_intensity = protein_df[intensity_col] > 0
            msms_only = ((has_spectral) & (~has_intensity)).sum()
            mbr_only = ((~has_spectral) & (has_intensity)).sum()
            both = ((has_spectral) & (has_intensity)).sum()
            mbr_stats[sample]['proteins'] = {
                'msms_only': int(msms_only),
                'mbr_only': int(mbr_only),
                'both': int(both)
            }
return mbr_stats
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 761 - 793, The code
initializes peptide-level entries in mbr_stats but never fills them; either
remove the 'peptides' initialization or implement peptide-level MBR calculation
analogous to the protein block: when a peptide dataframe (e.g., peptide_df or
peptides_df) is available, locate the matching spectral count column in that
dataframe (similar logic used for protein_df), set intensity_col = sample,
compute has_spectral, has_intensity, then compute msms_only, mbr_only, both and
assign them into mbr_stats[sample]['peptides'] as ints (mirroring the structure
used for mbr_stats[sample]['proteins']).

claude and others added 2 commits January 16, 2026 10:04
…eports

When processing PRIDE datasets containing multiple FragPipe files (e.g.,
psm.tsv and ion.tsv), the service was creating separate reports for each
file. This fix groups FragPipe files by their experiment prefix so that
related files are processed together as a single experiment.

Changes:
- Add FRAGPIPE_FILE_SUFFIXES constant listing all FragPipe file types
- Add group_fragpipe_files() function to group files by experiment prefix
- Modify PRIDE processing loop to process grouped files together
- Files without prefix (psm.tsv, ion.tsv) are grouped as "fragpipe_experiment"
- Files with prefix ({exp}_psm.tsv, {exp}_ion.tsv) are grouped by prefix
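
A minimal sketch of the grouping rule just described, using the FRAGPIPE_FILE_SUFFIXES constant shown elsewhere in this PR; the function body is illustrative, not the service's exact code, and assumes suffixes are checked longest-first so combined_ion.tsv is never mistaken for ion.tsv:

import os

FRAGPIPE_FILE_SUFFIXES = [
    "combined_modified_peptide.tsv",
    "combined_peptide.tsv",
    "combined_protein.tsv",
    "combined_ion.tsv",
    "peptide.tsv",
    "protein.tsv",
    "psm.tsv",
    "ion.tsv",
]

def group_fragpipe_files(paths):
    groups = {}
    for path in paths:
        name = os.path.basename(path)
        for suffix in FRAGPIPE_FILE_SUFFIXES:
            if name.endswith(suffix):
                # Strip the suffix; what remains (minus the joining "_")
                # is the experiment prefix, if any.
                prefix = name[: -len(suffix)].rstrip("_")
                key = prefix or "fragpipe_experiment"
                groups.setdefault(key, []).append(path)
                break
    return groups

print(group_fragpipe_files(["psm.tsv", "ion.tsv", "expA_psm.tsv", "expA_ion.tsv"]))
# {'fragpipe_experiment': ['psm.tsv', 'ion.tsv'], 'expA': ['expA_psm.tsv', 'expA_ion.tsv']}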
…s-Aq8jP

Fix duplicate reports for ion and psm files
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pmultiqc_service/app.py`:
- Around line 838-840: The current loop uses the filename stem (file_name =
os.path.splitext(os.path.basename(file_path))[0]) as the group key and assigns
groups[file_name] = [file_path], which overwrites prior entries with the same
stem; change the logic in the loop that processes non_fragpipe_files so it uses
the full filename (e.g., os.path.basename(file_path)) as the key or, if you must
keep the stem, check if groups already has that key and append file_path to the
existing list instead of replacing it (i.e., if key in groups:
groups[key].append(file_path) else: groups[key] = [file_path]); update the block
that sets file_name and assigns into groups accordingly.
🧹 Nitpick comments (1)
pmultiqc_service/app.py (1)

768-780: Group FragPipe config/manifest files with the experiment (if they’re downloaded).

Line 768–780 only covers TSV outputs. If fragpipe.workflow, fragger.params, or fp-manifest are part of the download set, they’ll be treated as standalone groups and won’t be present when the FragPipe group runs, so parameter parsing can be skipped. Consider adding these suffixes (and, if they’re global, copying them into each group).

♻️ Possible tweak
 FRAGPIPE_FILE_SUFFIXES = [
+    "fragpipe.workflow",
+    "fragger.params",
+    "fp-manifest",
     "combined_modified_peptide.tsv",
     "combined_peptide.tsv",
     "combined_protein.tsv",
     "combined_ion.tsv",
     "peptide.tsv",
     "protein.tsv",
     "psm.tsv",
     "ion.tsv",
 ]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1819a98 and 3f7be83.

📒 Files selected for processing (1)
  • pmultiqc_service/app.py
🧰 Additional context used
🪛 Ruff (0.14.11)
pmultiqc_service/app.py

1421-1421: Do not catch blind exception: Exception

(BLE001)


1461-1461: Do not catch blind exception: Exception

(BLE001)


1462-1462: Use logging.exception instead of logging.error

Replace with exception

(TRY400)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: test_maxquant_dia
  • GitHub Check: test_maxquant
  • GitHub Check: test_diann
  • GitHub Check: test_fragpipe
  • GitHub Check: test_dia
  • GitHub Check: test_mzid_mzML
  • GitHub Check: test_tmt
  • GitHub Check: test_mzid_mgf
  • GitHub Check: test_lfq
  • GitHub Check: test_proteobench
  • GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (1)
pmultiqc_service/app.py (1)

1351-1457: Group-based processing flow looks solid.

The per-group extraction, input-type detection, and progress reporting in Lines 1351–1457 align well with the new grouping strategy.


Comment on lines +838 to +840
for file_path in non_fragpipe_files:
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    groups[file_name] = [file_path]

⚠️ Potential issue | 🟡 Minor

Avoid overwriting non‑FragPipe groups with the same stem.

Line 839–840 uses the filename stem as the group key and overwrites any prior entry; files like sample.mgf and sample.mzML would collapse into one group, dropping one file. Use the full filename and append when the key already exists.

🐛 Proposed fix
-    for file_path in non_fragpipe_files:
-        file_name = os.path.splitext(os.path.basename(file_path))[0]
-        groups[file_name] = [file_path]
+    for file_path in non_fragpipe_files:
+        file_name = os.path.basename(file_path)
+        groups.setdefault(file_name, []).append(file_path)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
Before:

for file_path in non_fragpipe_files:
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    groups[file_name] = [file_path]

After:

for file_path in non_fragpipe_files:
    file_name = os.path.basename(file_path)
    groups.setdefault(file_name, []).append(file_path)
🤖 Prompt for AI Agents
In `@pmultiqc_service/app.py` around lines 838 - 840, The current loop uses the
filename stem (file_name = os.path.splitext(os.path.basename(file_path))[0]) as
the group key and assigns groups[file_name] = [file_path], which overwrites
prior entries with the same stem; change the logic in the loop that processes
non_fragpipe_files so it uses the full filename (e.g.,
os.path.basename(file_path)) as the key or, if you must keep the stem, check if
groups already has that key and append file_path to the existing list instead of
replacing it (i.e., if key in groups: groups[key].append(file_path) else:
groups[key] = [file_path]); update the block that sets file_name and assigns
into groups accordingly.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
pmultiqc/modules/common/file_utils.py (2)

156-157: Missing exception handling for rarfile errors.

The exception handler doesn't catch rarfile.BadRarFile or rarfile.Error, which would be raised for corrupt or invalid RAR files.

Proposed fix
-    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, OSError, ValueError) as e:
+    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, rarfile.BadRarFile, rarfile.Error, OSError, ValueError) as e:

166-168: Missing .rar in archive types list.

The is_archive_file function doesn't include .rar in the recognized archive types, creating an inconsistency with extract_archive_file which now handles RAR files.

Proposed fix
 def is_archive_file(file_path: str) -> bool:
-    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2"]
+    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".rar"]
     return any(file_path.lower().endswith(arch_type) for arch_type in archive_types)
pmultiqc/modules/maxquant/maxquant_plots.py (1)

446-479: Guard unknown data_type to avoid UnboundLocalError.

description_text / help_text are only set in the two explicit branches, so any other value (including case mismatches) will crash at add_sub_section. Initialize defaults and normalize data_type defensively.

🛠️ Proposed fix
-    if data_type == "maxquant":
+    data_type_norm = (data_type or "").lower()
+    description_text = ""
+    help_text = ""
+    if data_type_norm == "maxquant":
         description_text = """
             [Excludes Contaminants] Number of unique (i.e. not counted twice) peptide sequences including modifications (after FDR) per Raw file.
             """
         help_text="""
             If MBR was enabled, three categories ('Genuine (Exclusive)', 'Genuine + Transferred', 'Transferred (Exclusive)'
             are shown, so the user can judge the gain that MBR provides.
             ...
             """
-    elif data_type == "fragpipe":
+    elif data_type_norm == "fragpipe":
         description_text = """
             combined_peptide.tsv
             """
         help_text = """
             combined_peptide.tsv
             """
+    else:
+        description_text = "[Excludes Contaminants] Number of unique peptide sequences per file."
+        help_text = "Counts after FDR filtering."
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/common/file_utils.py`:
- Around line 128-131: The extract_rar function currently calls
rarfile.RarFile(...).extractall without the same security checks as
extract_zip/extract_tar; update extract_rar to mirror those protections by
iterating through rf.infolist()/rf.namelist(), computing the resolved target
path for each member, and ensuring the resolved path is inside the provided
root_dir (preventing path traversal), disallowing/sanitizing absolute paths and
symlinks, and enforcing archive-bomb limits (e.g., max total uncompressed bytes
and max file count) before extraction; if any check fails, raise an exception
and do not extract. Use the same helper logic/function names used by
extract_zip/extract_tar (reuse or extract shared helpers) to keep behavior
consistent.

In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 846-870: validate_columns_existence currently mutates the shared
REQUIRED_KEYWORDS entry for data_name which causes stale state across calls; to
fix, create a fresh per-call copy of the required keywords at the start of
validate_columns_existence (e.g., local_required =
REQUIRED_KEYWORDS[data_name].copy() or use dict(...) / deepcopy as appropriate)
and then use local_required for all mutations and the final
all(local_required.values()) and reporting, leaving the global REQUIRED_KEYWORDS
unchanged.
- Around line 930-969: In cal_peptide_id_gain, guard against division by zero
when computing count_df["MBRgain"]: replace the current raw division using
count_df["mbr"] / count_df["ms/ms_count"] with a safe computation that avoids
inf/NaN (e.g., compute per-row MBRgain only when "ms/ms_count" > 0, otherwise
set NaN or 0), then sanitize infinite values (np.inf) to NaN and compute
mbr_gain using .dropna().mean() before rounding; finally build title_value only
when mbr_gain is a finite number. Reference: function cal_peptide_id_gain,
DataFrame count_df, column "ms/ms_count", "mbr", and the computed "MBRgain".
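
A sketch of the guarded computation that prompt asks for, assuming count_df carries the "mbr" and "ms/ms_count" columns named there; the title_value format is illustrative:

import numpy as np

# Replace zero denominators with NaN so the division yields NaN, not inf,
# and no divide-by-zero warning is raised.
count_df["MBRgain"] = count_df["mbr"] / count_df["ms/ms_count"].replace(0, np.nan)
# Sanitize any residual infinities, then average over valid rows only.
count_df["MBRgain"] = count_df["MBRgain"].replace([np.inf, -np.inf], np.nan)
mbr_gain = count_df["MBRgain"].dropna().mean()

# Build the display value only when the mean is a finite number
# (mean() of an empty series is NaN).
title_value = f"{round(mbr_gain, 2)}" if np.isfinite(mbr_gain) else None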
♻️ Duplicate comments (1)
pmultiqc/modules/fragpipe/fragpipe_io.py (1)

769-799: Peptide-level MBR stats are initialized but never populated.

mbr_stats[sample]["peptides"] stays zeroed because no peptide-level calculation exists. Either compute it or drop the field to avoid misleading output.

🧹 Minimal fix (remove unused peptides key)
-        mbr_stats[sample] = {
-            'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
-            'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
-        }
+        mbr_stats[sample] = {
+            'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
+        }
🧹 Nitpick comments (1)
.github/workflows/python-app.yml (1)

272-281: Code changes are correct; consider documenting the unrar dependency for local users.

The workflow correctly installs unrar and the code uses it properly. However, rarfile requires an external decompressor (unrar, unar, 7z, or bsdtar) to be installed on the system. While the CI handles this, users installing locally without this dependency will encounter errors. The current setup instructions in CONTRIBUTING.md don't mention this external tool requirement—adding a note would improve the developer experience.

Comment on lines 846 to 870
def validate_columns_existence(df_columns, data_name: str):

    col_list = list(df_columns)

    required_keywords = REQUIRED_KEYWORDS[data_name]

    for col in col_list:
        if "Spectral Count" in col:
            required_keywords["Spectral Count"] = True
        if "Match Type" in col:
            required_keywords["Match Type"] = True
        if "Intensity" in col:
            if "MaxLFQ" not in col:
                required_keywords["Intensity"] = True

    all_passed = all(required_keywords.values())

    print(f"Check whether the data {data_name} meets the extraction requirements.")

    for key, found in required_keywords.items():
        status = "exists" if found else "is missing"
        print(f"{key}: {status}")

    return all_passed

⚠️ Potential issue | 🟠 Major

Reset REQUIRED_KEYWORDS per call to avoid stale validation.

validate_columns_existence mutates the shared REQUIRED_KEYWORDS dict, so later calls can incorrectly pass even if required columns are missing. Use a fresh dict per call.

🛠️ Proposed fix
-    required_keywords = REQUIRED_KEYWORDS[data_name]
+    required_keywords = {k: False for k in REQUIRED_KEYWORDS[data_name].keys()}
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 846 - 870,
validate_columns_existence currently mutates the shared REQUIRED_KEYWORDS entry
for data_name which causes stale state across calls; to fix, create a fresh
per-call copy of the required keywords at the start of
validate_columns_existence (e.g., local_required =
REQUIRED_KEYWORDS[data_name].copy() or use dict(...) / deepcopy as appropriate)
and then use local_required for all mutations and the final
all(local_required.values()) and reporting, leaving the global REQUIRED_KEYWORDS
unchanged.

claude and others added 2 commits January 18, 2026 07:53
The extract_rar function was missing security validations present in
extract_zip and extract_tar. This adds:
- Path traversal detection (../ patterns, absolute paths)
- Archive bomb detection (compression ratio > 100:1 or > 10GB uncompressed)
- Proper type hints and logging
- rarfile.Error to the exception handling in extract_archive_file
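
A minimal sketch of those checks using the rarfile package; the limits and messages below are illustrative and may differ from the repo's actual implementation:

import os
import rarfile

MAX_UNCOMPRESSED = 10 * 1024 ** 3  # 10 GB cap on total uncompressed size
MAX_RATIO = 100                    # reject > 100:1 compression ratios

def extract_rar(file_path: str, dest_dir: str) -> None:
    with rarfile.RarFile(file_path) as rf:
        total = 0
        for info in rf.infolist():
            name = info.filename
            # Path traversal: reject absolute paths and ../ components
            if os.path.isabs(name) or ".." in name.replace("\\", "/").split("/"):
                raise ValueError(f"Unsafe path in archive: {name}")
            total += info.file_size or 0
            # Archive bomb: implausible per-file compression ratio
            if info.compress_size and (info.file_size or 0) / info.compress_size > MAX_RATIO:
                raise ValueError(f"Suspicious compression ratio: {name}")
        if total > MAX_UNCOMPRESSED:
            raise ValueError("Archive too large when uncompressed")
        rf.extractall(dest_dir)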
…y-BCoTR

Fix security issue in RAR file extraction
@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
pmultiqc/modules/common/file_utils.py (1)

201-203: Missing .rar in archive type detection.

The is_archive_file function doesn't include .rar in archive_types, but extract_archive_file supports RAR extraction. When is_archive_file is used in main.py:66 to determine whether to extract submitted files, RAR archives will not be detected and won't be extracted.

🔧 Proposed fix
 def is_archive_file(file_path: str) -> bool:
-    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2"]
+    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".rar"]
     return any(file_path.lower().endswith(arch_type) for arch_type in archive_types)

@codecov codecov bot commented Jan 18, 2026

Codecov Report

❌ Patch coverage is 5.62771% with 436 lines in your changes missing coverage. Please review.
✅ Project coverage is 8.39%. Comparing base (a5ab34a) to head (11bd7ae).
⚠️ Report is 3 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| pmultiqc/modules/fragpipe/fragpipe_io.py | 4.82% | 276 Missing ⚠️ |
| pmultiqc/modules/fragpipe/fragpipe.py | 6.80% | 137 Missing ⚠️ |
| pmultiqc/modules/common/plots/id.py | 10.00% | 9 Missing ⚠️ |
| pmultiqc/modules/maxquant/maxquant_plots.py | 12.50% | 7 Missing ⚠️ |
| pmultiqc/main.py | 0.00% | 6 Missing ⚠️ |
| pmultiqc/modules/common/file_utils.py | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #544      +/-   ##
========================================
- Coverage   8.63%   8.39%   -0.25%     
========================================
  Files         44      44              
  Lines       5793    6233     +440     
========================================
+ Hits         500     523      +23     
- Misses      5293    5710     +417     


@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pmultiqc/modules/common/file_utils.py (1)

127-146: Remove undefined rarfile.Error from exception handling.

Line 145 references rarfile.Error in the except clause, but rarfile is not imported and is not declared as a dependency anywhere in the repository. This causes an undefined name error (F821) and will raise a NameError at runtime. Remove rarfile.Error from the exception tuple unless RAR support is being actively implemented with the dependency properly declared and imported.

-    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, rarfile.Error, OSError, ValueError) as e:
+    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, OSError, ValueError) as e:
         raise ValueError(f"Failed to extract {file_path}: {e}") from e
pmultiqc/modules/fragpipe/fragpipe_io.py (1)

174-189: Convert object-dtype sample intensity columns to numeric in-place.

Line 184 validates that object columns can be converted to numeric but does not apply the conversion, leaving them as strings in the returned dataframe. Downstream at line 231, get_ion_intensity_data() performs intensities[intensities > 0] comparisons, which will raise TypeError or behave unexpectedly on string-typed columns.

Assign the result of pd.to_numeric() back to the dataframe column and verify the conversion produced valid numeric values:

🔧 Proposed fix
-        elif ion_df[col].dtype == object:
-            # Try to convert to numeric
-            try:
-                pd.to_numeric(ion_df[col], errors='raise')
-                valid_sample_cols.append(col)
-            except (ValueError, TypeError):
-                pass
+        elif ion_df[col].dtype == object:
+            # Try to convert to numeric
+            try:
+                ion_df[col] = pd.to_numeric(ion_df[col], errors='coerce')
+                if ion_df[col].notna().any():
+                    valid_sample_cols.append(col)
+            except (ValueError, TypeError):
+                pass
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 675-689: The current check uses pd.to_numeric(sample_df[col],
errors='coerce').notna().all() which skips any column that has a single
non-numeric/missing value; instead coerce to numeric and skip only if the entire
column becomes empty. Replace the all() check with logic that converts the
column to numeric, drops NA (e.g., series_numeric =
pd.to_numeric(sample_df[col], errors='coerce').dropna()), continue only when
series_numeric is empty, and then use series_numeric (not the original
sample_df[col]) to compute intensities and contaminant handling for
sample_distribution and contaminant_distribution so partial missingness does not
remove valid data.
- Around line 576-583: Docstrings for combined_protein_reader,
combined_peptide_reader, and combined_ion_reader are out of sync with their
actual return values: each function returns two items (e.g., (protein_df,
sample_intensity_cols) or (peptide_df, sample_intensity_cols) or (ion_df,
validate_columns)) but the docstrings still describe a third item (mbr_cols /
mbr_info / sample_cols). Update each function's Returns section to list only the
two actual returned items and their meanings (remove
mbr_cols/mbr_info/sample_cols descriptions), explicitly name the returned
variables as used in the code (e.g., protein_df and sample_intensity_cols for
combined_protein_reader; peptide_df and sample_intensity_cols for
combined_peptide_reader; ion_df and validate_columns for combined_ion_reader) so
the documentation matches the implementation and call sites.
♻️ Duplicate comments (2)
pmultiqc/modules/fragpipe/fragpipe_io.py (2)

769-799: Peptide MBR stats are initialized but never computed.

Line 771–798 initializes mbr_stats[sample]['peptides'] but no peptide-level calculation is performed. Either compute peptide MBR stats or remove the placeholder to avoid misleading output.


846-861: Avoid mutating global REQUIRED_KEYWORDS.

Line 850: required_keywords = REQUIRED_KEYWORDS[data_name] mutates the shared dict and can cause stale state across calls (later validations can pass incorrectly). Use a per-call copy.

🔧 Proposed fix
-    required_keywords = REQUIRED_KEYWORDS[data_name]
+    required_keywords = {k: False for k in REQUIRED_KEYWORDS[data_name].keys()}

Comment on lines +576 to +583
    Returns
    -------
    tuple
        (protein_df, sample_intensity_cols, mbr_cols) where:
        - protein_df: DataFrame with protein-level data
        - sample_intensity_cols: List of sample intensity column names
        - mbr_cols: Dict mapping sample to MBR-related columns
    """

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First, inspect the file to understand structure
wc -l pmultiqc/modules/fragpipe/fragpipe_io.py

# Find the combined_*_reader functions and their return statements
rg -nP '\bcombined_(protein|peptide|ion)_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -A 5 -B 2

Repository: bigbio/pmultiqc

Length of output: 174


🏁 Script executed:

#!/bin/bash
# Inspect specific lines for docstrings and returns
echo "=== Lines 576-583 (combined_protein_reader docstring) ==="
sed -n '576,583p' pmultiqc/modules/fragpipe/fragpipe_io.py

echo "=== Lines 625-635 (combined_protein_reader return) ==="
sed -n '625,635p' pmultiqc/modules/fragpipe/fragpipe_io.py

echo "=== Lines 706-735 (combined_peptide_reader docstring) ==="
sed -n '706,735p' pmultiqc/modules/fragpipe/fragpipe_io.py

echo "=== Lines 818-844 (combined_ion_reader docstring) ==="
sed -n '818,844p' pmultiqc/modules/fragpipe/fragpipe_io.py

Repository: bigbio/pmultiqc

Length of output: 2513


🏁 Script executed:

#!/bin/bash
# Find all call sites of combined_*_reader functions
echo "=== Call sites ==="
rg -nP '\bcombined_(protein|peptide|ion)_reader\s*\(' pmultiqc/modules/fragpipe/fragpipe_io.py -C 3

Repository: bigbio/pmultiqc

Length of output: 598


🏁 Script executed:

#!/bin/bash
# Search for function calls (not definitions)
echo "=== Calls to combined_protein_reader ==="
rg 'combined_protein_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_protein_reader'

echo "=== Calls to combined_peptide_reader ==="
rg 'combined_peptide_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_peptide_reader'

echo "=== Calls to combined_ion_reader ==="
rg 'combined_ion_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_ion_reader'

Repository: bigbio/pmultiqc

Length of output: 381


🏁 Script executed:

#!/bin/bash
# Search across entire fragpipe module for calls
echo "=== All calls to combined_*_reader ==="
rg 'combined_(protein|peptide|ion)_reader' pmultiqc/modules/fragpipe/ -n | grep -v 'def combined_'

Repository: bigbio/pmultiqc

Length of output: 1096


🏁 Script executed:

#!/bin/bash
# Get the exact return statements for each function
echo "=== combined_protein_reader function definition and return ==="
sed -n '563,630p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20

echo "=== combined_peptide_reader function definition and return ==="
sed -n '694,730p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20

echo "=== combined_ion_reader function definition and return ==="
sed -n '804,845p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20

Repository: bigbio/pmultiqc

Length of output: 2242


🏁 Script executed:

#!/bin/bash
# Check the fragpipe.py file to see the exact unpacking at call sites
echo "=== fragpipe.py line 202 context ==="
sed -n '200,205p' pmultiqc/modules/fragpipe/fragpipe.py

echo "=== fragpipe.py line 219 context ==="
sed -n '217,222p' pmultiqc/modules/fragpipe/fragpipe.py

echo "=== fragpipe.py line 233 context ==="
sed -n '231,236p' pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 1209


Fix docstring return contracts for combined_*_reader functions.

Docstrings incorrectly document return values:

  • combined_protein_reader docstring claims 3 returns (including mbr_cols) but only returns 2
  • combined_peptide_reader docstring claims 3 returns (including mbr_info) but only returns 2
  • combined_ion_reader docstring documents sample_cols but actually returns validate_columns

Call sites in fragpipe.py correctly unpack 2 values, so no runtime errors occur. Update docstrings to match actual returns.
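
For example, the corrected Returns section for combined_protein_reader would read (illustrative docstring text only, matching the two-value return described above):

    Returns
    -------
    tuple
        (protein_df, sample_intensity_cols) where:
        - protein_df: DataFrame with protein-level data
        - sample_intensity_cols: List of sample intensity column names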

🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 576 - 583, Docstrings
for combined_protein_reader, combined_peptide_reader, and combined_ion_reader
are out of sync with their actual return values: each function returns two items
(e.g., (protein_df, sample_intensity_cols) or (peptide_df,
sample_intensity_cols) or (ion_df, validate_columns)) but the docstrings still
describe a third item (mbr_cols / mbr_info / sample_cols). Update each
function's Returns section to list only the two actual returned items and their
meanings (remove mbr_cols/mbr_info/sample_cols descriptions), explicitly name
the returned variables as used in the code (e.g., protein_df and
sample_intensity_cols for combined_protein_reader; peptide_df and
sample_intensity_cols for combined_peptide_reader; ion_df and validate_columns
for combined_ion_reader) so the documentation matches the implementation and
call sites.

Comment on lines +675 to +689
        if not pd.to_numeric(sample_df[col], errors='coerce').notna().all():
            continue

        # Sample intensities
        intensities = sample_df[col].dropna()
        valid_intensities = intensities[intensities > 0]
        if len(valid_intensities) > 0:
            sample_distribution[col] = np.log2(valid_intensities).tolist()

        # Contaminant intensities
        if not contam_df.empty and col in contam_df.columns:
            cont_intensities = contam_df[col].dropna()
            valid_cont = cont_intensities[cont_intensities > 0]
            if len(valid_cont) > 0:
                contaminant_distribution[col] = np.log2(valid_cont).tolist()

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if the file exists and read the relevant lines
if [ -f "pmultiqc/modules/fragpipe/fragpipe_io.py" ]; then
    echo "=== File found. Reading lines 670-695 for context ==="
    sed -n '670,695p' pmultiqc/modules/fragpipe/fragpipe_io.py | cat -n
else
    echo "File not found at expected location. Searching for it..."
    find . -name "fragpipe_io.py" -type f
fi

Repository: bigbio/pmultiqc

Length of output: 1172


Avoid dropping entire columns when any single intensity value is missing.

Line 675–676: The check notna().all() skips columns if any value is missing or non-numeric. This is overly strict and silently removes valid samples from QC plots—common in real proteomics datasets with partial missingness. Only skip columns that are completely empty after numeric coercion.

🔧 Proposed fix
-        if not pd.to_numeric(sample_df[col], errors='coerce').notna().all():
-            continue
-
-        # Sample intensities
-        intensities = sample_df[col].dropna()
+        values = pd.to_numeric(sample_df[col], errors='coerce')
+        if values.notna().sum() == 0:
+            continue
+
+        # Sample intensities
+        intensities = values.dropna()
         valid_intensities = intensities[intensities > 0]
         if len(valid_intensities) > 0:
             sample_distribution[col] = np.log2(valid_intensities).tolist()
@@
-            cont_intensities = contam_df[col].dropna()
+            cont_intensities = pd.to_numeric(contam_df[col], errors='coerce').dropna()
             valid_cont = cont_intensities[cont_intensities > 0]
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 675 - 689, The current
check uses pd.to_numeric(sample_df[col], errors='coerce').notna().all() which
skips any column that has a single non-numeric/missing value; instead coerce to
numeric and skip only if the entire column becomes empty. Replace the all()
check with logic that converts the column to numeric, drops NA (e.g.,
series_numeric = pd.to_numeric(sample_df[col], errors='coerce').dropna()),
continue only when series_numeric is empty, and then use series_numeric (not the
original sample_df[col]) to compute intensities and contaminant handling for
sample_distribution and contaminant_distribution so partial missingness does not
remove valid data.

@coderabbitai coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🧹 Nitpick comments (4)
pmultiqc/modules/fragpipe/fragpipe.py (4)

161-185: Consider catching more specific exceptions.

The generic Exception catches work for graceful fallback, but catching specific exceptions (e.g., FileNotFoundError, ValueError, KeyError) would improve debuggability and avoid masking unexpected errors.

♻️ Example for workflow parsing
         if self.fragpipe_files.get("workflow"):
             try:
                 workflow_path = self.fragpipe_files["workflow"][0]
                 self.parameters = workflow_reader(workflow_path)
                 if self.parameters:
                     log.info("Workflow parameters loaded successfully.")
-            except Exception as e:
+            except (FileNotFoundError, IOError, ValueError) as e:
                 log.warning(f"Error parsing workflow file: {e}")

198-213: Use defensive access for contaminant_affix config.

Direct dictionary access with config.kwargs["contaminant_affix"] raises KeyError when the key is absent. Consider using .get() with a default value for robustness.

♻️ Proposed fix
                 if self.protein_df is not None and self.protein_intensity_cols:
-                    contam_affix = config.kwargs["contaminant_affix"]
+                    contam_affix = config.kwargs.get("contaminant_affix", "CONT")
                     (
                         self.protein_intensity_distribution,
                         self.protein_contam_distribution
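Note that "CONT" in the suggested default is a placeholder for illustration; the project's actual contaminant affix convention should be confirmed before adopting it.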

1152-1165: Fix implicit Optional type hint.

Per PEP 484, use explicit Optional[dict] or dict | None instead of dict = None.

♻️ Proposed fix
     @staticmethod
-    def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict = None):
+    def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict | None = None):
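The X | Y union syntax requires Python 3.10+; on older interpreters, Optional[dict] from the typing module expresses the same annotation.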

1241-1251: Use _ for unused loop variable.

The loop variable sample at line 1243 is not used in the loop body.

♻️ Proposed fix
         has_data = False
-        for sample, stats in mbr_stats.items():
+        for _, stats in mbr_stats.items():
             proteins = stats.get('proteins', {})
             if proteins.get('mbr_only', 0) > 0 or proteins.get('both', 0) > 0:
                 has_data = True
                 break
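Alternatively, iterating with for stats in mbr_stats.values(): avoids unpacking the unused key altogether.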

Comment on lines +215 to +227
        # Parse combined_peptide.tsv
        if self.fragpipe_files.get("combined_peptide"):
            try:
                peptide_path = self.fragpipe_files["combined_peptide"][0]
                peptide_df, combined_peptide_column_valid = combined_peptide_reader(peptide_path)
                log.info("Combined peptide data loaded successfully.")

                if peptide_df is not None and combined_peptide_column_valid:
                    self.peptide_id_gain = cal_peptide_id_gain(peptide_df)
                    log.info("peptide statics loaded successfully.")

            except Exception as e:
                log.warning(f"Error parsing combined_peptide.tsv: {e}")

⚠️ Potential issue | 🟡 Minor

Fix log message placement and typo.

The success log at line 220 fires before validation, and line 224 has a typo ("statics" → "statistics").

🐛 Proposed fix
         if self.fragpipe_files.get("combined_peptide"):
             try:
                 peptide_path = self.fragpipe_files["combined_peptide"][0]
                 peptide_df, combined_peptide_column_valid = combined_peptide_reader(peptide_path)
-                log.info("Combined peptide data loaded successfully.")

                 if peptide_df is not None and combined_peptide_column_valid:
+                    log.info("Combined peptide data loaded successfully.")
                     self.peptide_id_gain = cal_peptide_id_gain(peptide_df)
-                    log.info("peptide statics loaded successfully.")
+                    log.info("Peptide statistics loaded successfully.")

             except Exception as e:
                 log.warning(f"Error parsing combined_peptide.tsv: {e}")
🧰 Tools
🪛 Ruff (0.14.13)

226-226: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 215 - 227, the success
log is printed before validating the parsed DataFrame and contains a typo; move
the "Combined peptide data loaded successfully." log so it runs only after
confirming peptide_df is not None and combined_peptide_column_valid, change the
other log message in the same block from "peptide statics loaded successfully."
to "peptide statistics loaded successfully.", and ensure these messages are
emitted around the combined_peptide_reader call and the cal_peptide_id_gain
assignment (referencing combined_peptide_reader, peptide_df,
combined_peptide_column_valid, cal_peptide_id_gain, and self.peptide_id_gain)
while keeping the existing exception log (log.warning) intact.

Comment on lines +229 to +240
        # Parse combined_ion.tsv for MS/MS counts
        if self.fragpipe_files.get("combined_ion"):
            try:
                combined_ion_path = self.fragpipe_files["combined_ion"][0]
                combined_ion_df, combined_ion_column_valid = combined_ion_reader(combined_ion_path)
                log.info("Combined ion data loaded successfully.")

                if combined_ion_df is not None and combined_ion_column_valid:
                    self.msms_counts = get_msms_counts_per_peak(combined_ion_df)
                    log.info("MS/MS counts per peak loaded successfully.")
            except Exception as e:
                log.warning(f"Error parsing combined_ion.tsv: {e}")

⚠️ Potential issue | 🟡 Minor

Log message fires before validation check.

Similar to the peptide section, the success log at line 234 fires before the validity check at line 236.

🐛 Proposed fix
         if self.fragpipe_files.get("combined_ion"):
             try:
                 combined_ion_path = self.fragpipe_files["combined_ion"][0]
                 combined_ion_df, combined_ion_column_valid = combined_ion_reader(combined_ion_path)
-                log.info("Combined ion data loaded successfully.")

                 if combined_ion_df is not None and combined_ion_column_valid:
+                    log.info("Combined ion data loaded successfully.")
                     self.msms_counts = get_msms_counts_per_peak(combined_ion_df)
                     log.info("MS/MS counts per peak loaded successfully.")
🧰 Tools
🪛 Ruff (0.14.13)

239-239: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 229 - 240, the success
log "Combined ion data loaded successfully." is emitted before you verify
combined_ion_df and combined_ion_column_valid; move or change the log so it only
runs after the validation check. In the try block call
combined_ion_reader(combined_ion_path) as you do, then if combined_ion_df is not
None and combined_ion_column_valid set self.msms_counts via
get_msms_counts_per_peak and call log.info("Combined ion data loaded
successfully.") there; otherwise emit a warning or skip logging. Update
references to combined_ion_reader, combined_ion_df, combined_ion_column_valid,
get_msms_counts_per_peak, self.msms_counts, and log.info accordingly.

Comment on lines +390 to +404
        # MBR (Match-Between-Runs) visualization
        # if self.mbr_stats:
        #     self.draw_mbr_contribution(
        #         sub_section=self.sub_sections["identification"],
        #         mbr_stats=self.mbr_stats
        #     )

        # MS/MS counts per peak from combined_ion.tsv
        if self.msms_counts:
            draw_oversampling(
                self.sub_sections["ms2"],
                self.msms_counts,
                "",
                "fragpipe",
            )

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

rg -n "mbr_stats" --type py

Repository: bigbio/pmultiqc

Length of output: 1291


🏁 Script executed:

# Check the fragpipe.py file for imports and get_mbr_stats usage
rg -n "get_mbr_stats" --type py

Repository: bigbio/pmultiqc

Length of output: 219


🏁 Script executed:

# Get the fragpipe.py file to check imports at the top
head -50 pmultiqc/modules/fragpipe/fragpipe.py | cat -n

Repository: bigbio/pmultiqc

Length of output: 1679


🏁 Script executed:

# Check the get_data method in fragpipe.py
rg -A 100 "def get_data" pmultiqc/modules/fragpipe/fragpipe.py | head -150

Repository: bigbio/pmultiqc

Length of output: 4553


🏁 Script executed:

# Check if get_mbr_stats is called anywhere (not just imported)
rg -B 2 -A 5 "get_mbr_stats\(" pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 41


🏁 Script executed:

# Search for self.mbr_stats assignment
rg -n "self\.mbr_stats\s*=" pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 41


🏁 Script executed:

# Get more of the get_data method to see if mbr_stats is populated
sed -n '130,250p' pmultiqc/modules/fragpipe/fragpipe.py | cat -n

Repository: bigbio/pmultiqc

Length of output: 6305


Remove unused MBR import or implement missing data loading.

The get_mbr_stats function is imported but never called, and self.mbr_stats is never populated in get_data(). The commented code at lines 391–395 references this undefined attribute. Either remove the dead import and commented visualization code, or complete the implementation by calling get_mbr_stats() with the protein and peptide dataframes to populate self.mbr_stats.
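
If the "implement" route is chosen, a minimal sketch of the wiring in get_data(), assuming get_mbr_stats accepts the protein and peptide DataFrames parsed earlier (variable names are illustrative and the exact signature is not confirmed here):

        # Hypothetical: mirrors the graceful-fallback pattern used by the other parsers
        if self.protein_df is not None and peptide_df is not None:
            try:
                self.mbr_stats = get_mbr_stats(self.protein_df, peptide_df)
                if self.mbr_stats:
                    log.info("MBR statistics computed successfully.")
            except Exception as e:
                log.warning(f"Error computing MBR statistics: {e}")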

🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 390 - 404, either remove
the unused get_mbr_stats import and the commented draw_mbr_contribution block,
or populate self.mbr_stats in get_data() by calling get_mbr_stats(...) with the
appropriate protein/peptide DataFrames and then enable the visualization;
specifically, locate the get_data() method in fragpipe.py and set self.mbr_stats
= get_mbr_stats(protein_df, peptide_df) (or similar variable names used there),
ensure get_mbr_stats is imported, and then uncomment/restore the
draw_mbr_contribution(...) call that references self.mbr_stats and
draw_mbr_contribution/sub_sections["identification"] so the MBR visualization
has valid data.

@ypriverol ypriverol merged commit 5fd836f into main Jan 18, 2026
22 of 23 checks passed