Conversation
This commit addresses issue #541 by adding comprehensive FragPipe QC support similar to MaxQuant. New features include:

**Parameters Section:**
- Parse fragpipe.workflow file for configuration parameters
- Display key settings (enzyme, mass tolerances, modifications, MBR, etc.)
- Added search pattern for *.workflow files

**Experiment Design:**
- Parse fp-manifest file for experiment design information
- Display file names, experiments, bioreplicates, and data types
- Added search pattern for *.fp-manifest files

**Protein Intensity Distribution:**
- Parse combined_protein.tsv for protein-level quantification
- Generate boxplot of log2-transformed protein intensities
- Separate visualization for sample vs contaminant proteins

**Match-Between-Runs (MBR) Visualization:**
- Analyze combined_protein.tsv and combined_peptide.tsv for MBR contribution
- Display stacked bar chart showing MS/MS only, MBR only, and both categories
- Helps assess MBR quality and potential false positives

**MS/MS Counts Per Peak:**
- Parse combined_ion.tsv for MS/MS spectral counts
- Generate boxplot of spectral count distribution per sample
- Alternative bar chart view for identification status

All new features use graceful fallback: missing files don't prevent report generation. This follows the same pattern used by the MaxQuant and DIA-NN modules. A sketch of the fallback idea follows below.

Resolves: #541
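To make the "graceful fallback" pattern concrete, here is a minimal sketch of how an optional file can feed a plot without blocking the report. The function name `load_protein_intensities` and the `Intensity` column-suffix convention are illustrative assumptions, not the module's actual API:

```python
import logging

import numpy as np
import pandas as pd

log = logging.getLogger(__name__)


def load_protein_intensities(path: str) -> dict[str, list[float]] | None:
    """Return log2-transformed intensities per sample column, or None if the file is unusable."""
    try:
        df = pd.read_csv(path, sep="\t")
    except (OSError, pd.errors.ParserError) as e:
        # Graceful fallback: a missing/corrupt file skips this plot, not the whole report.
        log.warning("Skipping protein intensity plot, could not read %s: %s", path, e)
        return None

    distributions = {}
    for col in df.columns:
        if not col.endswith("Intensity"):  # assumed FragPipe-style sample columns
            continue
        values = pd.to_numeric(df[col], errors="coerce").dropna()
        positive = values[values > 0]  # log2 is undefined for zeros/negatives
        if len(positive) > 0:
            distributions[col] = np.log2(positive).tolist()
    return distributions
```

A caller can then simply test for `None` and skip the corresponding section, which is the behavior the commit message describes.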
Add parsing of MSFragger's fragger.params file for search engine parameters. This provides additional QC information and serves as a fallback when fragpipe.workflow is not available.

Changes:
- Add fragger_params_reader() function to parse fragger.params files
- Update get_workflow_parameters_table() to merge parameters from both workflow and fragger.params files (workflow takes precedence)
- Add search pattern for *.params files
- Add additional MSFragger-specific parameters to the parameters table:
  - Number of Threads
  - Decoy Prefix
  - Isotope Error
  - Mass Offsets
  - Precursor True Tolerance/Units
  - Calibrate Mass
  - Clip N-term Met
  - Min/Max Peptide Length

The implementation follows the same graceful-fallback pattern: if either file is available, parameters will be displayed. If both are available, workflow parameters take precedence, with fragger.params providing additional information.
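The precedence rule reduces to a dictionary merge. A minimal sketch, assuming both readers return plain dicts of parsed key/value pairs (as the commit message implies):

```python
def merge_parameters(workflow_params: dict | None, fragger_params: dict | None) -> dict:
    """Merge both parameter sources; workflow keys win on conflict."""
    merged = dict(fragger_params or {})   # fragger.params supplies the baseline
    merged.update(workflow_params or {})  # workflow values overwrite duplicates
    return merged
```

Putting `fragger_params` first means any key present in both files ends up with the workflow value, matching the stated precedence.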
…ALIA Improve FragPipe support and file handling
📝 Walkthrough

This PR substantially expands FragPipe integration by introducing I/O handlers for workflow parameters, manifest files, and combined protein/peptide/ion outputs. It adds experiment-aware file grouping in the service layer, refactors shared plotting functions to accept data_type parameters, and updates test/documentation datasets from PXD066146 to PXD062399.
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Service as pmultiqc_service.app
    participant Discovery as File Discovery
    participant Readers as fragpipe_io Readers
    participant Module as FragPipeModule
    participant Plots as Plotting Functions

    User->>Service: Submit PRIDE job (FragPipe files)
    Service->>Discovery: Scan downloaded files
    Discovery-->>Service: Return fragpipe files list
    Service->>Service: group_fragpipe_files() by experiment
    loop For each experiment group
        Service->>Readers: Load workflow, manifest, combined_*
        Readers->>Readers: Parse/validate files
        Readers-->>Module: Return structured data
        Module->>Readers: Process workflow + fragger params
        Readers-->>Module: parameters_table
        Module->>Readers: Extract manifest
        Readers-->>Module: experiment_design_table
        Module->>Readers: Load combined_protein/peptide/ion
        Readers-->>Module: protein_df, peptide_df, ion_df
        Module->>Module: Compute distributions & metrics
        Module->>Plots: Render parameters, experiment design
        Module->>Plots: Render protein intensity distribution
        Module->>Plots: Render MS/MS counts (via draw_oversampling)
        Plots-->>User: MultiQC report with FragPipe sections
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks: ✅ 1 passed | ❌ 2 failed

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 761-793: The code initializes peptide-level entries in mbr_stats
but never fills them; either remove the 'peptides' initialization or implement
peptide-level MBR calculation analogous to the protein block: when a peptide
dataframe (e.g., peptide_df or peptides_df) is available, locate the matching
spectral count column in that dataframe (similar logic used for protein_df), set
intensity_col = sample, compute has_spectral, has_intensity, then compute
msms_only, mbr_only, both and assign them into mbr_stats[sample]['peptides'] as
ints (mirroring the structure used for mbr_stats[sample]['proteins']).
🧹 Nitpick comments (6)
pmultiqc/modules/fragpipe/fragpipe_io.py (3)
66-77: Consider using an exact match for `psm.tsv` for consistency.

Lines 68–75 use exact filename matching for `ion.tsv`, `combined_ion.tsv`, `combined_protein.tsv`, and `combined_peptide.tsv`. However, line 76 uses `"psm" in filename`, which could unintentionally match files like `mypsm.tsv` or `psm_backup.tsv`.

♻️ Suggested fix for consistent matching

```diff
     elif filename == "combined_peptide.tsv":
         fragpipe_files["combined_peptide"].append(full_path)
-    elif "psm" in filename:
+    elif filename == "psm.tsv":
         fragpipe_files["psm"].append(full_path)
```
318-353: Consider extracting common key-value parsing logic.

`fragger_params_reader` and `workflow_reader` share nearly identical parsing logic. Consider extracting a shared helper function.

♻️ Optional refactor to reduce duplication

```python
def _parse_key_value_file(file_path: str, file_type: str) -> dict | None:
    """Parse a key=value format file."""
    parameters = {}
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith('#'):
                    continue
                if '=' in line:
                    key, value = line.split('=', 1)
                    parameters[key.strip()] = value.strip()
    except Exception as e:
        log.warning(f"Error reading {file_type} file {file_path}: {e}")
        return None
    log.info(f"Loaded {len(parameters)} parameters from {file_type} file")
    return parameters


def workflow_reader(file_path: str):
    return _parse_key_value_file(file_path, "workflow")


def fragger_params_reader(file_path: str):
    return _parse_key_value_file(file_path, "fragger.params")
```
356-361: Use explicit `Optional` or union type for the optional parameter.

Per PEP 484, use explicit `Optional[dict]` or `dict | None` instead of `dict = None`.

♻️ Suggested fix

```diff
-def get_workflow_parameters_table(parameters: dict, fragger_params: dict = None):
+def get_workflow_parameters_table(parameters: dict, fragger_params: dict | None = None):
```

pmultiqc/modules/fragpipe/fragpipe.py (3)
202-225: Prefix unused unpacked variables with underscore.

`mbr_cols` (line 205) and `mbr_info` (line 222) are unpacked but never used. Prefix them with an underscore to indicate intentional non-use.

♻️ Suggested fix

```diff
-        self.protein_df, self.protein_intensity_cols, mbr_cols = combined_protein_reader(protein_path)
+        self.protein_df, self.protein_intensity_cols, _mbr_cols = combined_protein_reader(protein_path)
```

```diff
-        self.peptide_df, self.peptide_sample_cols, mbr_info = combined_peptide_reader(peptide_path)
+        self.peptide_df, self.peptide_sample_cols, _mbr_info = combined_peptide_reader(peptide_path)
```
1123-1127: Use explicit `Optional` or union type for the optional parameter.

Per PEP 484, use an explicit type annotation for optional parameters.

♻️ Suggested fix

```diff
 @staticmethod
-def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict = None):
+def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict | None = None):
```
1212-1222: Consider using `any()` for cleaner validation.

The loop variable `sample` is unused. Consider using `any()` with a generator expression.

♻️ Suggested refactor

```diff
-    # Check if we have any meaningful MBR data
-    has_data = False
-    for sample, stats in mbr_stats.items():
-        proteins = stats.get('proteins', {})
-        if proteins.get('mbr_only', 0) > 0 or proteins.get('both', 0) > 0:
-            has_data = True
-            break
-
-    if not has_data:
+    # Check if we have any meaningful MBR data
+    has_data = any(
+        stats.get('proteins', {}).get('mbr_only', 0) > 0 or
+        stats.get('proteins', {}).get('both', 0) > 0
+        for stats in mbr_stats.values()
+    )
+    if not has_data:
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- pmultiqc/main.py
- pmultiqc/modules/fragpipe/fragpipe.py
- pmultiqc/modules/fragpipe/fragpipe_io.py
🧰 Additional context used
🧬 Code graph analysis (1)
pmultiqc/modules/fragpipe/fragpipe.py (1)
pmultiqc/modules/fragpipe/fragpipe_io.py (11)
- workflow_reader (282-315)
- fragger_params_reader (318-353)
- get_workflow_parameters_table (356-461)
- manifest_reader (464-508)
- get_experiment_design_table (511-544)
- combined_protein_reader (547-611)
- get_protein_intensity_distribution (614-671)
- combined_peptide_reader (674-726)
- get_mbr_stats (729-793)
- combined_ion_reader (796-864)
- get_msms_counts_per_peak (867-917)
🪛 Ruff (0.14.11)
pmultiqc/modules/fragpipe/fragpipe.py
171-171: Do not catch blind exception: Exception
(BLE001)
181-181: Do not catch blind exception: Exception
(BLE001)
198-198: Do not catch blind exception: Exception
(BLE001)
205-205: Unpacked variable mbr_cols is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
215-215: Do not catch blind exception: Exception
(BLE001)
222-222: Unpacked variable mbr_info is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
224-224: Do not catch blind exception: Exception
(BLE001)
234-234: Do not catch blind exception: Exception
(BLE001)
247-247: Do not catch blind exception: Exception
(BLE001)
1124-1124: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
1214-1214: Loop control variable sample not used within loop body
(B007)
pmultiqc/modules/fragpipe/fragpipe_io.py
310-310: Do not catch blind exception: Exception
(BLE001)
348-348: Do not catch blind exception: Exception
(BLE001)
356-356: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
504-504: Consider moving this statement to an else block
(TRY300)
506-506: Do not catch blind exception: Exception
(BLE001)
570-570: Do not catch blind exception: Exception
(BLE001)
696-696: Do not catch blind exception: Exception
(BLE001)
817-817: Do not catch blind exception: Exception
(BLE001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: test_tmt
- GitHub Check: test_fragpipe
- GitHub Check: test_maxquant_dia
- GitHub Check: test_maxquant
- GitHub Check: test_dia
- GitHub Check: test_diann
- GitHub Check: test_lfq
- GitHub Check: test_proteobench
- GitHub Check: test_mzid_mgf
- GitHub Check: test_mzid_mzML
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (18)
pmultiqc/modules/fragpipe/fragpipe_io.py (8)
21-34: LGTM! The REQUIRED_COLS dictionary is properly extended with the new file types (combined_protein, combined_peptide, combined_ion). The column definitions align with FragPipe output file specifications.
282-315: LGTM! The `workflow_reader` function correctly parses key=value format files with proper handling for comments and empty lines. The broad exception handling (flagged by static analysis) is acceptable here, as graceful degradation is the intended behavior for optional file loading.
464-508: LGTM! The `manifest_reader` handles variable-column manifest files correctly with dynamic column assignment. The error handling with graceful fallback to `None` is appropriate for optional file loading.
547-611: LGTM with a note on MBR column parsing. The function correctly loads protein data and identifies sample intensity columns. The MBR column parsing logic at lines 598–607 uses string replacement to extract sample base names, which may not handle all naming conventions, but the fallback behavior is safe.
614-671: LGTM! The function correctly separates sample and contaminant proteins, applies log2 transformation with proper filtering of non-positive values, and handles edge cases gracefully.
796-864: LGTM! The function correctly reads combined_ion.tsv with robust sample column identification using both anchor-based and pattern-matching fallback approaches. Consistent with the existing `ion_reader` implementation.
867-917: LGTM! The function provides two modes of operation based on available data: detailed spectral count statistics or intensity-based identification counts. The fallback approach ensures useful output even when spectral count columns are unavailable.
530-543: Bug: Incorrect column existence check in row iteration.

Checks like `if 'experiment' in row` test whether the string `'experiment'` is a value in the row, not whether the column exists. Since you're iterating with `iterrows()`, `row` is a pandas Series where `in` checks values, not index labels.

🐛 Proposed fix

```diff
 for idx, row in manifest_df.iterrows():
     file_name = row.get('file_name', row.get('file_path', f'File_{idx}'))
     entry = {
         "file_name": file_name,
     }
-    if 'experiment' in row:
+    if 'experiment' in manifest_df.columns:
         entry["experiment"] = row['experiment']
-    if 'bioreplicate' in row:
+    if 'bioreplicate' in manifest_df.columns:
         entry["bioreplicate"] = row['bioreplicate']
-    if 'data_type' in row:
+    if 'data_type' in manifest_df.columns:
         entry["data_type"] = row['data_type']
     table_data[idx + 1] = entry
```

Likely an incorrect or invalid review comment.
pmultiqc/modules/fragpipe/fragpipe.py (7)
100-127: LGTM! New data containers are properly initialized for all the additional FragPipe file types. The attribute grouping with comments improves code organization.
256-269: LGTM! Parameters and experiment design tables are properly integrated with appropriate order values (1 and 2), ensuring they appear at the top of the experiment section.
380-400: LGTM! New visualizations for protein intensity distribution, MBR contribution, and MS/MS counts are properly guarded with data availability checks before drawing.
402-416: Good defensive coding with None filtering. Using `.get("experiment")` and filtering out None values from `section_group_dict` prevents a KeyError if the experiment section is not configured.
1021-1068: LGTM! The `draw_parameters` method follows the established pattern for table rendering with appropriate configuration options.
1070-1121: LGTM! The `draw_experiment_design` method is well-structured with descriptive headers and appropriate helptext.
1292-1413: LGTM! The `draw_msms_counts` method handles two data formats appropriately with distinct visualizations. The data availability checks ensure safe dictionary access.

pmultiqc/main.py (3)
155-161: LGTM! The workflow file pattern registration follows the established convention and enables proper file discovery for FragPipe configuration files.
162-167: LGTM! The fp-manifest pattern registration is appropriately specific for FragPipe manifest files.
155-174: The `*.params` pattern is appropriate for this use case. The broad pattern `*.params` at the discovery level is intentional and safe. The fragpipe module explicitly filters discovered files to only process the exact filename `fragger.params` (see fragpipe_io.py line 96: `if filename == "fragger.params"`), preventing any false positives from being processed.
```python
for sample in sample_cols:
    mbr_stats[sample] = {
        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
        'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
    }

    # Protein-level MBR stats
    if protein_df is not None:
        spectral_col = None
        intensity_col = sample

        # Find corresponding spectral count column
        for col in protein_df.columns:
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col:
                if 'spectral count' in col.lower():
                    spectral_col = col
                    break

        if spectral_col and intensity_col in protein_df.columns:
            has_spectral = protein_df[spectral_col] > 0
            has_intensity = protein_df[intensity_col] > 0

            msms_only = ((has_spectral) & (~has_intensity)).sum()
            mbr_only = ((~has_spectral) & (has_intensity)).sum()
            both = ((has_spectral) & (has_intensity)).sum()

            mbr_stats[sample]['proteins'] = {
                'msms_only': int(msms_only),
                'mbr_only': int(mbr_only),
                'both': int(both)
            }

return mbr_stats
```
Peptide-level MBR statistics are initialized but never calculated.
The function initializes mbr_stats[sample]['peptides'] with zeros at line 764, but there's no corresponding calculation for peptide-level MBR stats (only protein-level stats are computed at lines 768-791). Either implement the peptide calculation or remove the misleading initialization.
💡 Option 1: Remove unused peptide stats

```diff
 mbr_stats[sample] = {
-    'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
-    'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
+    'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
 }
```

Or implement peptide-level MBR calculations if this data is needed for future visualization.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The suggested replacement:

```python
for sample in sample_cols:
    mbr_stats[sample] = {
        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
    }

    # Protein-level MBR stats
    if protein_df is not None:
        spectral_col = None
        intensity_col = sample

        # Find corresponding spectral count column
        for col in protein_df.columns:
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col:
                if 'spectral count' in col.lower():
                    spectral_col = col
                    break

        if spectral_col and intensity_col in protein_df.columns:
            has_spectral = protein_df[spectral_col] > 0
            has_intensity = protein_df[intensity_col] > 0

            msms_only = ((has_spectral) & (~has_intensity)).sum()
            mbr_only = ((~has_spectral) & (has_intensity)).sum()
            both = ((has_spectral) & (has_intensity)).sum()

            mbr_stats[sample]['proteins'] = {
                'msms_only': int(msms_only),
                'mbr_only': int(mbr_only),
                'both': int(both)
            }

return mbr_stats
```
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 761 - 793, The code
initializes peptide-level entries in mbr_stats but never fills them; either
remove the 'peptides' initialization or implement peptide-level MBR calculation
analogous to the protein block: when a peptide dataframe (e.g., peptide_df or
peptides_df) is available, locate the matching spectral count column in that
dataframe (similar logic used for protein_df), set intensity_col = sample,
compute has_spectral, has_intensity, then compute msms_only, mbr_only, both and
assign them into mbr_stats[sample]['peptides'] as ints (mirroring the structure
used for mbr_stats[sample]['proteins']).
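If the peptide-level path described in the prompt is implemented instead, it could mirror the protein block inside the same `for sample in sample_cols:` loop. A sketch under the prompt's assumptions; `peptide_df` and its column naming conventions are not verified against the module:

```python
# Peptide-level MBR stats, mirroring the protein block above (runs per sample)
if peptide_df is not None:
    # Assumed: peptide_df follows the same "<base> Spectral Count" naming scheme
    spectral_col = next(
        (
            col
            for col in peptide_df.columns
            if sample.replace(' MaxLFQ Intensity', '').replace(' Intensity', '') in col
            and 'spectral count' in col.lower()
        ),
        None,
    )
    if spectral_col and sample in peptide_df.columns:
        has_spectral = peptide_df[spectral_col] > 0
        has_intensity = peptide_df[sample] > 0
        mbr_stats[sample]['peptides'] = {
            'msms_only': int((has_spectral & ~has_intensity).sum()),
            'mbr_only': int((~has_spectral & has_intensity).sum()),
            'both': int((has_spectral & has_intensity).sum()),
        }
```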
…eports
When processing PRIDE datasets containing multiple FragPipe files (e.g.,
psm.tsv and ion.tsv), the service was creating separate reports for each
file. This fix groups FragPipe files by their experiment prefix so that
related files are processed together as a single experiment.
Changes:
- Add FRAGPIPE_FILE_SUFFIXES constant listing all FragPipe file types
- Add group_fragpipe_files() function to group files by experiment prefix
- Modify PRIDE processing loop to process grouped files together
- Files without prefix (psm.tsv, ion.tsv) are grouped as "fragpipe_experiment"
- Files with prefix ({exp}_psm.tsv, {exp}_ion.tsv) are grouped by prefix
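A rough sketch of the grouping idea described above (a hypothetical helper; the constant mirrors the `FRAGPIPE_FILE_SUFFIXES` named in the commit message, but the exact implementation in `pmultiqc_service/app.py` may differ):

```python
import os

# Longest suffixes first, so "combined_peptide.tsv" isn't swallowed by "peptide.tsv"
FRAGPIPE_FILE_SUFFIXES = [
    "combined_modified_peptide.tsv",
    "combined_peptide.tsv",
    "combined_protein.tsv",
    "combined_ion.tsv",
    "peptide.tsv",
    "protein.tsv",
    "psm.tsv",
    "ion.tsv",
]


def group_fragpipe_files(file_paths: list[str]) -> dict[str, list[str]]:
    """Group FragPipe outputs by experiment prefix so one report covers related files."""
    groups: dict[str, list[str]] = {}
    for path in file_paths:
        name = os.path.basename(path)
        for suffix in FRAGPIPE_FILE_SUFFIXES:
            if name.endswith(suffix):
                # "exp1_psm.tsv" -> "exp1"; bare "psm.tsv" -> shared default group
                prefix = name[: -len(suffix)].rstrip("_")
                key = prefix or "fragpipe_experiment"
                groups.setdefault(key, []).append(path)
                break
    return groups
```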
…s-Aq8jP Fix duplicate reports for ion and psm files
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pmultiqc_service/app.py`:
- Around line 838-840: The current loop uses the filename stem (file_name =
os.path.splitext(os.path.basename(file_path))[0]) as the group key and assigns
groups[file_name] = [file_path], which overwrites prior entries with the same
stem; change the logic in the loop that processes non_fragpipe_files so it uses
the full filename (e.g., os.path.basename(file_path)) as the key or, if you must
keep the stem, check if groups already has that key and append file_path to the
existing list instead of replacing it (i.e., if key in groups:
groups[key].append(file_path) else: groups[key] = [file_path]); update the block
that sets file_name and assigns into groups accordingly.
🧹 Nitpick comments (1)
pmultiqc_service/app.py (1)
768-780: Group FragPipe config/manifest files with the experiment (if they're downloaded).

Lines 768–780 only cover TSV outputs. If `fragpipe.workflow`, `fragger.params`, or `fp-manifest` files are part of the download set, they'll be treated as standalone groups and won't be present when the FragPipe group runs, so parameter parsing can be skipped. Consider adding these suffixes (and, if they're global, copying them into each group).

♻️ Possible tweak

```diff
 FRAGPIPE_FILE_SUFFIXES = [
+    "fragpipe.workflow",
+    "fragger.params",
+    "fp-manifest",
     "combined_modified_peptide.tsv",
     "combined_peptide.tsv",
     "combined_protein.tsv",
     "combined_ion.tsv",
     "peptide.tsv",
     "protein.tsv",
     "psm.tsv",
     "ion.tsv",
 ]
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
pmultiqc_service/app.py
🧰 Additional context used
🪛 Ruff (0.14.11)
pmultiqc_service/app.py
1421-1421: Do not catch blind exception: Exception
(BLE001)
1461-1461: Do not catch blind exception: Exception
(BLE001)
1462-1462: Use logging.exception instead of logging.error
Replace with exception
(TRY400)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: test_maxquant_dia
- GitHub Check: test_maxquant
- GitHub Check: test_diann
- GitHub Check: test_fragpipe
- GitHub Check: test_dia
- GitHub Check: test_mzid_mzML
- GitHub Check: test_tmt
- GitHub Check: test_mzid_mgf
- GitHub Check: test_lfq
- GitHub Check: test_proteobench
- GitHub Check: Codacy Static Code Analysis
🔇 Additional comments (1)
pmultiqc_service/app.py (1)
1351-1457: Group-based processing flow looks solid. The per-group extraction, input-type detection, and progress reporting in lines 1351–1457 align well with the new grouping strategy.
```python
for file_path in non_fragpipe_files:
    file_name = os.path.splitext(os.path.basename(file_path))[0]
    groups[file_name] = [file_path]
```
Avoid overwriting non‑FragPipe groups with the same stem.
Line 839–840 uses the filename stem as the group key and overwrites any prior entry; files like sample.mgf and sample.mzML would collapse into one group, dropping one file. Use the full filename and append when the key already exists.
🐛 Proposed fix

```diff
-    for file_path in non_fragpipe_files:
-        file_name = os.path.splitext(os.path.basename(file_path))[0]
-        groups[file_name] = [file_path]
+    for file_path in non_fragpipe_files:
+        file_name = os.path.basename(file_path)
+        groups.setdefault(file_name, []).append(file_path)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The suggested replacement:

```python
for file_path in non_fragpipe_files:
    file_name = os.path.basename(file_path)
    groups.setdefault(file_name, []).append(file_path)
```
🤖 Prompt for AI Agents
In `@pmultiqc_service/app.py` around lines 838 - 840, The current loop uses the
filename stem (file_name = os.path.splitext(os.path.basename(file_path))[0]) as
the group key and assigns groups[file_name] = [file_path], which overwrites
prior entries with the same stem; change the logic in the loop that processes
non_fragpipe_files so it uses the full filename (e.g.,
os.path.basename(file_path)) as the key or, if you must keep the stem, check if
groups already has that key and append file_path to the existing list instead of
replacing it (i.e., if key in groups: groups[key].append(file_path) else:
groups[key] = [file_path]); update the block that sets file_name and assigns
into groups accordingly.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
pmultiqc/modules/common/file_utils.py (2)
156-157: Missing exception handling for `rarfile` errors.

The exception handler doesn't catch `rarfile.BadRarFile` or `rarfile.Error`, which would be raised for corrupt or invalid RAR files.

Proposed fix

```diff
-    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, OSError, ValueError) as e:
+    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, rarfile.BadRarFile, rarfile.Error, OSError, ValueError) as e:
```
166-168: Missing `.rar` in archive types list.

The `is_archive_file` function doesn't include `.rar` in the recognized archive types, creating an inconsistency with `extract_archive_file`, which now handles RAR files.

Proposed fix

```diff
 def is_archive_file(file_path: str) -> bool:
-    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2"]
+    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".rar"]
     return any(file_path.lower().endswith(arch_type) for arch_type in archive_types)
```

pmultiqc/modules/maxquant/maxquant_plots.py (1)
446-479: Guard unknown `data_type` to avoid UnboundLocalError.

`description_text`/`help_text` are only set in the two explicit branches, so any other value (including case mismatches) will crash at `add_sub_section`. Initialize defaults and normalize `data_type` defensively.

🛠️ Proposed fix

```diff
-    if data_type == "maxquant":
+    data_type_norm = (data_type or "").lower()
+    description_text = ""
+    help_text = ""
+    if data_type_norm == "maxquant":
         description_text = """
             [Excludes Contaminants] Number of unique (i.e. not counted twice) peptide sequences
             including modifications (after FDR) per Raw file.
             """
         help_text = """
             If MBR was enabled, three categories ('Genuine (Exclusive)', 'Genuine + Transferred',
             'Transferred (Exclusive)') are shown, so the user can judge the gain that MBR provides.
             ...
             """
-    elif data_type == "fragpipe":
+    elif data_type_norm == "fragpipe":
         description_text = """
             combined_peptide.tsv
             """
         help_text = """
             combined_peptide.tsv
             """
+    else:
+        description_text = "[Excludes Contaminants] Number of unique peptide sequences per file."
+        help_text = "Counts after FDR filtering."
```
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/common/file_utils.py`:
- Around line 128-131: The extract_rar function currently calls
rarfile.RarFile(...).extractall without the same security checks as
extract_zip/extract_tar; update extract_rar to mirror those protections by
iterating through rf.infolist()/rf.namelist(), computing the resolved target
path for each member, and ensuring the resolved path is inside the provided
root_dir (preventing path traversal), disallowing/sanitizing absolute paths and
symlinks, and enforcing archive-bomb limits (e.g., max total uncompressed bytes
and max file count) before extraction; if any check fails, raise an exception
and do not extract. Use the same helper logic/function names used by
extract_zip/extract_tar (reuse or extract shared helpers) to keep behavior
consistent.
In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 846-870: validate_columns_existence currently mutates the shared
REQUIRED_KEYWORDS entry for data_name which causes stale state across calls; to
fix, create a fresh per-call copy of the required keywords at the start of
validate_columns_existence (e.g., local_required =
REQUIRED_KEYWORDS[data_name].copy() or use dict(...) / deepcopy as appropriate)
and then use local_required for all mutations and the final
all(local_required.values()) and reporting, leaving the global REQUIRED_KEYWORDS
unchanged.
- Around line 930-969: In cal_peptide_id_gain, guard against division by zero
when computing count_df["MBRgain"]: replace the current raw division using
count_df["mbr"] / count_df["ms/ms_count"] with a safe computation that avoids
inf/NaN (e.g., compute per-row MBRgain only when "ms/ms_count" > 0, otherwise
set NaN or 0), then sanitize infinite values (np.inf) to NaN and compute
mbr_gain using .dropna().mean() before rounding; finally build title_value only
when mbr_gain is a finite number. Reference: function cal_peptide_id_gain,
DataFrame count_df, column "ms/ms_count", "mbr", and the computed "MBRgain".
♻️ Duplicate comments (1)
pmultiqc/modules/fragpipe/fragpipe_io.py (1)
769-799: Peptide-level MBR stats are initialized but never populated.

`mbr_stats[sample]["peptides"]` stays zeroed because no peptide-level calculation exists. Either compute it or drop the field to avoid misleading output.

🧹 Minimal fix (remove unused peptides key)

```diff
-    mbr_stats[sample] = {
-        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0},
-        'peptides': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
-    }
+    mbr_stats[sample] = {
+        'proteins': {'msms_only': 0, 'mbr_only': 0, 'both': 0}
+    }
```
🧹 Nitpick comments (1)
.github/workflows/python-app.yml (1)
272-281: Code changes are correct; consider documenting the unrar dependency for local users.

The workflow correctly installs `unrar`, and the code uses it properly. However, `rarfile` requires an external decompressor (`unrar`, `unar`, `7z`, or `bsdtar`) to be installed on the system. While CI handles this, users installing locally without this dependency will encounter errors. The current setup instructions in CONTRIBUTING.md don't mention this external tool requirement; adding a note would improve the developer experience.
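One way to surface the missing tool early is a startup check. A minimal sketch, not project code; the tool list mirrors the decompressors `rarfile` can use:

```python
import shutil


def find_rar_tool() -> str | None:
    """Return the first available external RAR decompressor, or None."""
    for tool in ("unrar", "unar", "7z", "bsdtar"):
        if shutil.which(tool):
            return tool
    return None


if find_rar_tool() is None:
    raise RuntimeError(
        "RAR support needs an external decompressor (unrar/unar/7z/bsdtar); "
        "install one before extracting .rar archives."
    )
```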
```python
def validate_columns_existence(df_columns, data_name: str):

    col_list = list(df_columns)

    required_keywords = REQUIRED_KEYWORDS[data_name]

    for col in col_list:
        if "Spectral Count" in col:
            required_keywords["Spectral Count"] = True
        if "Match Type" in col:
            required_keywords["Match Type"] = True
        if "Intensity" in col:
            if "MaxLFQ" not in col:
                required_keywords["Intensity"] = True

    all_passed = all(required_keywords.values())

    print(f"Check whether the data {data_name} meets the extraction requirements.")

    for key, found in required_keywords.items():
        status = "exists" if found else "is missing"
        print(f"{key}: {status}")

    return all_passed
```
Reset REQUIRED_KEYWORDS per call to avoid stale validation.
validate_columns_existence mutates the shared REQUIRED_KEYWORDS dict, so later calls can incorrectly pass even if required columns are missing. Use a fresh dict per call.
🛠️ Proposed fix

```diff
-    required_keywords = REQUIRED_KEYWORDS[data_name]
+    required_keywords = {k: False for k in REQUIRED_KEYWORDS[data_name].keys()}
```
+ required_keywords = {k: False for k in REQUIRED_KEYWORDS[data_name].keys()}🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 846 - 870,
validate_columns_existence currently mutates the shared REQUIRED_KEYWORDS entry
for data_name which causes stale state across calls; to fix, create a fresh
per-call copy of the required keywords at the start of
validate_columns_existence (e.g., local_required =
REQUIRED_KEYWORDS[data_name].copy() or use dict(...) / deepcopy as appropriate)
and then use local_required for all mutations and the final
all(local_required.values()) and reporting, leaving the global REQUIRED_KEYWORDS
unchanged.
The extract_rar function was missing security validations present in extract_zip and extract_tar. This adds:
- Path traversal detection (../ patterns, absolute paths)
- Archive bomb detection (compression ratio > 100:1 or > 10GB uncompressed)
- Proper type hints and logging
- rarfile.Error to the exception handling in extract_archive_file
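A hedged sketch of the kind of checks this commit describes (names and limits are assumptions; the real `extract_rar` in `file_utils.py` may differ in structure):

```python
import os

import rarfile  # requires an external unrar/unar/7z/bsdtar tool on the system

MAX_TOTAL_BYTES = 10 * 1024**3  # 10 GB uncompressed ceiling
MAX_RATIO = 100                 # flag compression ratios above 100:1


def extract_rar_safely(archive_path: str, dest_dir: str) -> None:
    """Extract a RAR archive after path-traversal and archive-bomb checks."""
    dest_root = os.path.realpath(dest_dir)
    with rarfile.RarFile(archive_path) as rf:
        total = 0
        for info in rf.infolist():
            # Path traversal: the resolved member path must stay inside dest_dir
            target = os.path.realpath(os.path.join(dest_root, info.filename))
            if not target.startswith(dest_root + os.sep):
                raise ValueError(f"Unsafe path in archive: {info.filename}")
            # Archive bomb: bound total size and per-file compression ratio
            total += info.file_size
            if total > MAX_TOTAL_BYTES:
                raise ValueError("Archive exceeds uncompressed size limit")
            if info.compress_size and info.file_size / info.compress_size > MAX_RATIO:
                raise ValueError(f"Suspicious compression ratio: {info.filename}")
        rf.extractall(dest_root)
```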
…y-BCoTR Fix security issue in RAR file extraction
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pmultiqc/modules/common/file_utils.py (1)
201-203: Missing `.rar` in archive type detection.

The `is_archive_file` function doesn't include `.rar` in `archive_types`, but `extract_archive_file` supports RAR extraction. When `is_archive_file` is used in main.py:66 to determine whether to extract submitted files, RAR archives will not be detected and won't be extracted.

🔧 Proposed fix

```diff
 def is_archive_file(file_path: str) -> bool:
-    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2"]
+    archive_types = [".zip", ".gz", ".tar", ".tar.gz", ".tgz", ".tar.bz2", ".rar"]
     return any(file_path.lower().endswith(arch_type) for arch_type in archive_types)
```
Codecov Report

❌ Patch coverage is … Additional details and impacted files:

```
@@           Coverage Diff            @@
##            main     #544     +/-   ##
========================================
- Coverage   8.63%    8.39%    -0.25%
========================================
  Files         44       44
  Lines       5793     6233     +440
========================================
+ Hits         500      523      +23
- Misses      5293     5710     +417
```

☔ View full report in Codecov by Sentry.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
pmultiqc/modules/common/file_utils.py (1)
127-146: Remove undefined `rarfile.Error` from exception handling.

Line 145 references `rarfile.Error` in the except clause, but `rarfile` is not imported and is not declared as a dependency anywhere in the repository. This causes an undefined name error (F821) and will raise a `NameError` at runtime. Remove `rarfile.Error` from the exception tuple unless RAR support is being actively implemented with the dependency properly declared and imported.

```diff
-    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, rarfile.Error, OSError, ValueError) as e:
+    except (zipfile.BadZipFile, tarfile.TarError, gzip.BadGzipFile, OSError, ValueError) as e:
         raise ValueError(f"Failed to extract {file_path}: {e}") from e
```

pmultiqc/modules/fragpipe/fragpipe_io.py (1)
174-189: Convert object-dtype sample intensity columns to numeric in-place.

Line 184 validates that object columns can be converted to numeric but does not apply the conversion, leaving them as strings in the returned dataframe. Downstream at line 231, `get_ion_intensity_data()` performs `intensities[intensities > 0]` comparisons, which will raise `TypeError` or behave unexpectedly on string-typed columns.

Assign the result of `pd.to_numeric()` back to the dataframe column and verify the conversion produced valid numeric values:

🔧 Proposed fix

```diff
-    elif ion_df[col].dtype == object:
-        # Try to convert to numeric
-        try:
-            pd.to_numeric(ion_df[col], errors='raise')
-            valid_sample_cols.append(col)
-        except (ValueError, TypeError):
-            pass
+    elif ion_df[col].dtype == object:
+        # Try to convert to numeric
+        try:
+            ion_df[col] = pd.to_numeric(ion_df[col], errors='coerce')
+            if ion_df[col].notna().any():
+                valid_sample_cols.append(col)
+        except (ValueError, TypeError):
+            pass
```
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py`:
- Around line 675-689: The current check uses pd.to_numeric(sample_df[col],
errors='coerce').notna().all() which skips any column that has a single
non-numeric/missing value; instead coerce to numeric and skip only if the entire
column becomes empty. Replace the all() check with logic that converts the
column to numeric, drops NA (e.g., series_numeric =
pd.to_numeric(sample_df[col], errors='coerce').dropna()), continue only when
series_numeric is empty, and then use series_numeric (not the original
sample_df[col]) to compute intensities and contaminant handling for
sample_distribution and contaminant_distribution so partial missingness does not
remove valid data.
- Around line 576-583: Docstrings for combined_protein_reader,
combined_peptide_reader, and combined_ion_reader are out of sync with their
actual return values: each function returns two items (e.g., (protein_df,
sample_intensity_cols) or (peptide_df, sample_intensity_cols) or (ion_df,
validate_columns)) but the docstrings still describe a third item (mbr_cols /
mbr_info / sample_cols). Update each function's Returns section to list only the
two actual returned items and their meanings (remove
mbr_cols/mbr_info/sample_cols descriptions), explicitly name the returned
variables as used in the code (e.g., protein_df and sample_intensity_cols for
combined_protein_reader; peptide_df and sample_intensity_cols for
combined_peptide_reader; ion_df and validate_columns for combined_ion_reader) so
the documentation matches the implementation and call sites.
♻️ Duplicate comments (2)
pmultiqc/modules/fragpipe/fragpipe_io.py (2)
769-799: Peptide MBR stats are initialized but never computed. Lines 771–798 initialize `mbr_stats[sample]['peptides']`, but no peptide-level calculation is performed. Either compute peptide MBR stats or remove the placeholder to avoid misleading output.
846-861: Avoid mutating global REQUIRED_KEYWORDS.

Line 850: `required_keywords = REQUIRED_KEYWORDS[data_name]` mutates the shared dict and can cause stale state across calls (later validations can pass incorrectly). Use a per-call copy.

🔧 Proposed fix

```diff
-    required_keywords = REQUIRED_KEYWORDS[data_name]
+    required_keywords = {k: False for k in REQUIRED_KEYWORDS[data_name].keys()}
```
```python
    Returns
    -------
    tuple
        (protein_df, sample_intensity_cols, mbr_cols) where:
        - protein_df: DataFrame with protein-level data
        - sample_intensity_cols: List of sample intensity column names
        - mbr_cols: Dict mapping sample to MBR-related columns
    """
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# First, inspect the file to understand structure
wc -l pmultiqc/modules/fragpipe/fragpipe_io.py
# Find the combined_*_reader functions and their return statements
rg -nP '\bcombined_(protein|peptide|ion)_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -A 5 -B 2
```

Repository: bigbio/pmultiqc
Length of output: 174

🏁 Script executed:

```bash
#!/bin/bash
# Inspect specific lines for docstrings and returns
echo "=== Lines 576-583 (combined_protein_reader docstring) ==="
sed -n '576,583p' pmultiqc/modules/fragpipe/fragpipe_io.py
echo "=== Lines 625-635 (combined_protein_reader return) ==="
sed -n '625,635p' pmultiqc/modules/fragpipe/fragpipe_io.py
echo "=== Lines 706-735 (combined_peptide_reader docstring) ==="
sed -n '706,735p' pmultiqc/modules/fragpipe/fragpipe_io.py
echo "=== Lines 818-844 (combined_ion_reader docstring) ==="
sed -n '818,844p' pmultiqc/modules/fragpipe/fragpipe_io.py
```

Repository: bigbio/pmultiqc
Length of output: 2513

🏁 Script executed:

```bash
#!/bin/bash
# Find all call sites of combined_*_reader functions
echo "=== Call sites ==="
rg -nP '\bcombined_(protein|peptide|ion)_reader\s*\(' pmultiqc/modules/fragpipe/fragpipe_io.py -C 3
```

Repository: bigbio/pmultiqc
Length of output: 598

🏁 Script executed:

```bash
#!/bin/bash
# Search for function calls (not definitions)
echo "=== Calls to combined_protein_reader ==="
rg 'combined_protein_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_protein_reader'
echo "=== Calls to combined_peptide_reader ==="
rg 'combined_peptide_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_peptide_reader'
echo "=== Calls to combined_ion_reader ==="
rg 'combined_ion_reader' pmultiqc/modules/fragpipe/fragpipe_io.py -n | grep -v 'def combined_ion_reader'
```

Repository: bigbio/pmultiqc
Length of output: 381

🏁 Script executed:

```bash
#!/bin/bash
# Search across entire fragpipe module for calls
echo "=== All calls to combined_*_reader ==="
rg 'combined_(protein|peptide|ion)_reader' pmultiqc/modules/fragpipe/ -n | grep -v 'def combined_'
```

Repository: bigbio/pmultiqc
Length of output: 1096

🏁 Script executed:

```bash
#!/bin/bash
# Get the exact return statements for each function
echo "=== combined_protein_reader function definition and return ==="
sed -n '563,630p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20
echo "=== combined_peptide_reader function definition and return ==="
sed -n '694,730p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20
echo "=== combined_ion_reader function definition and return ==="
sed -n '804,845p' pmultiqc/modules/fragpipe/fragpipe_io.py | tail -20
```

Repository: bigbio/pmultiqc
Length of output: 2242

🏁 Script executed:

```bash
#!/bin/bash
# Check the fragpipe.py file to see the exact unpacking at call sites
echo "=== fragpipe.py line 202 context ==="
sed -n '200,205p' pmultiqc/modules/fragpipe/fragpipe.py
echo "=== fragpipe.py line 219 context ==="
sed -n '217,222p' pmultiqc/modules/fragpipe/fragpipe.py
echo "=== fragpipe.py line 233 context ==="
sed -n '231,236p' pmultiqc/modules/fragpipe/fragpipe.py
```

Repository: bigbio/pmultiqc
Length of output: 1209
Fix docstring return contracts for combined_*_reader functions.
Docstrings incorrectly document return values:

- `combined_protein_reader` docstring claims 3 returns (including `mbr_cols`) but only returns 2
- `combined_peptide_reader` docstring claims 3 returns (including `mbr_info`) but only returns 2
- `combined_ion_reader` docstring documents `sample_cols` but actually returns `validate_columns`
Call sites in fragpipe.py correctly unpack 2 values, so no runtime errors occur. Update docstrings to match actual returns.
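For instance, the `combined_protein_reader` Returns block could be trimmed as follows. A sketch consistent with the two-value unpacking confirmed above; the exact wording is illustrative:

```python
    Returns
    -------
    tuple
        (protein_df, sample_intensity_cols) where:
        - protein_df: DataFrame with protein-level data, or None if the file could not be parsed
        - sample_intensity_cols: List of sample intensity column names
    """
```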
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 576 - 583, Docstrings
for combined_protein_reader, combined_peptide_reader, and combined_ion_reader
are out of sync with their actual return values: each function returns two items
(e.g., (protein_df, sample_intensity_cols) or (peptide_df,
sample_intensity_cols) or (ion_df, validate_columns)) but the docstrings still
describe a third item (mbr_cols / mbr_info / sample_cols). Update each
function's Returns section to list only the two actual returned items and their
meanings (remove mbr_cols/mbr_info/sample_cols descriptions), explicitly name
the returned variables as used in the code (e.g., protein_df and
sample_intensity_cols for combined_protein_reader; peptide_df and
sample_intensity_cols for combined_peptide_reader; ion_df and validate_columns
for combined_ion_reader) so the documentation matches the implementation and
call sites.
```python
            if not pd.to_numeric(sample_df[col], errors='coerce').notna().all():
                continue

            # Sample intensities
            intensities = sample_df[col].dropna()
            valid_intensities = intensities[intensities > 0]
            if len(valid_intensities) > 0:
                sample_distribution[col] = np.log2(valid_intensities).tolist()

            # Contaminant intensities
            if not contam_df.empty and col in contam_df.columns:
                cont_intensities = contam_df[col].dropna()
                valid_cont = cont_intensities[cont_intensities > 0]
                if len(valid_cont) > 0:
                    contaminant_distribution[col] = np.log2(valid_cont).tolist()
```
🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Check if the file exists and read the relevant lines
if [ -f "pmultiqc/modules/fragpipe/fragpipe_io.py" ]; then
    echo "=== File found. Reading lines 670-695 for context ==="
    sed -n '670,695p' pmultiqc/modules/fragpipe/fragpipe_io.py | cat -n
else
    echo "File not found at expected location. Searching for it..."
    find . -name "fragpipe_io.py" -type f
fi
```

Repository: bigbio/pmultiqc
Length of output: 1172
Avoid dropping entire columns when any single intensity value is missing.
Line 675–676: The check notna().all() skips columns if any value is missing or non-numeric. This is overly strict and silently removes valid samples from QC plots—common in real proteomics datasets with partial missingness. Only skip columns that are completely empty after numeric coercion.
🔧 Proposed fix

```diff
-            if not pd.to_numeric(sample_df[col], errors='coerce').notna().all():
-                continue
-
-            # Sample intensities
-            intensities = sample_df[col].dropna()
+            values = pd.to_numeric(sample_df[col], errors='coerce')
+            if values.notna().sum() == 0:
+                continue
+
+            # Sample intensities
+            intensities = values.dropna()
             valid_intensities = intensities[intensities > 0]
             if len(valid_intensities) > 0:
                 sample_distribution[col] = np.log2(valid_intensities).tolist()
@@
-                cont_intensities = contam_df[col].dropna()
+                cont_intensities = pd.to_numeric(contam_df[col], errors='coerce').dropna()
                 valid_cont = cont_intensities[cont_intensities > 0]
```
In `@pmultiqc/modules/fragpipe/fragpipe_io.py` around lines 675 - 689, The current
check uses pd.to_numeric(sample_df[col], errors='coerce').notna().all() which
skips any column that has a single non-numeric/missing value; instead coerce to
numeric and skip only if the entire column becomes empty. Replace the all()
check with logic that converts the column to numeric, drops NA (e.g.,
series_numeric = pd.to_numeric(sample_df[col], errors='coerce').dropna()),
continue only when series_numeric is empty, and then use series_numeric (not the
original sample_df[col]) to compute intensities and contaminant handling for
sample_distribution and contaminant_distribution so partial missingness does not
remove valid data.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe.py`:
- Around line 229-240: The success log "Combined ion data loaded successfully."
is emitted before you verify combined_ion_df and combined_ion_column_valid; move
or change the log so it only runs after the validation check. In the try block
call combined_ion_reader(combined_ion_path) as you do, then if combined_ion_df
is not None and combined_ion_column_valid set self.msms_counts via
get_msms_counts_per_peak and call log.info("Combined ion data loaded
successfully.") there; otherwise emit a warning or skip logging. Update
references to combined_ion_reader, combined_ion_df, combined_ion_column_valid,
get_msms_counts_per_peak, self.msms_counts, and log.info accordingly.
- Around line 215-227: The success log is printed before validating the parsed
DataFrame and contains a typo; move the "Combined peptide data loaded
successfully." log so it runs only after confirming peptide_df is not None and
combined_peptide_column_valid, change the other log message in the same block
from "peptide statics loaded successfully." to "peptide statistics loaded
successfully.", and ensure these messages are emitted around the
combined_peptide_reader call and the cal_peptide_id_gain assignment (referencing
combined_peptide_reader, peptide_df, combined_peptide_column_valid,
cal_peptide_id_gain, and self.peptide_id_gain) while keeping the existing
exception log (log.warning) intact.
- Around line 390-404: Either remove the unused get_mbr_stats import and the
commented draw_mbr_contribution block, or populate self.mbr_stats in get_data()
by calling get_mbr_stats(...) with the appropriate protein/peptide DataFrames
and then enable the visualization; specifically, locate the get_data() method in
fragpipe.py and set self.mbr_stats = get_mbr_stats(protein_df, peptide_df) (or
similar variable names used there), ensure get_mbr_stats is imported, and then
uncomment/restore the draw_mbr_contribution(...) call that references
self.mbr_stats and draw_mbr_contribution/sub_sections["identification"] so the
MBR visualization has valid data.
🧹 Nitpick comments (4)
pmultiqc/modules/fragpipe/fragpipe.py (4)
161-185: Consider catching more specific exceptions.

The generic `Exception` catches work for graceful fallback, but catching specific exceptions (e.g., `FileNotFoundError`, `ValueError`, `KeyError`) would improve debuggability and avoid masking unexpected errors.

♻️ Example for workflow parsing

```diff
 if self.fragpipe_files.get("workflow"):
     try:
         workflow_path = self.fragpipe_files["workflow"][0]
         self.parameters = workflow_reader(workflow_path)
         if self.parameters:
             log.info("Workflow parameters loaded successfully.")
-    except Exception as e:
+    except (FileNotFoundError, IOError, ValueError) as e:
         log.warning(f"Error parsing workflow file: {e}")
```
198-213: Use defensive access for the `contaminant_affix` config.

Direct dictionary access `config.kwargs["contaminant_affix"]` will raise `KeyError` if the config is missing. Consider using `.get()` with a default value for robustness.

♻️ Proposed fix

```diff
 if self.protein_df is not None and self.protein_intensity_cols:
-    contam_affix = config.kwargs["contaminant_affix"]
+    contam_affix = config.kwargs.get("contaminant_affix", "CONT")
     (
         self.protein_intensity_distribution,
         self.protein_contam_distribution
```
1152-1165: Fix implicit `Optional` type hint.

Per PEP 484, use explicit `Optional[dict]` or `dict | None` instead of `dict = None`.

♻️ Proposed fix

```diff
 @staticmethod
-def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict = None):
+def draw_protein_intensity_distribution(sub_section, sample_distribution: dict, contam_distribution: dict | None = None):
```
1241-1251: Use `_` for the unused loop variable.

The loop variable `sample` at line 1243 is not used in the loop body.

♻️ Proposed fix

```diff
 has_data = False
-for sample, stats in mbr_stats.items():
+for _, stats in mbr_stats.items():
     proteins = stats.get('proteins', {})
     if proteins.get('mbr_only', 0) > 0 or proteins.get('both', 0) > 0:
         has_data = True
         break
```
```python
        # Parse combined_peptide.tsv
        if self.fragpipe_files.get("combined_peptide"):
            try:
                peptide_path = self.fragpipe_files["combined_peptide"][0]
                peptide_df, combined_peptide_column_valid = combined_peptide_reader(peptide_path)
                log.info("Combined peptide data loaded successfully.")

                if peptide_df is not None and combined_peptide_column_valid:
                    self.peptide_id_gain = cal_peptide_id_gain(peptide_df)
                    log.info("peptide statics loaded successfully.")

            except Exception as e:
                log.warning(f"Error parsing combined_peptide.tsv: {e}")
```
Fix log message placement and typo.
The success log at line 220 fires before validation, and line 224 has a typo ("statics" → "statistics").
🐛 Proposed fix

```diff
 if self.fragpipe_files.get("combined_peptide"):
     try:
         peptide_path = self.fragpipe_files["combined_peptide"][0]
         peptide_df, combined_peptide_column_valid = combined_peptide_reader(peptide_path)
-        log.info("Combined peptide data loaded successfully.")
         if peptide_df is not None and combined_peptide_column_valid:
+            log.info("Combined peptide data loaded successfully.")
             self.peptide_id_gain = cal_peptide_id_gain(peptide_df)
-            log.info("peptide statics loaded successfully.")
+            log.info("Peptide statistics loaded successfully.")
     except Exception as e:
         log.warning(f"Error parsing combined_peptide.tsv: {e}")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
The suggested replacement:

```python
# Parse combined_peptide.tsv
if self.fragpipe_files.get("combined_peptide"):
    try:
        peptide_path = self.fragpipe_files["combined_peptide"][0]
        peptide_df, combined_peptide_column_valid = combined_peptide_reader(peptide_path)
        if peptide_df is not None and combined_peptide_column_valid:
            log.info("Combined peptide data loaded successfully.")
            self.peptide_id_gain = cal_peptide_id_gain(peptide_df)
            log.info("Peptide statistics loaded successfully.")
    except Exception as e:
        log.warning(f"Error parsing combined_peptide.tsv: {e}")
```
🧰 Tools
🪛 Ruff (0.14.13)
226-226: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 215 - 227, The success
log is printed before validating the parsed DataFrame and contains a typo; move
the "Combined peptide data loaded successfully." log so it runs only after
confirming peptide_df is not None and combined_peptide_column_valid, change the
other log message in the same block from "peptide statics loaded successfully."
to "peptide statistics loaded successfully.", and ensure these messages are
emitted around the combined_peptide_reader call and the cal_peptide_id_gain
assignment (referencing combined_peptide_reader, peptide_df,
combined_peptide_column_valid, cal_peptide_id_gain, and self.peptide_id_gain)
while keeping the existing exception log (log.warning) intact.
```python
        # Parse combined_ion.tsv for MS/MS counts
        if self.fragpipe_files.get("combined_ion"):
            try:
                combined_ion_path = self.fragpipe_files["combined_ion"][0]
                combined_ion_df, combined_ion_column_valid = combined_ion_reader(combined_ion_path)
                log.info("Combined ion data loaded successfully.")

                if combined_ion_df is not None and combined_ion_column_valid:
                    self.msms_counts = get_msms_counts_per_peak(combined_ion_df)
                    log.info("MS/MS counts per peak loaded successfully.")
            except Exception as e:
                log.warning(f"Error parsing combined_ion.tsv: {e}")
```
Log message fires before validation check.
Similar to the peptide section, the success log at line 234 fires before the validity check at line 236.
🐛 Proposed fix

```diff
 if self.fragpipe_files.get("combined_ion"):
     try:
         combined_ion_path = self.fragpipe_files["combined_ion"][0]
         combined_ion_df, combined_ion_column_valid = combined_ion_reader(combined_ion_path)
-        log.info("Combined ion data loaded successfully.")
         if combined_ion_df is not None and combined_ion_column_valid:
+            log.info("Combined ion data loaded successfully.")
             self.msms_counts = get_msms_counts_per_peak(combined_ion_df)
             log.info("MS/MS counts per peak loaded successfully.")
```

🧰 Tools
🪛 Ruff (0.14.13)
239-239: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 229 - 240, The success
log "Combined ion data loaded successfully." is emitted before you verify
combined_ion_df and combined_ion_column_valid; move or change the log so it only
runs after the validation check. In the try block call
combined_ion_reader(combined_ion_path) as you do, then if combined_ion_df is not
None and combined_ion_column_valid set self.msms_counts via
get_msms_counts_per_peak and call log.info("Combined ion data loaded
successfully.") there; otherwise emit a warning or skip logging. Update
references to combined_ion_reader, combined_ion_df, combined_ion_column_valid,
get_msms_counts_per_peak, self.msms_counts, and log.info accordingly.
```python
        # MBR (Match-Between-Runs) visualization
        # if self.mbr_stats:
        #     self.draw_mbr_contribution(
        #         sub_section=self.sub_sections["identification"],
        #         mbr_stats=self.mbr_stats
        #     )

        # MS/MS counts per peak from combined_ion.tsv
        if self.msms_counts:
            draw_oversampling(
                self.sub_sections["ms2"],
                self.msms_counts,
                "",
                "fragpipe",
            )
```
🧩 Analysis chain

🏁 Script executed:

```bash
rg -n "mbr_stats" --type py
```

Repository: bigbio/pmultiqc
Length of output: 1291

🏁 Script executed:

```bash
# Check the fragpipe.py file for imports and get_mbr_stats usage
rg -n "get_mbr_stats" --type py
```

Repository: bigbio/pmultiqc
Length of output: 219

🏁 Script executed:

```bash
# Get the fragpipe.py file to check imports at the top
head -50 pmultiqc/modules/fragpipe/fragpipe.py | cat -n
```

Repository: bigbio/pmultiqc
Length of output: 1679

🏁 Script executed:

```bash
# Check the get_data method in fragpipe.py
rg -A 100 "def get_data" pmultiqc/modules/fragpipe/fragpipe.py | head -150
```

Repository: bigbio/pmultiqc
Length of output: 4553

🏁 Script executed:

```bash
# Check if get_mbr_stats is called anywhere (not just imported)
rg -B 2 -A 5 "get_mbr_stats\(" pmultiqc/modules/fragpipe/fragpipe.py
```

Repository: bigbio/pmultiqc
Length of output: 41

🏁 Script executed:

```bash
# Search for self.mbr_stats assignment
rg -n "self\.mbr_stats\s*=" pmultiqc/modules/fragpipe/fragpipe.py
```

Repository: bigbio/pmultiqc
Length of output: 41

🏁 Script executed:

```bash
# Get more of the get_data method to see if mbr_stats is populated
sed -n '130,250p' pmultiqc/modules/fragpipe/fragpipe.py | cat -n
```

Repository: bigbio/pmultiqc
Length of output: 6305
Remove unused MBR import or implement missing data loading.
The get_mbr_stats function is imported but never called, and self.mbr_stats is never populated in get_data(). The commented code at lines 391–395 references this undefined attribute. Either remove the dead import and commented visualization code, or complete the implementation by calling get_mbr_stats() with the protein and peptide dataframes to populate self.mbr_stats.
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 390 - 404, Either remove
the unused get_mbr_stats import and the commented draw_mbr_contribution block,
or populate self.mbr_stats in get_data() by calling get_mbr_stats(...) with the
appropriate protein/peptide DataFrames and then enable the visualization;
specifically, locate the get_data() method in fragpipe.py and set self.mbr_stats
= get_mbr_stats(protein_df, peptide_df) (or similar variable names used there),
ensure get_mbr_stats is imported, and then uncomment/restore the
draw_mbr_contribution(...) call that references self.mbr_stats and
draw_mbr_contribution/sub_sections["identification"] so the MBR visualization
has valid data.
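If the "complete the implementation" route is taken, the wiring could look roughly like this. A sketch only; the exact signature of `get_mbr_stats` and the attribute names are assumptions drawn from the snippets quoted earlier in this thread, not verified against the module:

```python
# Inside FragPipeModule.get_data(), after the protein dataframe is parsed:
if self.protein_df is not None and self.protein_intensity_cols:
    # get_mbr_stats iterates sample columns against protein_df, as quoted above
    self.mbr_stats = get_mbr_stats(
        protein_df=self.protein_df,
        sample_cols=self.protein_intensity_cols,
    )
```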