Peptide level distributions by ypriverol · Pull Request #554 · bigbio/pmultiqc

ypriverol · 2026-01-18T13:44:28Z

Pull Request

Description

Brief description of the changes made in this PR.

Type of Change

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Performance improvement
Code refactoring
Test addition/update
Updates to the dependencies have been made.

Summary by CodeRabbit

New Features
- Added peptide length distribution visualizations across DIA-NN, FragPipe, MaxQuant, and QuantMS analysis modules, enabling users to view peptide length patterns per run/sample with detailed contextual guidance.


Add peptide length distribution

coderabbitai · 2026-01-18T13:44:59Z

📝 Walkthrough

Walkthrough

This PR adds peptide length distribution computation and visualization across multiple proteomics modules (DIA-NN, FragPipe, MaxQuant, QuantMS). It introduces helper functions to extract per-run peptide length statistics from parsed data and adds plotting functions to visualize these distributions.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| DIA-NN parsing utilities `pmultiqc/modules/common/dia_utils.py` | Adds private helper `_get_peptide_length(df)` to compute per-run peptide length distributions from Stripped.Sequence column; propagates result through `parse_diann_report` return tuple |
| FragPipe integration `pmultiqc/modules/fragpipe/fragpipe.py`, `pmultiqc/modules/fragpipe/fragpipe_io.py` | Adds `self.peptide_length` data member and `draw_peptide_length()` static method; updates `parse_psm` return values to include peptide_length; adds "Peptide Length" to REQUIRED_COLS["psm"] |
| MaxQuant integration `pmultiqc/modules/maxquant/maxquant.py`, `pmultiqc/modules/maxquant/maxquant_utils.py` | Introduces `evidence_peptide_length(df)` function to compute distribution per raw file; adds plotting call via `id_plots.draw_peptide_length_distribution`; initializes peptide_length field in evidence data structure |
| QuantMS integration `pmultiqc/modules/quantms/quantms.py` | Adds `self.peptide_length` dictionary initialization; computes and stores peptide length stats in parsing flow; calls `draw_peptide_length_distribution` in plotting sequence |
| DIA-NN module `pmultiqc/modules/diann/diann.py` | Imports `draw_peptide_length_distribution` and integrates peptide_length plotting after long_trends rendering when data is present |
| Common plotting utilities `pmultiqc/modules/common/plots/id.py` | Adds `draw_peptide_length_distribution(sub_section, plot_data)` function with linegraph visualization, legend, and helptext describing peptide length across platforms (FragPipe, MaxQuant, DIA-NN, quantms); note: function defined twice with identical implementations |
| mzTab PSM parsing `pmultiqc/modules/common/ms/mztab.py` | Adds `pep_length` column to PSM dataframe by computing sequence string length |

Possibly related issues

[GENERAL] New metrics based on FragPipe feedback #546 — Main changes implement per-run peptide length extraction and visualization, directly addressing the requested "Peptide Length Distribution" metric feature across all supported platforms

Possibly related PRs

Dev #237 — Modifies QuantMSModule.parse_diann_report (this PR adds peptide_length propagation; that PR alters file-format handling and grouping logic)
SDRF integration in more plots #484 — Modifies parse_diann_report in pmultiqc/modules/common/dia_utils.py (overlapping function modifications)
FragPipe first implementation #495 — Modifies FragPipe integration including fragpipe.py and fragpipe_io.py (overlapping PSM/column handling)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Review effort 3/5

Suggested reviewers

yueqixuan
daichengxin

Poem

🐰 Hopping through peptides, long and short,
Six modules now report with care,
Length distributions, plotted fair,
From FragPipe to DIA-NN's fort,
A feature blooms, shared everywhere! ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title "Peptide level distributions" accurately reflects the main change: adding peptide length distribution visualization across multiple proteomics analysis modules. |


coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@pmultiqc/modules/fragpipe/fragpipe.py`:
- Around line 893-917: The "Peptide Length" column may be string-typed causing
lexicographic ordering; inside draw_peptide_length convert df["Peptide Length"]
to numeric (use pandas.to_numeric with errors='coerce'), drop or dropna invalid
entries, optionally cast to int, then proceed to compute value_counts() and
sort_index() so bins are ordered numerically; update references to df and the
grouping logic in draw_peptide_length to use the cleaned numeric column
(mirroring the approach used in draw_delta_mass() and draw_ids_over_rt()) before
building plot_data for draw_peptide_length_distribution.

🧹 Nitpick comments (4)

pmultiqc/modules/common/ms/mztab.py (1)
138-139: Consider handling potential null values in sequence column.

If the sequence column contains NaN or None values, len(x) will raise a TypeError. Consider adding null-safety:
-        psm["pep_length"] = psm["sequence"].apply(lambda x: len(x))
+        psm["pep_length"] = psm["sequence"].apply(lambda x: len(x) if pd.notna(x) else 0)
Also, note that this column is named pep_length while other modules (MaxQuant, DIA-NN, FragPipe) use length or peptide_length. Consider aligning the naming for consistency across the codebase.
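A minimal sketch of a null-safe alternative, assuming the mzTab PSM table is a pandas DataFrame with a `sequence` column as in the diff above. The helper name `sequence_lengths` is hypothetical and not part of pmultiqc. It relies on `Series.str.len()`, which propagates missing values instead of raising, so missing sequences stay missing rather than landing in a length-0 bin:

```python
import pandas as pd


def sequence_lengths(sequences: pd.Series) -> pd.Series:
    """Return per-row peptide lengths; missing sequences yield <NA>.

    Hypothetical helper for illustration only (not pmultiqc code).
    """
    # .str.len() propagates missing values instead of raising TypeError;
    # the nullable "Int64" dtype keeps lengths integral despite gaps.
    return sequences.str.len().astype("Int64")


# Toy PSM table with one missing sequence (made-up data).
psm = pd.DataFrame({"sequence": ["PEPTIDE", None, "ACDK"]})
psm["pep_length"] = sequence_lengths(psm["sequence"])
print(psm)
```

Whether a missing sequence should be dropped, left as `<NA>`, or mapped to 0 (as in the suggested diff) depends on how the downstream plot should treat such rows.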
pmultiqc/modules/maxquant/maxquant_utils.py (1)
1235-1249: Remove unused column from validation check.

The function validates that both length and sequence columns exist, but only uses length (and raw file). The sequence column check appears unnecessary.
♻️ Suggested fix
 def evidence_peptide_length(df):
     if any(
             column not in df.columns
-            for column in ["length", "sequence"]
+            for column in ["length", "raw file"]
     ):
         return None
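The validation in the suggested fix can be exercised in isolation. The following is a sketch only: the function name `peptide_length_by_raw_file` is hypothetical, and the lower-cased column names `length` and `raw file` are taken from the diff above rather than from the full pmultiqc source. The aggregation step mirrors the `value_counts().sort_index().to_dict()` pattern quoted elsewhere in this review.

```python
import pandas as pd

# Columns the function actually reads (assumed from the review diff).
REQUIRED_COLUMNS = ("length", "raw file")


def peptide_length_by_raw_file(df: pd.DataFrame):
    """Return {raw file: {peptide length: count}}, or None if columns are missing.

    Hypothetical stand-in for evidence_peptide_length(); illustration only.
    """
    if any(column not in df.columns for column in REQUIRED_COLUMNS):
        return None

    return {
        raw_file: group["length"].value_counts().sort_index().to_dict()
        for raw_file, group in df.groupby("raw file")
    }


# Toy evidence table (made-up data).
evidence = pd.DataFrame(
    {"raw file": ["a", "a", "b"], "length": [8, 8, 9]}
)
print(peptide_length_by_raw_file(evidence))
```

Checking only the columns that are read means a table without a `sequence` column is still plotted, which is the point of the suggested change.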
pmultiqc/modules/common/dia_utils.py (1)
323-337: Fix membership test syntax (E713) and add safety for null values.

Per static analysis, use not in instead of not ... in. Also consider handling potential null values in Stripped.Sequence.
♻️ Suggested fix
 def _get_peptide_length(df):
 
-    if not "Stripped.Sequence" in df.columns:
+    if "Stripped.Sequence" not in df.columns:
         return None
 
     df_sub = df[["Run", "Stripped.Sequence"]].copy()
-    df_sub["length"] = df_sub["Stripped.Sequence"].apply(lambda x: len(x))
+    df_sub = df_sub.dropna(subset=["Stripped.Sequence"])
+    df_sub["length"] = df_sub["Stripped.Sequence"].apply(len)
 
     plot_data = {}
     for run, group in df_sub.groupby("Run"):
         stats_dict = group["length"].value_counts().sort_index().to_dict()
         plot_data[run] = stats_dict
 
     return plot_data
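To illustrate the shape of the data this helper would hand to the plotting function, here is a standalone sketch of the suggested version run on a toy DIA-NN-style frame. The function is renamed `get_peptide_length` to mark it as an illustrative copy, and the sample rows are made up.

```python
import pandas as pd


def get_peptide_length(df: pd.DataFrame):
    """Illustrative copy of the suggested helper: {run: {length: count}}."""
    if "Stripped.Sequence" not in df.columns:
        return None

    df_sub = df[["Run", "Stripped.Sequence"]].copy()
    # Drop rows without a sequence so len() is never called on NaN/None.
    df_sub = df_sub.dropna(subset=["Stripped.Sequence"])
    df_sub["length"] = df_sub["Stripped.Sequence"].apply(len)

    plot_data = {}
    for run, group in df_sub.groupby("Run"):
        plot_data[run] = group["length"].value_counts().sort_index().to_dict()
    return plot_data


# Toy report: run2 contains one row with a missing sequence.
report = pd.DataFrame(
    {
        "Run": ["run1", "run1", "run1", "run2", "run2"],
        "Stripped.Sequence": ["PEPTIDE", "ACDEFGK", "ACDK", "ACDK", None],
    }
)
print(get_peptide_length(report))
```

Each run maps to a dictionary of peptide length to count, ordered by length; the row with the missing sequence is dropped instead of raising.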
pmultiqc/modules/common/plots/id.py (1)
1137-1167: LGTM with minor naming suggestion.

The function is well-structured and follows the established pattern for plotting functions. The configuration and helptext are appropriate.

Consider renaming box_html to line_html for clarity since this uses linegraph.plot, not a box plot:
♻️ Optional naming improvement
-    box_html = linegraph.plot(plot_data, pconfig=draw_config)
-
-    box_html = plot_html_check(box_html)
+    line_html = linegraph.plot(plot_data, pconfig=draw_config)
+
+    line_html = plot_html_check(line_html)
 
     add_sub_section(
         sub_section=sub_section,
-        plot=box_html,
+        plot=line_html,
         order=8,

coderabbitai · 2026-01-18T13:54:09Z

pmultiqc/modules/fragpipe/fragpipe.py

+    # Peptide Length Distribution
+    @staticmethod
+    def draw_peptide_length(sub_section, peptide_length: list):
+
+        if not peptide_length:
+            log.warning("No peptide_length data; skipping peptide_length.")
+            return
+
+        df = pd.concat(peptide_length, ignore_index=True)
+
+        if df.empty:
+            log.warning("Peptide Length DataFrame is empty; skipping Peptide Length.")
+            return
+
+        log.info(f"Number of Peptide Length rows in DataFrame: {len(df)}")
+
+        plot_data = {}
+        for sample, group in df.groupby("Run"):
+            stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict()
+            plot_data[sample] = stats_dict
+
+        draw_peptide_length_distribution(
+            sub_section=sub_section,
+            plot_data=plot_data
+        )


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file and context around the flagged lines
cd /tmp && find . -type f -name "fragpipe.py" -path "*/modules/fragpipe/*" 2>/dev/null | head -5

Repository: bigbio/pmultiqc

Length of output: 41

🏁 Script executed:

# Search for the file in the repository
find . -type f -name "fragpipe.py" 2>/dev/null | grep -i fragpipe

Repository: bigbio/pmultiqc

Length of output: 98

🏁 Script executed:

# List files to understand structure
git ls-files | grep -i fragpipe | head -20

Repository: bigbio/pmultiqc

Length of output: 264

🏁 Script executed:

# Read the fragpipe.py file and check the relevant section and surrounding context
head -950 pmultiqc/modules/fragpipe/fragpipe.py | tail -150

Repository: bigbio/pmultiqc

Length of output: 5013

🏁 Script executed:

# Search for where peptide_length is populated
rg "peptide_length" pmultiqc/modules/fragpipe/ -n

Repository: bigbio/pmultiqc

Length of output: 1277

🏁 Script executed:

# Look at fragpipe_io.py to understand data parsing
cat -n pmultiqc/modules/fragpipe/fragpipe_io.py | head -200

Repository: bigbio/pmultiqc

Length of output: 8723

🏁 Script executed:

# Check the test PSM file to see what "Peptide Length" values look like
head -20 tests/resources/fragpipe/psm.tsv | cut -f1-20

Repository: bigbio/pmultiqc

Length of output: 5803

🏁 Script executed:

# Get column headers and check for Peptide Length column
head -1 tests/resources/fragpipe/psm.tsv | tr '\t' '\n' | nl

Repository: bigbio/pmultiqc

Length of output: 101

🏁 Script executed:

# Extract just the Peptide Length column from the test file
cut -f $(head -1 tests/resources/fragpipe/psm.tsv | tr '\t' '\n' | grep -n "Peptide Length" | cut -d: -f1) tests/resources/fragpipe/psm.tsv | head -20

Repository: bigbio/pmultiqc

Length of output: 115

🏁 Script executed:

# Check the exact data types when read by psm_reader
python3 << 'EOF'
import pandas as pd

# Simulate what psm_reader does
psm_df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t")

# Check the Peptide Length column type
print(f"Peptide Length dtype: {psm_df['Peptide Length'].dtype}")
print(f"Sample values: {psm_df['Peptide Length'].head(10).tolist()}")

# Test the issue - how value_counts().sort_index() behaves
test_dict = psm_df[["Peptide Length"]].groupby(psm_df["Peptide Length"]).size().to_dict()
print(f"\nvalue_counts().sort_index() result (as dict): {test_dict}")

# If it were string-typed, show the difference
psm_df["Peptide Length"] = psm_df["Peptide Length"].astype(str)
test_dict_str = psm_df[["Peptide Length"]].groupby(psm_df["Peptide Length"]).size().to_dict()
print(f"String-typed sort_index() result: {test_dict_str}")
EOF

Repository: bigbio/pmultiqc

Length of output: 145

🏁 Script executed:

# Check what happens in the actual draw_peptide_length function with the current data
python3 << 'EOF'
import pandas as pd

# Simulate what happens in draw_peptide_length
psm_df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t")

# Extract Run and Peptide Length as done in line 519
peptide_length_data = [psm_df[["Run", "Peptide Length"]].copy()]

# Now simulate the draw_peptide_length function
df = pd.concat(peptide_length_data, ignore_index=True)
print(f"After concat, dtype: {df['Peptide Length'].dtype}")

# Simulate the groupby and value_counts operations
for sample, group in df.groupby("Run"):
    stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict()
    print(f"Sample: {sample}")
    print(f"Stats dict: {stats_dict}")
    print(f"Keys type: {type(list(stats_dict.keys())[0]) if stats_dict else 'N/A'}")
    break  # Just check the first one
EOF

Repository: bigbio/pmultiqc

Length of output: 145

🏁 Script executed:

# Check if there's any explicit dtype conversion in the psm_reader that I might have missed
sed -n '125,138p' pmultiqc/modules/fragpipe/fragpipe_io.py

Repository: bigbio/pmultiqc

Length of output: 480

🏁 Script executed:

# Let me check the actual behavior with a simpler test
python3 << 'EOF'
import pandas as pd
import io

# Create test data exactly like what pandas would read from the TSV
data = """Run	Peptide Length
sample1	10
sample1	8
sample1	7
sample1	10
sample1	11
sample1	2
sample1	1"""

df = pd.read_csv(io.StringIO(data), sep="\t")
print(f"Inferred dtype: {df['Peptide Length'].dtype}")

# Test current behavior
group_dict = df["Peptide Length"].value_counts().sort_index().to_dict()
print(f"Current sort_index() result: {group_dict}")
print(f"Order of keys: {list(group_dict.keys())}")

# Now test if it were string-typed (the concern in the review)
df_str = df.copy()
df_str["Peptide Length"] = df_str["Peptide Length"].astype(str)
print(f"\nString dtype: {df_str['Peptide Length'].dtype}")
group_dict_str = df_str["Peptide Length"].value_counts().sort_index().to_dict()
print(f"String sort_index() result: {group_dict_str}")
print(f"Order of keys (WRONG): {list(group_dict_str.keys())}")
EOF

Repository: bigbio/pmultiqc

Length of output: 177

🏁 Script executed:

# Check if there's any test for the peptide_length function to see how it's actually used
rg "draw_peptide_length" tests/ -A 5 -B 2

Repository: bigbio/pmultiqc

Length of output: 41

🏁 Script executed:

# Check the fragpipe module test file
head -100 tests/test_fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 4112

🏁 Script executed:

# Check how pandas infers the dtype when reading the test PSM file
python3 << 'EOF'
import pandas as pd

# Read the test PSM file as pandas would
df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t", nrows=10)

# Check the dtype of Peptide Length
print(f"Peptide Length dtype: {df['Peptide Length'].dtype}")
print(f"Sample values: {df['Peptide Length'].tolist()}")

# Check what value_counts().sort_index() produces
vc = df["Peptide Length"].value_counts().sort_index()
print(f"\nvalue_counts().sort_index() keys: {list(vc.index)}")
print(f"Keys are numeric: {all(isinstance(k, (int, float)) for k in vc.index)}")
EOF

Repository: bigbio/pmultiqc

Length of output: 177

🏁 Script executed:

# Check how the data flows through the code - look at where peptide_length is populated
sed -n '447,555p' pmultiqc/modules/fragpipe/fragpipe.py | head -120

Repository: bigbio/pmultiqc

Length of output: 3864

🏁 Script executed:

# Check how value_counts and sort_index interact in different scenarios
# Let me look at the exact code pattern more carefully
sed -n '893,920p' pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 908

🏁 Script executed:

# Let's also compare the two similar functions side by side to see the difference
# draw_ids_over_rt (which has numeric conversion) vs draw_peptide_length (which doesn't)
sed -n '925,950p' pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 780

🏁 Script executed:

# Check if there are any other similar patterns in the file where numeric conversion is used
rg "pd.to_numeric" pmultiqc/modules/fragpipe/fragpipe.py -B 2 -A 2

Repository: bigbio/pmultiqc

Length of output: 348

Ensure peptide lengths are numeric to avoid lexicographic bin ordering.

The "Peptide Length" column can be read as string-typed by pandas, causing sort_index() to order bins as 1, 10, 11, 2, ... instead of 1, 2, 10, 11, .... Other functions in the same file (e.g., draw_delta_mass(), draw_ids_over_rt()) already convert their columns to numeric before aggregating, so apply the same defensive approach here.

Proposed fix

         df = pd.concat(peptide_length, ignore_index=True)
+        df["Peptide Length"] = pd.to_numeric(df["Peptide Length"], errors="coerce")
+        df = df.dropna(subset=["Peptide Length"])
 
         if df.empty:
             log.warning("Peptide Length DataFrame is empty; skipping Peptide Length.")
             return
 
         log.info(f"Number of Peptide Length rows in DataFrame: {len(df)}")
 
         plot_data = {}
         for sample, group in df.groupby("Run"):
-            stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict()
+            stats_dict = (
+                group["Peptide Length"]
+                .astype(int)
+                .value_counts()
+                .sort_index()
+                .to_dict()
+            )
             plot_data[sample] = stats_dict
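The ordering problem the proposed fix guards against can be reproduced on a small series. This is a sketch under the assumption that lengths arrive as strings, possibly with an unparseable entry; the helper name `length_counts` and the sample values are illustrative and not taken from pmultiqc.

```python
import pandas as pd


def length_counts(values: pd.Series) -> dict:
    """Count peptide lengths with bins in numeric order (illustrative helper)."""
    # Coerce to numbers, discard entries that could not be parsed, then bin.
    numeric = pd.to_numeric(values, errors="coerce").dropna().astype(int)
    return numeric.value_counts().sort_index().to_dict()


raw = pd.Series(["10", "8", "7", "10", "11", "2", "1", "n/a"])

# String-typed index sorts lexicographically.
print(list(raw.value_counts().sort_index().index))
# ['1', '10', '11', '2', '7', '8', 'n/a']

# After numeric conversion the bins come out in numeric order.
print(list(length_counts(raw)))
# [1, 2, 7, 8, 10, 11]
```

With `errors="coerce"`, the unparseable `"n/a"` entry becomes NaN and is dropped before counting, so it neither raises nor creates a bin of its own.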


yueqixuan and others added 2 commits January 18, 2026 21:22

add peptide length distribution

1d3642c

Merge pull request #553 from yueqixuan/dev

42429f0

Add peptide length distribution

ypriverol merged commit 61a0b13 into main Jan 18, 2026
21 of 23 checks passed

coderabbitai bot reviewed Jan 18, 2026


ypriverol mentioned this pull request Feb 7, 2026

[GENERAL] New metrics based on FragPipe feedback #546

Closed

8 tasks

ypriverol merged 2 commits into main from dev
