Skip to content

Peptide level distributions #554

Merged
ypriverol merged 2 commits intomainfrom
dev
Jan 18, 2026
Merged

Peptide level distributions #554
ypriverol merged 2 commits intomainfrom
dev

Conversation

@ypriverol
Copy link
Member

@ypriverol ypriverol commented Jan 18, 2026

Pull Request

Description

Brief description of the changes made in this PR.

Type of Change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Performance improvement
  • Code refactoring
  • Test addition/update
  • Updates to the dependencies has been done.

Summary by CodeRabbit

  • New Features
    • Added peptide length distribution visualizations across DIA-NN, FragPipe, MaxQuant, and QuantMS analysis modules, enabling users to view peptide length patterns per run/sample with detailed contextual guidance.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 18, 2026

📝 Walkthrough

Walkthrough

This PR adds peptide length distribution computation and visualization across multiple proteomics modules (DIA-NN, FragPipe, MaxQuant, QuantMS). It introduces helper functions to extract per-run peptide length statistics from parsed data and adds plotting functions to visualize these distributions.

Changes

Cohort / File(s) Summary
DIA-NN parsing utilities
pmultiqc/modules/common/dia_utils.py
Adds private helper _get_peptide_length(df) to compute per-run peptide length distributions from Stripped.Sequence column; propagates result through parse_diann_report return tuple
FragPipe integration
pmultiqc/modules/fragpipe/fragpipe.py, pmultiqc/modules/fragpipe/fragpipe_io.py
Adds self.peptide_length data member and draw_peptide_length() static method; updates parse_psm return values to include peptide_length; adds "Peptide Length" to REQUIRED_COLS["psm"]
MaxQuant integration
pmultiqc/modules/maxquant/maxquant.py, pmultiqc/modules/maxquant/maxquant_utils.py
Introduces evidence_peptide_length(df) function to compute distribution per raw file; adds plotting call via id_plots.draw_peptide_length_distribution; initializes peptide_length field in evidence data structure
QuantMS integration
pmultiqc/modules/quantms/quantms.py
Adds self.peptide_length dictionary initialization; computes and stores peptide length stats in parsing flow; calls draw_peptide_length_distribution in plotting sequence
DIA-NN module
pmultiqc/modules/diann/diann.py
Imports draw_peptide_length_distribution and integrates peptide_length plotting after long_trends rendering when data is present
Common plotting utilities
pmultiqc/modules/common/plots/id.py
Adds draw_peptide_length_distribution(sub_section, plot_data) function with linegraph visualization, legend, and helptext describing peptide length across platforms (FragPipe, MaxQuant, DIA-NN, quantms); note: function defined twice with identical implementations
mzTab PSM parsing
pmultiqc/modules/common/ms/mztab.py
Adds pep_length column to PSM dataframe by computing sequence string length

Possibly related issues

Possibly related PRs

  • Dev #237 — Modifies QuantMSModule.parse_diann_report (this PR adds peptide_length propagation; that PR alters file-format handling and grouping logic)
  • SDRF integration in more plots  #484 — Modifies parse_diann_report in pmultiqc/modules/common/dia_utils.py (overlapping function modifications)
  • FragPipe first implementation #495 — Modifies FragPipe integration including fragpipe.py and fragpipe_io.py (overlapping PSM/column handling)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

Review effort 3/5

Suggested reviewers

  • yueqixuan
  • daichengxin

Poem

🐰 Hopping through peptides, long and short,
Six modules now report with care,
Length distributions, plotted fair,
From FragPipe to DIA-NN's fort,
A feature blooms, shared everywhere!

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 24.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title "Peptide level distributions" accurately reflects the main change: adding peptide length distribution visualization across multiple proteomics analysis modules.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@ypriverol ypriverol merged commit 61a0b13 into main Jan 18, 2026
21 of 23 checks passed
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe.py`:
- Around line 893-917: The "Peptide Length" column may be string-typed causing
lexicographic ordering; inside draw_peptide_length convert df["Peptide Length"]
to numeric (use pandas.to_numeric with errors='coerce'), drop or dropna invalid
entries, optionally cast to int, then proceed to compute value_counts() and
sort_index() so bins are ordered numerically; update references to df and the
grouping logic in draw_peptide_length to use the cleaned numeric column
(mirroring the approach used in draw_delta_mass() and draw_ids_over_rt()) before
building plot_data for draw_peptide_length_distribution.
🧹 Nitpick comments (4)
pmultiqc/modules/common/ms/mztab.py (1)

138-139: Consider handling potential null values in sequence column.

If the sequence column contains NaN or None values, len(x) will raise a TypeError. Consider adding null-safety:

-        psm["pep_length"] = psm["sequence"].apply(lambda x: len(x))
+        psm["pep_length"] = psm["sequence"].apply(lambda x: len(x) if pd.notna(x) else 0)

Also, note that this column is named pep_length while other modules (MaxQuant, DIA-NN, FragPipe) use length or peptide_length. Consider aligning the naming for consistency across the codebase.

pmultiqc/modules/maxquant/maxquant_utils.py (1)

1235-1249: Remove unused column from validation check.

The function validates that both length and sequence columns exist, but only uses length (and raw file). The sequence column check appears unnecessary.

♻️ Suggested fix
 def evidence_peptide_length(df):
     if any(
             column not in df.columns
-            for column in ["length", "sequence"]
+            for column in ["length", "raw file"]
     ):
         return None
pmultiqc/modules/common/dia_utils.py (1)

323-337: Fix membership test syntax (E713) and add safety for null values.

Per static analysis, use not in instead of not ... in. Also consider handling potential null values in Stripped.Sequence.

♻️ Suggested fix
 def _get_peptide_length(df):
 
-    if not "Stripped.Sequence" in df.columns:
+    if "Stripped.Sequence" not in df.columns:
         return None
 
     df_sub = df[["Run", "Stripped.Sequence"]].copy()
-    df_sub["length"] = df_sub["Stripped.Sequence"].apply(lambda x: len(x))
+    df_sub = df_sub.dropna(subset=["Stripped.Sequence"])
+    df_sub["length"] = df_sub["Stripped.Sequence"].apply(len)
 
     plot_data = {}
     for run, group in df_sub.groupby("Run"):
         stats_dict = group["length"].value_counts().sort_index().to_dict()
         plot_data[run] = stats_dict
 
     return plot_data
pmultiqc/modules/common/plots/id.py (1)

1137-1167: LGTM with minor naming suggestion.

The function is well-structured and follows the established pattern for plotting functions. The configuration and helptext are appropriate.

Consider renaming box_html to line_html for clarity since this uses linegraph.plot, not a box plot:

♻️ Optional naming improvement
-    box_html = linegraph.plot(plot_data, pconfig=draw_config)
-
-    box_html = plot_html_check(box_html)
+    line_html = linegraph.plot(plot_data, pconfig=draw_config)
+
+    line_html = plot_html_check(line_html)
 
     add_sub_section(
         sub_section=sub_section,
-        plot=box_html,
+        plot=line_html,
         order=8,

Comment on lines +893 to +917
# Peptide Length Distribution
@staticmethod
def draw_peptide_length(sub_section, peptide_length: list):

if not peptide_length:
log.warning("No peptide_length data; skipping peptide_length.")
return

df = pd.concat(peptide_length, ignore_index=True)

if df.empty:
log.warning("Peptide Length DataFrame is empty; skipping Peptide Length.")
return

log.info(f"Number of Peptide Length rows in DataFrame: {len(df)}")

plot_data = {}
for sample, group in df.groupby("Run"):
stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict()
plot_data[sample] = stats_dict

draw_peptide_length_distribution(
sub_section=sub_section,
plot_data=plot_data
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file and context around the flagged lines
cd /tmp && find . -type f -name "fragpipe.py" -path "*/modules/fragpipe/*" 2>/dev/null | head -5

Repository: bigbio/pmultiqc

Length of output: 41


🏁 Script executed:

# Search for the file in the repository
find . -type f -name "fragpipe.py" 2>/dev/null | grep -i fragpipe

Repository: bigbio/pmultiqc

Length of output: 98


🏁 Script executed:

# List files to understand structure
git ls-files | grep -i fragpipe | head -20

Repository: bigbio/pmultiqc

Length of output: 264


🏁 Script executed:

# Read the fragpipe.py file and check the relevant section and surrounding context
head -950 pmultiqc/modules/fragpipe/fragpipe.py | tail -150

Repository: bigbio/pmultiqc

Length of output: 5013


🏁 Script executed:

# Search for where peptide_length is populated
rg "peptide_length" pmultiqc/modules/fragpipe/ -n

Repository: bigbio/pmultiqc

Length of output: 1277


🏁 Script executed:

# Look at fragpipe_io.py to understand data parsing
cat -n pmultiqc/modules/fragpipe/fragpipe_io.py | head -200

Repository: bigbio/pmultiqc

Length of output: 8723


🏁 Script executed:

# Check the test PSM file to see what "Peptide Length" values look like
head -20 tests/resources/fragpipe/psm.tsv | cut -f1-20

Repository: bigbio/pmultiqc

Length of output: 5803


🏁 Script executed:

# Get column headers and check for Peptide Length column
head -1 tests/resources/fragpipe/psm.tsv | tr '\t' '\n' | nl

Repository: bigbio/pmultiqc

Length of output: 101


🏁 Script executed:

# Extract just the Peptide Length column from the test file
cut -f $(head -1 tests/resources/fragpipe/psm.tsv | tr '\t' '\n' | grep -n "Peptide Length" | cut -d: -f1) tests/resources/fragpipe/psm.tsv | head -20

Repository: bigbio/pmultiqc

Length of output: 115


🏁 Script executed:

# Check the exact data types when read by psm_reader
python3 << 'EOF'
import pandas as pd

# Simulate what psm_reader does
psm_df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t")

# Check the Peptide Length column type
print(f"Peptide Length dtype: {psm_df['Peptide Length'].dtype}")
print(f"Sample values: {psm_df['Peptide Length'].head(10).tolist()}")

# Test the issue - how value_counts().sort_index() behaves
test_dict = psm_df[["Peptide Length"]].groupby(psm_df["Peptide Length"]).size().to_dict()
print(f"\nvalue_counts().sort_index() result (as dict): {test_dict}")

# If it were string-typed, show the difference
psm_df["Peptide Length"] = psm_df["Peptide Length"].astype(str)
test_dict_str = psm_df[["Peptide Length"]].groupby(psm_df["Peptide Length"]).size().to_dict()
print(f"String-typed sort_index() result: {test_dict_str}")
EOF

Repository: bigbio/pmultiqc

Length of output: 145


🏁 Script executed:

# Check what happens in the actual draw_peptide_length function with the current data
python3 << 'EOF'
import pandas as pd

# Simulate what happens in draw_peptide_length
psm_df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t")

# Extract Run and Peptide Length as done in line 519
peptide_length_data = [psm_df[["Run", "Peptide Length"]].copy()]

# Now simulate the draw_peptide_length function
df = pd.concat(peptide_length_data, ignore_index=True)

print(f"After concat, dtype: {df['Peptide Length'].dtype}")

# Simulate the groupby and value_counts operations
for sample, group in df.groupby("Run"):
    stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict()
    print(f"Sample: {sample}")
    print(f"Stats dict: {stats_dict}")
    print(f"Keys type: {type(list(stats_dict.keys())[0]) if stats_dict else 'N/A'}")
    break  # Just check the first one
EOF

Repository: bigbio/pmultiqc

Length of output: 145


🏁 Script executed:

# Check if there's any explicit dtype conversion in the psm_reader that I might have missed
sed -n '125,138p' pmultiqc/modules/fragpipe/fragpipe_io.py

Repository: bigbio/pmultiqc

Length of output: 480


🏁 Script executed:

# Let me check the actual behavior with a simpler test
python3 << 'EOF'
import pandas as pd
import io

# Create test data exactly like what pandas would read from the TSV
data = """Run	Peptide Length
sample1	10
sample1	8
sample1	7
sample1	10
sample1	11
sample1	2
sample1	1"""

df = pd.read_csv(io.StringIO(data), sep="\t")
print(f"Inferred dtype: {df['Peptide Length'].dtype}")

# Test current behavior
group_dict = df["Peptide Length"].value_counts().sort_index().to_dict()
print(f"Current sort_index() result: {group_dict}")
print(f"Order of keys: {list(group_dict.keys())}")

# Now test if it were string-typed (the concern in the review)
df_str = df.copy()
df_str["Peptide Length"] = df_str["Peptide Length"].astype(str)
print(f"\nString dtype: {df_str['Peptide Length'].dtype}")
group_dict_str = df_str["Peptide Length"].value_counts().sort_index().to_dict()
print(f"String sort_index() result: {group_dict_str}")
print(f"Order of keys (WRONG): {list(group_dict_str.keys())}")
EOF

Repository: bigbio/pmultiqc

Length of output: 177


🏁 Script executed:

# Check if there's any test for the peptide_length function to see how it's actually used
rg "draw_peptide_length" tests/ -A 5 -B 2

Repository: bigbio/pmultiqc

Length of output: 41


🏁 Script executed:

# Check the fragpipe module test file
head -100 tests/test_fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 4112


🏁 Script executed:

# Check how pandas infers the dtype when reading the test PSM file
python3 << 'EOF'
import pandas as pd

# Read the test PSM file as pandas would
df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t", nrows=10)

# Check the dtype of Peptide Length
print(f"Peptide Length dtype: {df['Peptide Length'].dtype}")
print(f"Sample values: {df['Peptide Length'].tolist()}")

# Check what value_counts().sort_index() produces
vc = df["Peptide Length"].value_counts().sort_index()
print(f"\nvalue_counts().sort_index() keys: {list(vc.index)}")
print(f"Keys are numeric: {all(isinstance(k, (int, float)) for k in vc.index)}")
EOF

Repository: bigbio/pmultiqc

Length of output: 177


🏁 Script executed:

# Check how the data flows through the code - look at where peptide_length is populated
sed -n '447,555p' pmultiqc/modules/fragpipe/fragpipe.py | head -120

Repository: bigbio/pmultiqc

Length of output: 3864


🏁 Script executed:

# Check how value_counts and sort_index interact in different scenarios
# Let me look at the exact code pattern more carefully
sed -n '893,920p' pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 908


🏁 Script executed:

# Let's also compare the two similar functions side by side to see the difference
# draw_ids_over_rt (which has numeric conversion) vs draw_peptide_length (which doesn't)
sed -n '925,950p' pmultiqc/modules/fragpipe/fragpipe.py

Repository: bigbio/pmultiqc

Length of output: 780


🏁 Script executed:

# Check if there are any other similar patterns in the file where numeric conversion is used
rg "pd.to_numeric" pmultiqc/modules/fragpipe/fragpipe.py -B 2 -A 2

Repository: bigbio/pmultiqc

Length of output: 348


Ensure peptide lengths are numeric to avoid lexicographic bin ordering.

The "Peptide Length" column can be read as string-typed by pandas, causing sort_index() to order as 1,10,11,2... instead of 1,2,10,11.... This pattern is already used in other similar functions in the same file (e.g., draw_delta_mass(), draw_ids_over_rt()), so apply the same defensive approach here.

Proposed fix
         df = pd.concat(peptide_length, ignore_index=True)
+        df["Peptide Length"] = pd.to_numeric(df["Peptide Length"], errors="coerce")
+        df = df.dropna(subset=["Peptide Length"])
 
         if df.empty:
             log.warning("Peptide Length DataFrame is empty; skipping Peptide Length.")
             return
 
         log.info(f"Number of Peptide Length rows in DataFrame: {len(df)}")
 
         plot_data = {}
         for sample, group in df.groupby("Run"):
-            stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict()
+            stats_dict = (
+                group["Peptide Length"]
+                .astype(int)
+                .value_counts()
+                .sort_index()
+                .to_dict()
+            )
             plot_data[sample] = stats_dict
🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 893 - 917, The "Peptide
Length" column may be string-typed causing lexicographic ordering; inside
draw_peptide_length convert df["Peptide Length"] to numeric (use
pandas.to_numeric with errors='coerce'), drop or dropna invalid entries,
optionally cast to int, then proceed to compute value_counts() and sort_index()
so bins are ordered numerically; update references to df and the grouping logic
in draw_peptide_length to use the cleaned numeric column (mirroring the approach
used in draw_delta_mass() and draw_ids_over_rt()) before building plot_data for
draw_peptide_length_distribution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants