Conversation
Add peptide length distribution
📝 WalkthroughWalkthroughThis PR adds peptide length distribution computation and visualization across multiple proteomics modules (DIA-NN, FragPipe, MaxQuant, QuantMS). It introduces helper functions to extract per-run peptide length statistics from parsed data and adds plotting functions to visualize these distributions. Changes
Possibly related issues
Possibly related PRs
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@pmultiqc/modules/fragpipe/fragpipe.py`:
- Around line 893-917: The "Peptide Length" column may be string-typed causing
lexicographic ordering; inside draw_peptide_length convert df["Peptide Length"]
to numeric (use pandas.to_numeric with errors='coerce'), drop or dropna invalid
entries, optionally cast to int, then proceed to compute value_counts() and
sort_index() so bins are ordered numerically; update references to df and the
grouping logic in draw_peptide_length to use the cleaned numeric column
(mirroring the approach used in draw_delta_mass() and draw_ids_over_rt()) before
building plot_data for draw_peptide_length_distribution.
🧹 Nitpick comments (4)
pmultiqc/modules/common/ms/mztab.py (1)
138-139: Consider handling potential null values insequencecolumn.If the
sequencecolumn containsNaNorNonevalues,len(x)will raise aTypeError. Consider adding null-safety:- psm["pep_length"] = psm["sequence"].apply(lambda x: len(x)) + psm["pep_length"] = psm["sequence"].apply(lambda x: len(x) if pd.notna(x) else 0)Also, note that this column is named
pep_lengthwhile other modules (MaxQuant, DIA-NN, FragPipe) uselengthorpeptide_length. Consider aligning the naming for consistency across the codebase.pmultiqc/modules/maxquant/maxquant_utils.py (1)
1235-1249: Remove unused column from validation check.The function validates that both
lengthandsequencecolumns exist, but only useslength(andraw file). Thesequencecolumn check appears unnecessary.♻️ Suggested fix
def evidence_peptide_length(df): if any( column not in df.columns - for column in ["length", "sequence"] + for column in ["length", "raw file"] ): return Nonepmultiqc/modules/common/dia_utils.py (1)
323-337: Fix membership test syntax (E713) and add safety for null values.Per static analysis, use
not ininstead ofnot ... in. Also consider handling potential null values inStripped.Sequence.♻️ Suggested fix
def _get_peptide_length(df): - if not "Stripped.Sequence" in df.columns: + if "Stripped.Sequence" not in df.columns: return None df_sub = df[["Run", "Stripped.Sequence"]].copy() - df_sub["length"] = df_sub["Stripped.Sequence"].apply(lambda x: len(x)) + df_sub = df_sub.dropna(subset=["Stripped.Sequence"]) + df_sub["length"] = df_sub["Stripped.Sequence"].apply(len) plot_data = {} for run, group in df_sub.groupby("Run"): stats_dict = group["length"].value_counts().sort_index().to_dict() plot_data[run] = stats_dict return plot_datapmultiqc/modules/common/plots/id.py (1)
1137-1167: LGTM with minor naming suggestion.The function is well-structured and follows the established pattern for plotting functions. The configuration and helptext are appropriate.
Consider renaming
box_htmltoline_htmlfor clarity since this useslinegraph.plot, not a box plot:♻️ Optional naming improvement
- box_html = linegraph.plot(plot_data, pconfig=draw_config) - - box_html = plot_html_check(box_html) + line_html = linegraph.plot(plot_data, pconfig=draw_config) + + line_html = plot_html_check(line_html) add_sub_section( sub_section=sub_section, - plot=box_html, + plot=line_html, order=8,
| # Peptide Length Distribution | ||
| @staticmethod | ||
| def draw_peptide_length(sub_section, peptide_length: list): | ||
|
|
||
| if not peptide_length: | ||
| log.warning("No peptide_length data; skipping peptide_length.") | ||
| return | ||
|
|
||
| df = pd.concat(peptide_length, ignore_index=True) | ||
|
|
||
| if df.empty: | ||
| log.warning("Peptide Length DataFrame is empty; skipping Peptide Length.") | ||
| return | ||
|
|
||
| log.info(f"Number of Peptide Length rows in DataFrame: {len(df)}") | ||
|
|
||
| plot_data = {} | ||
| for sample, group in df.groupby("Run"): | ||
| stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict() | ||
| plot_data[sample] = stats_dict | ||
|
|
||
| draw_peptide_length_distribution( | ||
| sub_section=sub_section, | ||
| plot_data=plot_data | ||
| ) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the file and context around the flagged lines
cd /tmp && find . -type f -name "fragpipe.py" -path "*/modules/fragpipe/*" 2>/dev/null | head -5Repository: bigbio/pmultiqc
Length of output: 41
🏁 Script executed:
# Search for the file in the repository
find . -type f -name "fragpipe.py" 2>/dev/null | grep -i fragpipeRepository: bigbio/pmultiqc
Length of output: 98
🏁 Script executed:
# List files to understand structure
git ls-files | grep -i fragpipe | head -20Repository: bigbio/pmultiqc
Length of output: 264
🏁 Script executed:
# Read the fragpipe.py file and check the relevant section and surrounding context
head -950 pmultiqc/modules/fragpipe/fragpipe.py | tail -150Repository: bigbio/pmultiqc
Length of output: 5013
🏁 Script executed:
# Search for where peptide_length is populated
rg "peptide_length" pmultiqc/modules/fragpipe/ -nRepository: bigbio/pmultiqc
Length of output: 1277
🏁 Script executed:
# Look at fragpipe_io.py to understand data parsing
cat -n pmultiqc/modules/fragpipe/fragpipe_io.py | head -200Repository: bigbio/pmultiqc
Length of output: 8723
🏁 Script executed:
# Check the test PSM file to see what "Peptide Length" values look like
head -20 tests/resources/fragpipe/psm.tsv | cut -f1-20Repository: bigbio/pmultiqc
Length of output: 5803
🏁 Script executed:
# Get column headers and check for Peptide Length column
head -1 tests/resources/fragpipe/psm.tsv | tr '\t' '\n' | nlRepository: bigbio/pmultiqc
Length of output: 101
🏁 Script executed:
# Extract just the Peptide Length column from the test file
cut -f $(head -1 tests/resources/fragpipe/psm.tsv | tr '\t' '\n' | grep -n "Peptide Length" | cut -d: -f1) tests/resources/fragpipe/psm.tsv | head -20Repository: bigbio/pmultiqc
Length of output: 115
🏁 Script executed:
# Check the exact data types when read by psm_reader
python3 << 'EOF'
import pandas as pd
# Simulate what psm_reader does
psm_df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t")
# Check the Peptide Length column type
print(f"Peptide Length dtype: {psm_df['Peptide Length'].dtype}")
print(f"Sample values: {psm_df['Peptide Length'].head(10).tolist()}")
# Test the issue - how value_counts().sort_index() behaves
test_dict = psm_df[["Peptide Length"]].groupby(psm_df["Peptide Length"]).size().to_dict()
print(f"\nvalue_counts().sort_index() result (as dict): {test_dict}")
# If it were string-typed, show the difference
psm_df["Peptide Length"] = psm_df["Peptide Length"].astype(str)
test_dict_str = psm_df[["Peptide Length"]].groupby(psm_df["Peptide Length"]).size().to_dict()
print(f"String-typed sort_index() result: {test_dict_str}")
EOFRepository: bigbio/pmultiqc
Length of output: 145
🏁 Script executed:
# Check what happens in the actual draw_peptide_length function with the current data
python3 << 'EOF'
import pandas as pd
# Simulate what happens in draw_peptide_length
psm_df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t")
# Extract Run and Peptide Length as done in line 519
peptide_length_data = [psm_df[["Run", "Peptide Length"]].copy()]
# Now simulate the draw_peptide_length function
df = pd.concat(peptide_length_data, ignore_index=True)
print(f"After concat, dtype: {df['Peptide Length'].dtype}")
# Simulate the groupby and value_counts operations
for sample, group in df.groupby("Run"):
stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict()
print(f"Sample: {sample}")
print(f"Stats dict: {stats_dict}")
print(f"Keys type: {type(list(stats_dict.keys())[0]) if stats_dict else 'N/A'}")
break # Just check the first one
EOFRepository: bigbio/pmultiqc
Length of output: 145
🏁 Script executed:
# Check if there's any explicit dtype conversion in the psm_reader that I might have missed
sed -n '125,138p' pmultiqc/modules/fragpipe/fragpipe_io.pyRepository: bigbio/pmultiqc
Length of output: 480
🏁 Script executed:
# Let me check the actual behavior with a simpler test
python3 << 'EOF'
import pandas as pd
import io
# Create test data exactly like what pandas would read from the TSV
data = """Run Peptide Length
sample1 10
sample1 8
sample1 7
sample1 10
sample1 11
sample1 2
sample1 1"""
df = pd.read_csv(io.StringIO(data), sep="\t")
print(f"Inferred dtype: {df['Peptide Length'].dtype}")
# Test current behavior
group_dict = df["Peptide Length"].value_counts().sort_index().to_dict()
print(f"Current sort_index() result: {group_dict}")
print(f"Order of keys: {list(group_dict.keys())}")
# Now test if it were string-typed (the concern in the review)
df_str = df.copy()
df_str["Peptide Length"] = df_str["Peptide Length"].astype(str)
print(f"\nString dtype: {df_str['Peptide Length'].dtype}")
group_dict_str = df_str["Peptide Length"].value_counts().sort_index().to_dict()
print(f"String sort_index() result: {group_dict_str}")
print(f"Order of keys (WRONG): {list(group_dict_str.keys())}")
EOFRepository: bigbio/pmultiqc
Length of output: 177
🏁 Script executed:
# Check if there's any test for the peptide_length function to see how it's actually used
rg "draw_peptide_length" tests/ -A 5 -B 2Repository: bigbio/pmultiqc
Length of output: 41
🏁 Script executed:
# Check the fragpipe module test file
head -100 tests/test_fragpipe.pyRepository: bigbio/pmultiqc
Length of output: 4112
🏁 Script executed:
# Check how pandas infers the dtype when reading the test PSM file
python3 << 'EOF'
import pandas as pd
# Read the test PSM file as pandas would
df = pd.read_csv("tests/resources/fragpipe/psm.tsv", sep="\t", nrows=10)
# Check the dtype of Peptide Length
print(f"Peptide Length dtype: {df['Peptide Length'].dtype}")
print(f"Sample values: {df['Peptide Length'].tolist()}")
# Check what value_counts().sort_index() produces
vc = df["Peptide Length"].value_counts().sort_index()
print(f"\nvalue_counts().sort_index() keys: {list(vc.index)}")
print(f"Keys are numeric: {all(isinstance(k, (int, float)) for k in vc.index)}")
EOFRepository: bigbio/pmultiqc
Length of output: 177
🏁 Script executed:
# Check how the data flows through the code - look at where peptide_length is populated
sed -n '447,555p' pmultiqc/modules/fragpipe/fragpipe.py | head -120Repository: bigbio/pmultiqc
Length of output: 3864
🏁 Script executed:
# Check how value_counts and sort_index interact in different scenarios
# Let me look at the exact code pattern more carefully
sed -n '893,920p' pmultiqc/modules/fragpipe/fragpipe.pyRepository: bigbio/pmultiqc
Length of output: 908
🏁 Script executed:
# Let's also compare the two similar functions side by side to see the difference
# draw_ids_over_rt (which has numeric conversion) vs draw_peptide_length (which doesn't)
sed -n '925,950p' pmultiqc/modules/fragpipe/fragpipe.pyRepository: bigbio/pmultiqc
Length of output: 780
🏁 Script executed:
# Check if there are any other similar patterns in the file where numeric conversion is used
rg "pd.to_numeric" pmultiqc/modules/fragpipe/fragpipe.py -B 2 -A 2Repository: bigbio/pmultiqc
Length of output: 348
Ensure peptide lengths are numeric to avoid lexicographic bin ordering.
The "Peptide Length" column can be read as string-typed by pandas, causing sort_index() to order as 1,10,11,2... instead of 1,2,10,11.... This pattern is already used in other similar functions in the same file (e.g., draw_delta_mass(), draw_ids_over_rt()), so apply the same defensive approach here.
Proposed fix
df = pd.concat(peptide_length, ignore_index=True)
+ df["Peptide Length"] = pd.to_numeric(df["Peptide Length"], errors="coerce")
+ df = df.dropna(subset=["Peptide Length"])
if df.empty:
log.warning("Peptide Length DataFrame is empty; skipping Peptide Length.")
return
log.info(f"Number of Peptide Length rows in DataFrame: {len(df)}")
plot_data = {}
for sample, group in df.groupby("Run"):
- stats_dict = group["Peptide Length"].value_counts().sort_index().to_dict()
+ stats_dict = (
+ group["Peptide Length"]
+ .astype(int)
+ .value_counts()
+ .sort_index()
+ .to_dict()
+ )
plot_data[sample] = stats_dict🤖 Prompt for AI Agents
In `@pmultiqc/modules/fragpipe/fragpipe.py` around lines 893 - 917, The "Peptide
Length" column may be string-typed causing lexicographic ordering; inside
draw_peptide_length convert df["Peptide Length"] to numeric (use
pandas.to_numeric with errors='coerce'), drop or dropna invalid entries,
optionally cast to int, then proceed to compute value_counts() and sort_index()
so bins are ordered numerically; update references to df and the grouping logic
in draw_peptide_length to use the cleaned numeric column (mirroring the approach
used in draw_delta_mass() and draw_ids_over_rt()) before building plot_data for
draw_peptide_length_distribution.
Pull Request
Description
Brief description of the changes made in this PR.
Type of Change
Summary by CodeRabbit
✏️ Tip: You can customize this high-level summary in your review settings.