Add Coverage Statistics Calculation Script for Mosdepth #460

bshifaw · 2024-07-26T18:17:08Z

This pull request introduces a new script to calculate coverage statistics from Mosdepth output files. The script processes coverage values, calculates various summary statistics, and outputs the results in a JSON file.

Argument Parsing:
Command-line arguments for input file, coverage column, output prefix, rounding precision, and debug mode.

Statistics Calculation:
Calculation of mean, quartiles, median, interquartile range, standard deviation, mean absolute deviation, percentage of coverage values above 4x, and evenness score.

JSON Output:
Write the calculated statistics to a JSON file.

Changes:
Added docker/lr-mosdepth/coverage_stats.py with functions for argument parsing, file handling, and statistics calculation.
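
For orientation, the argument list above and the `Namespace(...)` line in the example log suggest a parser roughly like the following. This is a minimal sketch, not the script's actual code: the argument names are taken from the log, while the defaults, the description string, and the `build_parser` helper name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the Namespace shown in the example log."""
    parser = argparse.ArgumentParser(
        description="Summarize coverage values from a Mosdepth output file."
    )
    # Positional input file, e.g. test_example.text in the example log.
    parser.add_argument("mosdepth_regions", help="Mosdepth output file to read")
    # 1-based index of the column holding coverage values (default is an assumption).
    parser.add_argument("--cov_col", type=int, default=4)
    # Prefix for the output JSON file (default is an assumption).
    parser.add_argument("--output_prefix", default="coverage")
    # Number of decimal places used when rounding (default is an assumption).
    parser.add_argument("--round", type=int, default=2)
    # Switch for more verbose logging.
    parser.add_argument("--debug", action="store_true")
    return parser


# Parse the same arguments used in the example command.
args = build_parser().parse_args(
    ["--cov_col", "4", "--round", "2",
     "--output_prefix", "test_example.coverage_over_bed", "test_example.text"]
)
print(args)
```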

coverage_stats.py script
Example log:

> python /Users/longreadpipes/docker/lr-mosdepth/coverage_stats.py  --cov_col 4 --round 2 --output_prefix test_example.coverage_over_bed test_example.text  

INFO:root:Arguments: Namespace(mosdepth_regions='test_example.text', cov_col=4, output_prefix='test_example.coverage_over_bed', round=2, debug=False)
INFO:root:Calculating coverage statistics
INFO:root:Opened file: test_example.text
INFO:root:Percentage of coverage values greater than 4x: 1.0
INFO:root:Evenness score: 0.92
INFO:root:Summary statistics: {'mean_cov': 16.0, 'q1_cov': 15.0, 'median_cov': 16.0, 'q3_cov': 17.75, 'iqr_cov': 2.75, 'sstdev_cov': 3.69, 'mad_cov': 2.67, 'percent_above_4x': 1.0, 'evenness_score': 0.92}
INFO:root:Writing summary statistics to file: test_example.coverage_over_bed.cov_stat_summary.json

Example output file

> cat test_example.coverage_over_bed.cov_stat_summary.json  

{"mean_cov": 16.0, "q1_cov": 15.0, "median_cov": 16.0, "q3_cov": 17.75, "iqr_cov": 2.75, "sstdev_cov": 3.69, "mad_cov": 2.67, "percent_above_4x": 1.0, "evenness_score": 0.92}
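
The output file name in the log follows the pattern `<output_prefix>.cov_stat_summary.json`. The sketch below illustrates how several of the reported statistics could be computed and written under that naming. It uses Python's standard `statistics` module instead of the pandas calls in the script so that it runs without dependencies. The input values, the `summarize` helper, and the temporary output location are illustrative assumptions, and the 4x percentage and evenness score are left out.

```python
import json
import statistics
import tempfile
from pathlib import Path


def summarize(values, ndigits=2):
    """Hypothetical helper: basic summary statistics for coverage values."""
    mean_val = statistics.fmean(values)
    # method="inclusive" interpolates linearly between data points,
    # which is also the default behaviour of pandas' Series.quantile.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Mean absolute deviation around the mean.
    mad_val = statistics.fmean(abs(v - mean_val) for v in values)
    return {
        "mean_cov": round(mean_val, ndigits),
        "q1_cov": round(q1, ndigits),
        "median_cov": round(median, ndigits),
        "q3_cov": round(q3, ndigits),
        "iqr_cov": round(q3 - q1, ndigits),
        # Sample standard deviation (n - 1), like pandas' Series.std default.
        "sstdev_cov": round(statistics.stdev(values), ndigits),
        "mad_cov": round(mad_val, ndigits),
    }


coverage = [10, 12, 14, 16, 18, 20, 22]  # made-up coverage values
stats = summarize(coverage)

# Write the result using the <prefix>.cov_stat_summary.json naming from the log.
out_path = Path(tempfile.mkdtemp()) / "example.cov_stat_summary.json"
with open(out_path, "w") as handle:
    json.dump(stats, handle)

print(out_path.read_text())
```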

Copilot

Pull Request Overview

Adds a new Python script to calculate coverage summary statistics from Mosdepth output, updates the Conda environment, and ensures the script is included in the Docker image.

Introduce coverage_stats.py for statistics computation and JSON output
Update environment.yml to include numpy and pandas
Modify Dockerfile to copy the new script into the container

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| docker/lr-mosdepth/environment.yml | Added `numpy` and `pandas` dependencies |
| docker/lr-mosdepth/coverage_stats.py | New script for calculating coverage metrics |
| docker/lr-mosdepth/Dockerfile | Copied `coverage_stats.py` into the image |

Comments suppressed due to low confidence (1)

docker/lr-mosdepth/coverage_stats.py:1

This new script isn’t covered by any unit tests. Add tests for calculate_summary_statistics, percentage_greater_than_4x, and calculate_evenness_score to validate edge cases and typical behavior.

import argparse

Copilot · 2025-06-06T18:39:10Z

docker/lr-mosdepth/coverage_stats.py

+        "cov_mean": round(mean_val, round_precision),
+        "cov_q1": round(cov_data.quantile(0.25), round_precision),
+        "cov_median": round(cov_data.median(), round_precision),
+        "cov_q3": round(cov_data.quantile(0.75), round_precision),
+        "cov_iqr": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision),
+        "cov_stdev": round(cov_data.std(), round_precision),
+        "cov_mad": round(mad_val, round_precision),
+        "cov_percent_above_4x": percentage_greater_than_4x(df, cov_col, round_precision),
+        "cov_evenness_score": calculate_evenness_score(df, cov_col, round_precision)


JSON key names (e.g. cov_mean, cov_q1, etc.) don’t match the example output (mean_cov, q1_cov, etc.). Update the keys to align with the documented format.

Suggested change

-        "cov_mean": round(mean_val, round_precision),
-        "cov_q1": round(cov_data.quantile(0.25), round_precision),
-        "cov_median": round(cov_data.median(), round_precision),
-        "cov_q3": round(cov_data.quantile(0.75), round_precision),
-        "cov_iqr": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision),
-        "cov_stdev": round(cov_data.std(), round_precision),
-        "cov_mad": round(mad_val, round_precision),
-        "cov_percent_above_4x": percentage_greater_than_4x(df, cov_col, round_precision),
-        "cov_evenness_score": calculate_evenness_score(df, cov_col, round_precision)
+        "mean_cov": round(mean_val, round_precision),
+        "q1_cov": round(cov_data.quantile(0.25), round_precision),
+        "median_cov": round(cov_data.median(), round_precision),
+        "q3_cov": round(cov_data.quantile(0.75), round_precision),
+        "iqr_cov": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision),
+        "stdev_cov": round(cov_data.std(), round_precision),
+        "mad_cov": round(mad_val, round_precision),
+        "percent_above_4x_cov": percentage_greater_than_4x(df, cov_col, round_precision),
+        "evenness_score_cov": calculate_evenness_score(df, cov_col, round_precision)

Copilot · 2025-06-06T18:39:11Z

docker/lr-mosdepth/coverage_stats.py

+    # Replace Nan values with null
+    for key, value in statistics.items():
+        if pd.isna(value):
+            statistics[key] = "null"


Assigning the string "null" will produce a JSON string instead of a true null. Use None so that json.dump serializes it as a JSON null.

Suggested change

-    # Replace Nan values with null
-    for key, value in statistics.items():
-        if pd.isna(value):
-            statistics[key] = "null"
+    # Replace NaN values with None
+    for key, value in statistics.items():
+        if pd.isna(value):
+            statistics[key] = None
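
To illustrate the difference this comment describes, here is a small sketch using only the standard library. `math.isnan` stands in for `pd.isna` so the example has no dependencies, and the contents of the `stats` dictionary are made up.

```python
import json
import math

# Made-up statistics where one value could not be computed.
stats = {"mean_cov": 16.0, "sstdev_cov": math.nan}

# Variant 1: the string "null" is serialized as a quoted JSON string.
as_string = {k: ("null" if math.isnan(v) else v) for k, v in stats.items()}
print(json.dumps(as_string))  # {"mean_cov": 16.0, "sstdev_cov": "null"}

# Variant 2: None is serialized as a true JSON null.
as_none = {k: (None if math.isnan(v) else v) for k, v in stats.items()}
print(json.dumps(as_none))    # {"mean_cov": 16.0, "sstdev_cov": null}

# Leaving NaN in place makes json.dumps emit the bare token NaN,
# which is not valid JSON; passing allow_nan=False raises an error instead.
print(json.dumps(stats))      # {"mean_cov": 16.0, "sstdev_cov": NaN}
```

A consumer that parses the second variant gets a real null (`None` in Python), whereas the first variant yields the four-character string `"null"`, which downstream numeric code would have to special-case.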

Copilot · 2025-06-06T18:39:11Z

docker/lr-mosdepth/coverage_stats.py

+    d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1].tolist()
+    # count of coverage values that are less than or equal to the mean coverage
+    d2_count = len(d2)
+    # sum of coverage values that are less than or equal to the mean coverage
+    d2_sum = sum(d2)


Converting the series to a list and then summing in Python is less efficient. Consider using vectorized operations: d2 = cov_data[cov_data <= mean_coverage], then d2_count = d2.size and d2_sum = d2.sum().

Suggested change

-    d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1].tolist()
-    # count of coverage values that are less than or equal to the mean coverage
-    d2_count = len(d2)
-    # sum of coverage values that are less than or equal to the mean coverage
-    d2_sum = sum(d2)
+    d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1]
+    # count of coverage values that are less than or equal to the mean coverage
+    d2_count = d2.size
+    # sum of coverage values that are less than or equal to the mean coverage
+    d2_sum = d2.sum()
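
A quick way to check that the list-based and vectorized forms agree is to run both on a small frame. The sketch below assumes pandas is installed; the sample rows, the column names, and the `cov_data` variable are illustrative and not taken from the script.

```python
import pandas as pd

# Made-up Mosdepth-style rows: chrom, start, end, coverage.
df = pd.DataFrame({
    "chrom": ["chr1"] * 5,
    "start": [0, 100, 200, 300, 400],
    "end": [100, 200, 300, 400, 500],
    "coverage": [10.0, 14.0, 16.0, 18.0, 22.0],
})
cov_col = 4  # 1-based column index, matching the cov_col - 1 indexing above
cov_data = df.iloc[:, cov_col - 1]
mean_coverage = cov_data.mean()

# List-based version from the original code.
d2_list = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1].tolist()
list_count, list_sum = len(d2_list), sum(d2_list)

# Vectorized version: boolean mask on the Series, no Python-level loop.
d2 = cov_data[cov_data <= mean_coverage]
vec_count, vec_sum = d2.size, d2.sum()

print(list_count, list_sum)  # 3 40.0
print(vec_count, vec_sum)    # 3 40.0
```

Both versions select the values 10, 14, and 16 (those at or below the mean of 16), so the counts and sums match; the vectorized form simply avoids materializing a Python list.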

Co-authored-by: Copilot <[email protected]>

Added script to dockerfile to output coverage summary json

72cb659

bshifaw requested a review from SHuang-Broad July 26, 2024 18:17

bshifaw added 7 commits July 26, 2024 14:39

refactor coverage_stats.py

310826f

add reference comment to evenness score

44b6878

"Addition: added numpy to docker dependencies"

686ff54

"Addition: add pandas to environment.yml"

bdc4f4e

"Addition: replace NaN values with 'null'"

13d51b3

"Addition: replace pd NaN values with 'null'"

13e1c11

"Addition: changed single to double quotes

ef3574f

bshifaw requested a review from Copilot June 6, 2025 18:37

Copilot AI reviewed Jun 6, 2025

Update docker/lr-mosdepth/coverage_stats.py

229eb1d

Co-authored-by: Copilot <[email protected]>
