Skip to content

Conversation

bshifaw
Copy link
Contributor

@bshifaw bshifaw commented Jul 26, 2024

This pull request introduces a new script to calculate coverage statistics from Mosdepth output files. The script processes coverage values, calculates various summary statistics, and outputs the results in a JSON file.

Argument Parsing:
Command-line arguments for input file, coverage column, output prefix, rounding precision, and debug mode.

Statistics Calculation:
Calculation of mean, quartiles, median, interquartile range, standard deviation, mean absolute deviation, percentage of coverage values above 4x, and evenness score.
JSON Output: Write the calculated statistics to a JSON file.

Changes:
Added docker/lr-mosdepth/coverage_stats.py with functions for argument parsing, file handling, and statistics calculation.

coverage_stats.py script
Example log:

> python /Users/longreadpipes/docker/lr-mosdepth/coverage_stats.py  --cov_col 4 --round 2 --output_prefix test_example.coverage_over_bed test_example.text  

INFO:root:Arguments: Namespace(mosdepth_regions='test_example.text', cov_col=4, output_prefix='test_example.coverage_over_bed', round=2, debug=False)
INFO:root:Calculating coverage statistics
INFO:root:Opened file: test_example.text
INFO:root:Percentage of coverage values greater than 4x: 1.0
INFO:root:Evenness score: 0.92
INFO:root:Summary statistics: {'mean_cov': 16.0, 'q1_cov': 15.0, 'median_cov': 16.0, 'q3_cov': 17.75, 'iqr_cov': 2.75, 'sstdev_cov': 3.69, 'mad_cov': 2.67, 'percent_above_4x': 1.0, 'evenness_score': 0.92}
INFO:root:Writing summary statistics to file: test_example.coverage_over_bed.cov_stat_summary.json

Example output file

> cat test_example.coverage_over_bed.cov_stat_summary.json  

{"mean_cov": 16.0, "q1_cov": 15.0, "median_cov": 16.0, "q3_cov": 17.75, "iqr_cov": 2.75, "sstdev_cov": 3.69, "mad_cov": 2.67, "percent_above_4x": 1.0, "evenness_score": 0.92}   

@bshifaw bshifaw requested a review from SHuang-Broad July 26, 2024 18:17
@bshifaw bshifaw requested a review from Copilot June 6, 2025 18:37
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Adds a new Python script to calculate coverage summary statistics from Mosdepth output, updates the Conda environment, and ensures the script is included in the Docker image.

  • Introduce coverage_stats.py for statistics computation and JSON output
  • Update environment.yml to include numpy and pandas
  • Modify Dockerfile to copy the new script into the container

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
docker/lr-mosdepth/environment.yml Added numpy and pandas dependencies
docker/lr-mosdepth/coverage_stats.py New script for calculating coverage metrics
docker/lr-mosdepth/Dockerfile Copied coverage_stats.py into the image
Comments suppressed due to low confidence (1)

docker/lr-mosdepth/coverage_stats.py:1

  • This new script isn’t covered by any unit tests. Add tests for calculate_summary_statistics, percentage_greater_than_4x, and calculate_evenness_score to validate edge cases and typical behavior.
import argparse

Comment on lines +39 to +47
"cov_mean": round(mean_val, round_precision),
"cov_q1": round(cov_data.quantile(0.25), round_precision),
"cov_median": round(cov_data.median(), round_precision),
"cov_q3": round(cov_data.quantile(0.75), round_precision),
"cov_iqr": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision),
"cov_stdev": round(cov_data.std(), round_precision),
"cov_mad": round(mad_val, round_precision),
"cov_percent_above_4x": percentage_greater_than_4x(df, cov_col, round_precision),
"cov_evenness_score": calculate_evenness_score(df, cov_col, round_precision)
Copy link
Preview

Copilot AI Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

JSON key names (e.g. cov_mean, cov_q1, etc.) don’t match the example output (mean_cov, q1_cov, etc.). Update the keys to align with the documented format.

Suggested change
"cov_mean": round(mean_val, round_precision),
"cov_q1": round(cov_data.quantile(0.25), round_precision),
"cov_median": round(cov_data.median(), round_precision),
"cov_q3": round(cov_data.quantile(0.75), round_precision),
"cov_iqr": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision),
"cov_stdev": round(cov_data.std(), round_precision),
"cov_mad": round(mad_val, round_precision),
"cov_percent_above_4x": percentage_greater_than_4x(df, cov_col, round_precision),
"cov_evenness_score": calculate_evenness_score(df, cov_col, round_precision)
"mean_cov": round(mean_val, round_precision),
"q1_cov": round(cov_data.quantile(0.25), round_precision),
"median_cov": round(cov_data.median(), round_precision),
"q3_cov": round(cov_data.quantile(0.75), round_precision),
"iqr_cov": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision),
"stdev_cov": round(cov_data.std(), round_precision),
"mad_cov": round(mad_val, round_precision),
"percent_above_4x_cov": percentage_greater_than_4x(df, cov_col, round_precision),
"evenness_score_cov": calculate_evenness_score(df, cov_col, round_precision)

Copilot uses AI. Check for mistakes.

Comment on lines +50 to +53
# Replace Nan values with null
for key, value in statistics.items():
if pd.isna(value):
statistics[key] = "null"
Copy link
Preview

Copilot AI Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assigning the string "null" will produce a JSON string instead of a true null. Use None so that json.dump serializes it as a JSON null.

Suggested change
# Replace Nan values with null
for key, value in statistics.items():
if pd.isna(value):
statistics[key] = "null"
# Replace NaN values with None
for key, value in statistics.items():
if pd.isna(value):
statistics[key] = None

Copilot uses AI. Check for mistakes.

Comment on lines +93 to +97
d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1].tolist()
# count of coverage values that are less than or equal to the mean coverage
d2_count = len(d2)
# sum of coverage values that are less than or equal to the mean coverage
d2_sum = sum(d2)
Copy link
Preview

Copilot AI Jun 6, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Converting the series to a list and then summing in Python is less efficient. Consider using vectorized operations: d2 = cov_data[cov_data <= mean_coverage], then d2_count = d2.size and d2_sum = d2.sum().

Suggested change
d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1].tolist()
# count of coverage values that are less than or equal to the mean coverage
d2_count = len(d2)
# sum of coverage values that are less than or equal to the mean coverage
d2_sum = sum(d2)
d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1]
# count of coverage values that are less than or equal to the mean coverage
d2_count = d2.size
# sum of coverage values that are less than or equal to the mean coverage
d2_sum = d2.sum()

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant