-
Notifications
You must be signed in to change notification settings - Fork 25
Add Coverage Statistics Calculation Script for Mosdepth #460
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Adds a new Python script to calculate coverage summary statistics from Mosdepth output, updates the Conda environment, and ensures the script is included in the Docker image.
- Introduce
coverage_stats.py
for statistics computation and JSON output - Update
environment.yml
to includenumpy
andpandas
- Modify
Dockerfile
to copy the new script into the container
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
docker/lr-mosdepth/environment.yml | Added numpy and pandas dependencies |
docker/lr-mosdepth/coverage_stats.py | New script for calculating coverage metrics |
docker/lr-mosdepth/Dockerfile | Copied coverage_stats.py into the image |
Comments suppressed due to low confidence (1)
docker/lr-mosdepth/coverage_stats.py:1
- This new script isn’t covered by any unit tests. Add tests for
calculate_summary_statistics
,percentage_greater_than_4x
, andcalculate_evenness_score
to validate edge cases and typical behavior.
import argparse
"cov_mean": round(mean_val, round_precision), | ||
"cov_q1": round(cov_data.quantile(0.25), round_precision), | ||
"cov_median": round(cov_data.median(), round_precision), | ||
"cov_q3": round(cov_data.quantile(0.75), round_precision), | ||
"cov_iqr": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision), | ||
"cov_stdev": round(cov_data.std(), round_precision), | ||
"cov_mad": round(mad_val, round_precision), | ||
"cov_percent_above_4x": percentage_greater_than_4x(df, cov_col, round_precision), | ||
"cov_evenness_score": calculate_evenness_score(df, cov_col, round_precision) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
JSON key names (e.g. cov_mean
, cov_q1
, etc.) don’t match the example output (mean_cov
, q1_cov
, etc.). Update the keys to align with the documented format.
"cov_mean": round(mean_val, round_precision), | |
"cov_q1": round(cov_data.quantile(0.25), round_precision), | |
"cov_median": round(cov_data.median(), round_precision), | |
"cov_q3": round(cov_data.quantile(0.75), round_precision), | |
"cov_iqr": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision), | |
"cov_stdev": round(cov_data.std(), round_precision), | |
"cov_mad": round(mad_val, round_precision), | |
"cov_percent_above_4x": percentage_greater_than_4x(df, cov_col, round_precision), | |
"cov_evenness_score": calculate_evenness_score(df, cov_col, round_precision) | |
"mean_cov": round(mean_val, round_precision), | |
"q1_cov": round(cov_data.quantile(0.25), round_precision), | |
"median_cov": round(cov_data.median(), round_precision), | |
"q3_cov": round(cov_data.quantile(0.75), round_precision), | |
"iqr_cov": round(cov_data.quantile(0.75) - cov_data.quantile(0.25), round_precision), | |
"stdev_cov": round(cov_data.std(), round_precision), | |
"mad_cov": round(mad_val, round_precision), | |
"percent_above_4x_cov": percentage_greater_than_4x(df, cov_col, round_precision), | |
"evenness_score_cov": calculate_evenness_score(df, cov_col, round_precision) |
Copilot uses AI. Check for mistakes.
# Replace Nan values with null | ||
for key, value in statistics.items(): | ||
if pd.isna(value): | ||
statistics[key] = "null" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assigning the string "null" will produce a JSON string instead of a true null. Use None
so that json.dump
serializes it as a JSON null.
# Replace Nan values with null | |
for key, value in statistics.items(): | |
if pd.isna(value): | |
statistics[key] = "null" | |
# Replace NaN values with None | |
for key, value in statistics.items(): | |
if pd.isna(value): | |
statistics[key] = None |
Copilot uses AI. Check for mistakes.
d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1].tolist() | ||
# count of coverage values that are less than or equal to the mean coverage | ||
d2_count = len(d2) | ||
# sum of coverage values that are less than or equal to the mean coverage | ||
d2_sum = sum(d2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Converting the series to a list and then summing in Python is less efficient. Consider using vectorized operations: d2 = cov_data[cov_data <= mean_coverage]
, then d2_count = d2.size
and d2_sum = d2.sum()
.
d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1].tolist() | |
# count of coverage values that are less than or equal to the mean coverage | |
d2_count = len(d2) | |
# sum of coverage values that are less than or equal to the mean coverage | |
d2_sum = sum(d2) | |
d2 = df[df.iloc[:, cov_col - 1] <= mean_coverage].iloc[:, cov_col - 1] | |
# count of coverage values that are less than or equal to the mean coverage | |
d2_count = d2.size | |
# sum of coverage values that are less than or equal to the mean coverage | |
d2_sum = d2.sum() |
Copilot uses AI. Check for mistakes.
Co-authored-by: Copilot <[email protected]>
This pull request introduces a new script to calculate coverage statistics from Mosdepth output files. The script processes coverage values, calculates various summary statistics, and outputs the results in a JSON file.
Argument Parsing:
Command-line arguments for input file, coverage column, output prefix, rounding precision, and debug mode.
Statistics Calculation:
Calculation of mean, quartiles, median, interquartile range, standard deviation, mean absolute deviation, percentage of coverage values above 4x, and evenness score.
JSON Output: Write the calculated statistics to a JSON file.
Changes:
Added docker/lr-mosdepth/coverage_stats.py with functions for argument parsing, file handling, and statistics calculation.
coverage_stats.py script
Example log:
Example output file