
feat: make release grch37 optional #105

Open

stolpeo wants to merge 15 commits into main

Conversation

stolpeo
Copy link

@stolpeo stolpeo commented Jun 16, 2025

  • Update version numbers of mehari, annonars, varfish-server-worker, and viguno

Summary by CodeRabbit

  • New Features

    • Added support for specifying RefSeq or Ensembl annotations in Mehari transcript data outputs.
    • Introduced a new aspect "Past Medical History" for disease integration.
  • Bug Fixes

    • Corrected input directory paths for gnomAD genome and exome frequency data.
  • Improvements

    • Updated multiple data source versions, including ClinVar, HPO, and Mehari transcript data.
    • Enhanced resource management by centralizing thread and memory settings across workflows.
    • Switched download URLs to HTTPS or to more reliable hosts, including new sources for the dbNSFP, dbscSNV, Ensembl, and HGNC datasets.
    • Improved workflow flexibility by dynamically generating input files.
    • Added new package dependencies and updated versions of key bioinformatics tools.
    • Disabled strict format checks in certain data processing steps to improve robustness.
    • Improved filename extraction from URLs during data downloads to handle redirects and headers.
  • Chores

    • Removed obsolete installation script for annonars.
    • Removed annonars installation step from CI workflow.

Contributor

coderabbitai bot commented Jun 16, 2025

Important

Review skipped

More than 25% of the files were skipped due to the max files limit, so the review is being skipped to prevent a low-quality review.

149 of the 257 files are above the max files limit of 100. Please upgrade to the Pro plan for higher limits.

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

This update introduces centralized resource management in Snakemake rules by replacing per-rule environment variable lookups with global THREADS and MEMORY constants. Workflow input file specification is now dynamically generated via a new function. Several tool and data versions are updated, download URLs are modernized to HTTPS or alternative hosts, and minor logic adjustments improve maintainability and consistency.
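For orientation, here is a minimal sketch of the pattern this walkthrough describes, assuming the defaults quoted later in this review (THREADS = 32, MEMORY = 196000, GENOMEBUILDS = "grch37,grch38"); it is illustrative rather than the PR's exact code:

import os

# Genome builds to generate data for; e.g. GENOMEBUILDS=grch38 drops GRCh37.
GENOMEBUILDS = os.environ.get("GENOMEBUILDS", "grch37,grch38").split(",")
# Shared resource constants that replace the former per-rule env-var lookups.
THREADS = int(os.environ.get("THREADS", "32"))
MEMORY = int(os.environ.get("MEMORY", "196000"))  # total mem_mb per job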

Changes

File(s) and change summary:

  • Snakefile: Added global constants GENOMEBUILDS, THREADS, MEMORY; replaced static rule all inputs with a dynamic generate_input_files() function (see the sketch after this list).
  • rules/output/annonars/*.smk, rules/output/mehari/freqs.smk, rules/output/viguno/hpo.smk, rules/reduced/*.smk: Replaced per-rule environment variable thread/memory lookups with global THREADS and MEMORY constants; updated resource declarations accordingly.
  • environment.yml: Upgraded bioinformatics tools (bcftools, htslib, samtools) and VarFish packages (annonars, viguno, mehari, varfish-server-worker); added qsv; removed wget.
  • rules/work/annos/features/tads.smk, rules/work/annos/seqvars/dbnsfp.smk, rules/work/annos/seqvars/dbscsnv.smk: Switched download URLs to HTTPS or replaced FTP with HTTPS; added conditional/static URL selection logic for dbNSFP variants.
  • rules/work/annos/seqvars/gnomad_nuclear.smk: Simplified the input file path function by removing conditional branching and using a single template.
  • rules/work/genes/ensembl.smk, rules/work/genes/hgnc.smk: Updated data download URLs to the May 2024 Ensembl archive and Google Cloud Storage; relaxed the Ensembl version check from error to warning.
  • rules/work/genes/gnomad.smk, rules/work/genes/omim.smk, rules/work/genes/rcnv.smk, rules/work/genes/shet.smk: Added environment variable QSV_SKIP_FORMAT_CHECK=1 to qsv commands to disable format checks during join/select operations.
  • rules/work/genes/mehari_data_tx.smk: Added the {refseq_or_ensembl} wildcard to file naming patterns in the download and copy rules for Mehari transcript data.
  • scripts/genes-integrate-diseases.py: Added new enum member PAST_MEDICAL_HISTORY to the Aspect enum.
  • scripts/genes-orpha-diseases.py: Added a 60-second timeout to the HTTP GET request fetching ORPHAcode cross-reference data.
  • utils/install-annonars.sh: Removed the script that extracted the annonars binary from a Docker image.
  • varfish_db_downloader/versions.py: Updated default version strings for ClinVar, HPO, and Mehari transcript data; added comments clarifying some version choices.
  • download_urls.yml: Updated URLs for dbNSFP and dbscSNV data files to new HTTPS Box.com shared links; updated ClinVar and Mehari URLs; changed the HGNC URL to Google Cloud Storage.
  • varfish_db_downloader/wget.py: Added a function to robustly extract the filename from a URL via HTTP headers or the final redirected URL; updated download_excerpt to use it.
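To make the dynamic-input change concrete, here is a hypothetical sketch of how such a generator can feed rule all; the function body and the target path are illustrative, only the names GENOMEBUILDS and generate_input_files() come from this PR:

def generate_input_files():
    """Collect the rule-all targets for the configured genome builds."""
    inputs = []
    for genomebuild in GENOMEBUILDS:
        # Illustrative target; the real workflow appends many such paths per build.
        inputs.append(f"output/full/annonars/cons-{genomebuild}/rocksdb/IDENTITY")
    return inputs

rule all:
    input:
        generate_input_files(),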

Poem

Oh, what a hop through the fields of code,
Where threads and memory now share their load.
URLs are safer, versions anew,
With wildcards and constants, the workflow grew.
The rabbit leaps, the data flows—
In this garden, everything grows!
🐇✨





Please check for formatting and linting issues. You can use make lint to:

  • Check your Bash formatting with beautysh and lint it with shellcheck.
  • Check your Python formatting with black and lint it with flake8.
  • Check your Snakemake formatting with snakefmt.

@stolpeo stolpeo enabled auto-merge (squash) June 16, 2025 06:29
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 10

🔭 Outside diff range comments (2)
rules/output/mehari/freqs.smk (1)

81-84: ⚠️ Potential issue

Variable typo will break template rendering

--value v_annovars={wildcards.v_annonars} mixes two spellings. Unless the template expects the misspelled key, this will leave the field empty.

-            --value v_annovars={wildcards.v_annonars} \
+            --value v_annonars={wildcards.v_annonars} \
rules/reduced/annonars.smk (1)

33-41: ⚠️ Potential issue

Wildcard name mismatch blocks rule execution

Output patterns use {version_multi} but wildcard_constraints declare versions. Snakemake will fail to resolve the rule.

-        versions=RE_VERSION_MULTI,
+        version_multi=RE_VERSION_MULTI,
♻️ Duplicate comments (5)
rules/output/annonars/gnomad_sv.smk (3)

73-77: Duplicate of the comment above – same consideration applies here.


126-130: Duplicate of the comment above – same consideration applies here.


178-182: Duplicate of the comment above – same consideration applies here.

rules/output/annonars/gnomad_exomes.smk (1)

19-22: Same resource sanity check as for genomes rule

Please verify that the global THREADS / MEMORY constants are initialised to integer values before rule collection starts. Otherwise the scheduler might treat them as strings and ignore the settings.

rules/output/annonars/cadd.smk (1)

57-60: Global resource constants: double-check numeric types

The switch to the central THREADS / MEMORY variables is welcome.
Just make sure they are integers (see earlier comment) so the cluster executor picks them up.

🧹 Nitpick comments (11)
scripts/genes-orpha-diseases.py (1)

39-41: Extract the 60 s timeout into a named constant

The 60-second literal now appears twice (here and at line 43). Pull it up to a module-level constant such as HTTP_TIMEOUT = 60 so that future adjustments only touch one place.

-                        await client.get(URL_ORPHACODE_GET.format(orpha_id), timeout=60)
+                        await client.get(
+                            URL_ORPHACODE_GET.format(orpha_id),
+                            timeout=HTTP_TIMEOUT,
+                        )
rules/work/genes/shet.smk (1)

12-18: Re-evaluate necessity of QSV_SKIP_FORMAT_CHECK=1

The entire Weghorn table is joined against a small x-link TSV. Skipping format validation might mask column drift in future releases.
If the header mismatch is known and acceptable, add a short comment explaining why the check is disabled to aid future maintainers.

rules/work/genes/gnomad.smk (1)

164-170: Inconsistent use of QSV_SKIP_FORMAT_CHECK within the same pipeline

The flag is set for qsv select but not for the subsequent qsv join (lines 173-177). If the input really violates the CSV spec, the join will fail later anyway.
Either:

  1. remove the flag here, or
  2. export it once at the top of the shell block so all qsv invocations behave consistently.
-        QSV_SKIP_FORMAT_CHECK=1 \
-        qsv select {columns_src_str} $TMPDIR/tmp.txt \
+        qsv select {columns_src_str} $TMPDIR/tmp.txt \
rules/work/genes/hgnc.smk (1)

15-18: New HGNC URL works, but add fallback?

Moving to the GCS mirror is fine; however the EBI FTP site has been historically stable. Consider keeping it as a secondary mirror or at least documenting the alternative to avoid future outages.

rules/output/annonars/alphamissense.smk (1)

26-30: Threads constant may be disproportionate for single-file TSV import

THREADS is now global. The AlphaMissense TSV import is single-threaded in annonars (and bounded by I/O). Requesting, say, 96 cores wastes resources and can impact overall throughput on a shared cluster.

Either:

  1. Keep threads hard-coded at 1 for this rule, or
  2. Introduce a second constant (IO_THREADS / LIGHT_THREADS) with a small default.
rules/output/viguno/hpo.smk (1)

19-22: Flat THREADS/MEMORY may yield extremely over-allocated jobs

The rule only performs cp and a small awk, yet can now request the same slots and RAM as heavyweight processing steps. This may clog scheduling queues.

Recommend lowering the footprint, e.g.:

-threads: THREADS
-resources:
-    mem_mb=MEMORY,
+threads: 1
+resources:
+    mem_mb=1024,
rules/output/annonars/cons.smk (1)

15-18: Same memory-scaling concern as other rules

Switching from mem_mb_per_cpu to a fixed mem_mb=MEMORY alters the effective per-CPU memory.
Please confirm that the new constant is sized appropriately or compute it from THREADS.

rules/output/annonars/gnomad_sv.smk (1)

18-22: Global constants sound – double-check actual cluster submission

Good move to centralise THREADS/MEMORY, but make sure the cluster profile you submit to interprets mem_mb (total) rather than mem_mb_per_cpu. Otherwise the requested RAM will differ from before.

rules/output/annonars/gnomad_mtdna.smk (1)

19-22: Fixed memory amount may under- or over-estimate requirements

As with the other annonars rules, replacing mem_mb_per_cpu by a fixed mem_mb changes total RAM allocation. Please verify that MEMORY is tuned to typical input sizes and thread counts.

varfish_db_downloader/versions.py (1)

130-134: Inline comments push lines over 100 chars

The added end-of-line comments on ensembl_38, ensembl, and dbnsfp are the usual culprits for Black violations. Move them above the assignment:

-    ensembl_38="112", # keep at 112 for consistency with mehari
-    ensembl="112", # keep at 112 for consistency with mehari
+    # keep at 112 for consistency with Mehari
+    ensembl_38="112",
+    ensembl="112",
Snakefile (1)

45-49: Consider lowering the default memory value.

The default memory of 196 GB seems quite high and may not be available on many systems. Consider a more conservative default like 32 GB (32000 MB).

-MEMORY = int(os.environ.get("MEMORY", "196000"))
+MEMORY = int(os.environ.get("MEMORY", "32000"))
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dd7c998 and 761e48f.

📒 Files selected for processing (33)
  • Snakefile (3 hunks)
  • environment.yml (1 hunks)
  • rules/output/annonars/alphamissense.smk (1 hunks)
  • rules/output/annonars/cadd.smk (1 hunks)
  • rules/output/annonars/cons.smk (1 hunks)
  • rules/output/annonars/dbnsfp.smk (1 hunks)
  • rules/output/annonars/dbscsnv.smk (1 hunks)
  • rules/output/annonars/dbsnp.smk (1 hunks)
  • rules/output/annonars/gnomad_exomes.smk (1 hunks)
  • rules/output/annonars/gnomad_genomes.smk (1 hunks)
  • rules/output/annonars/gnomad_mtdna.smk (1 hunks)
  • rules/output/annonars/gnomad_sv.smk (4 hunks)
  • rules/output/annonars/helix.smk (1 hunks)
  • rules/output/mehari/freqs.smk (2 hunks)
  • rules/output/viguno/hpo.smk (1 hunks)
  • rules/reduced/annonars.smk (1 hunks)
  • rules/reduced/hpo.smk (1 hunks)
  • rules/reduced/mehari.smk (1 hunks)
  • rules/work/annos/features/tads.smk (1 hunks)
  • rules/work/annos/seqvars/dbnsfp.smk (1 hunks)
  • rules/work/annos/seqvars/dbscsnv.smk (1 hunks)
  • rules/work/annos/seqvars/gnomad_nuclear.smk (1 hunks)
  • rules/work/genes/ensembl.smk (1 hunks)
  • rules/work/genes/gnomad.smk (1 hunks)
  • rules/work/genes/hgnc.smk (1 hunks)
  • rules/work/genes/mehari_data_tx.smk (1 hunks)
  • rules/work/genes/omim.smk (1 hunks)
  • rules/work/genes/rcnv.smk (1 hunks)
  • rules/work/genes/shet.smk (1 hunks)
  • scripts/genes-integrate-diseases.py (1 hunks)
  • scripts/genes-orpha-diseases.py (1 hunks)
  • utils/install-annonars.sh (0 hunks)
  • varfish_db_downloader/versions.py (3 hunks)
💤 Files with no reviewable changes (1)
  • utils/install-annonars.sh
🧰 Additional context used
🪛 GitHub Actions: CI
varfish_db_downloader/versions.py

[error] 1-1: Black formatting check failed. File would be reformatted. Run 'black --line-length 100' to fix code style issues.

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Check-Remote-URLs
🔇 Additional comments (17)
rules/work/genes/omim.smk (1)

44-48: Skipping qsv format checks may hide malformed input

QSV_SKIP_FORMAT_CHECK=1 disables CSV/TSV validation only for this join, but the earlier join at lines 32-39 still runs with checks enabled.
Confirm that the upstream files really contain broken rows that can be ignored; otherwise this setting could silently propagate bad data.

Would you verify that all rows in $TMPDIR/phenotype.hpoa have the same delimiter count? If the issue is limited to a few outliers, pre-cleaning them is safer than suppressing checks globally.

scripts/genes-integrate-diseases.py (1)

147-150: Confirm downstream handling of new Aspect.PAST_MEDICAL_HISTORY value

The new enum member extends the valid aspect set, but downstream code that switches/filters on Aspect (e.g., analytics, JSON serialisation in other scripts, database schemas) might still assume the old closed set.
Please verify that:

  1. All persisted data / DB schemas accept "H" values.
  2. Any match/if statements on Aspect now have a default/else branch or include the new value.
  3. Unit tests cover parsing HPOA rows with "H".

No changes are required here if the above is already covered.
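For context, HPOA aspect values are single-letter codes, and this PR maps "H" (past medical history) to the new member. A hypothetical sketch of the extended enum; members other than PAST_MEDICAL_HISTORY are illustrative, the actual definitions live in scripts/genes-integrate-diseases.py:

import enum

class Aspect(enum.Enum):
    # Pre-existing HPOA aspect codes (illustrative subset).
    PHENOTYPIC_ABNORMALITY = "P"
    INHERITANCE = "I"
    CLINICAL_COURSE = "C"
    MODIFIER = "M"
    # Added in this PR for "H" rows.
    PAST_MEDICAL_HISTORY = "H"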

rules/work/genes/rcnv.smk (1)

26-35: Addition of QSV_SKIP_FORMAT_CHECK=1 looks good

The environment variable is scoped to the pipeline command only and should silence the strict type-checking that previously broke on header mismatches. Nothing else changes; ✔️ from my side.

rules/work/annos/seqvars/dbscsnv.smk (1)

28-31: Box URL may be ephemeral – verify accessibility or cache

Switching from an FTP mirror to Box improves TLS security but Box links are often tokenised and can expire without notice.
Please run a quick HEAD check (or try downloading in CI) and ensure the file is publicly accessible long-term, or mirror it to an immutable location (e.g., Zenodo, GCS bucket).

rules/work/annos/seqvars/dbnsfp.smk (1)

57-65: Box links longevity

As with the dbscSNV rule, ensure both Box URLs are permanent shared links; otherwise CI builds may fail unexpectedly. If permanence cannot be guaranteed, consider hosting the archives on a stable mirror.

rules/reduced/hpo.smk (1)

59-62: Potential semantic change: mem_mb now denotes total memory, not per-CPU

mem_mb_per_cpu → mem_mb changes the meaning from per-thread to aggregate.
With the previous default of 2GB × VIGUNO_SIMULATE_THREADS, the job could rely on ~192 GB when THREADS ≈ 96. Supplying the same literal value via a single MEMORY constant will cut the effective allocation dramatically and may trigger OOM on large workers.

Consider deriving MEMORY from THREADS, or keep the per-CPU semantics:

# In Snakefile
MEMORY = int(os.getenv("MEMORY_PER_CPU_MB", "2000")) * THREADS

or switch back to mem_mb_per_cpu.

Please double-check cluster profiles / executor configs to ensure this rule still receives the intended memory.

environment.yml (1)

39-48: Pin versions consistently & add checksum for new critical tools

Upgrading bcftools/htslib/samtools to 1.21 is fine, but keep them strictly synchronised:

- bcftools =1.21
- htslib  =1.21
- samtools=1.21

Also, the newly added qsv is unpinned. For reproducible builds, pin a specific version tag and optionally provide a SHA256 checksum via the url: syntax.

Please ensure that downstream rules still pass with the 1.21 toolchain (index formats changed in 1.20).

rules/output/mehari/freqs.smk (2)

8-9: Correct input path swap looks good

The genomes / exomes targets now point to the proper directories, fixing the earlier inversion bug. 👍


25-29: Confirm MEMORY represents total memory, not per-CPU

mem_mb_per_cpu was replaced with mem_mb=MEMORY. Many cluster back-ends interpret mem_mb as the total job memory.
If MEMORY still contains a “per-CPU” value (e.g. 2000 MB), the job will be severely under-provisioned (2 GB total instead of 2 GB × threads).
Please verify and adjust either the constant or switch back to mem_mb_per_cpu.

rules/reduced/annonars.smk (1)

41-45: Resource constants adoption looks good

Using the global THREADS and MEMORY keeps resource management consistent across the workflow.

rules/output/annonars/dbscsnv.smk (1)

24-28: Unified resource declaration approved

Switching to THREADS / MEMORY aligns this rule with the new global convention.

rules/output/annonars/dbsnp.smk (1)

15-19: Consistent resource constants

The update is straightforward and correct.

rules/output/annonars/dbnsfp.smk (1)

22-26: Resource constants update OK

No issues spotted with the transition to THREADS / MEMORY.

rules/output/annonars/gnomad_genomes.smk (1)

19-22: Confirm that global THREADS / MEMORY resolve to integers

Swapping the per-rule environment look-ups for the shared constants is fine, but please double-check that THREADS and MEMORY are defined before this file is parsed and that both end up as plain integers (not strings).
Several cluster back-ends will silently ignore non-numeric values and the job may be scheduled with default resources.

If you are unsure, add a quick assertion in the top-level Snakefile:

assert isinstance(THREADS, int) and isinstance(MEMORY, int)
rules/work/genes/ensembl.smk (1)

23-24: Verify archived URL correctness

https://may2024.archive.ensembl.org/biomart/martservice?... looks correct today, but archive sub-domains change over time. Consider capturing the files once and hosting on a stable bucket, or at least validating reachability at workflow start.

rules/work/genes/mehari_data_tx.smk (2)

4-12: LGTM!

The addition of the {refseq_or_ensembl} wildcard correctly extends the file naming pattern to distinguish between RefSeq and Ensembl annotations.


15-26: LGTM!

The copy rule correctly mirrors the download rule's wildcard pattern for consistency.

Comment on lines 15 to +17

         wget --no-check-certificate \
             -O {output.zip} \
-            http://3dgenome.fsm.northwestern.edu/downloads/$name.TADs.zip
+            https://3dgenome.fsm.northwestern.edu/downloads/$name.TADs.zip
Contributor

🛠️ Refactor suggestion

--no-check-certificate defeats the purpose of switching to HTTPS

Now that the download uses https://, leaving --no-check-certificate in place disables all certificate validation and yields the same security level as plain HTTP.
Unless the upstream host really serves an invalid certificate, drop the flag and fail fast on TLS errors.

-        wget --no-check-certificate \
+        wget \
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-        wget --no-check-certificate \
-            -O {output.zip} \
-            https://3dgenome.fsm.northwestern.edu/downloads/$name.TADs.zip
+        wget \
+            -O {output.zip} \
+            https://3dgenome.fsm.northwestern.edu/downloads/$name.TADs.zip
🤖 Prompt for AI Agents
In rules/work/annos/features/tads.smk around lines 15 to 17, the wget command
uses the --no-check-certificate flag which disables TLS certificate validation,
negating the security benefits of HTTPS. Remove the --no-check-certificate flag
from the wget command so that certificate validation is enforced and the
download fails fast on TLS errors.

Comment on lines +19 to 23

     threads: THREADS
     resources:
         runtime=os.environ.get("RUNTIME_ANNONARS_IMPORT", "48h"),
-        mem_mb_per_cpu=2000,
+        mem_mb=MEMORY,
     wildcard_constraints:
Contributor

🛠️ Refactor suggestion

Risk of under-provisioning memory for large annonars imports

Similar to the previous file, replacing mem_mb_per_cpu=2000 with a flat mem_mb=MEMORY changes the contract with the scheduler. Annonars import often requires >100 GB RAM for large datasets; if MEMORY is not scaled with THREADS you may see failures late in the run.

-    mem_mb=MEMORY,
+    mem_mb=int(os.getenv("MEMORY_PER_CPU_MB", "2000")) * THREADS,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     threads: THREADS
     resources:
         runtime=os.environ.get("RUNTIME_ANNONARS_IMPORT", "48h"),
-        mem_mb=MEMORY,
+        mem_mb=int(os.getenv("MEMORY_PER_CPU_MB", "2000")) * THREADS,
     wildcard_constraints:
🤖 Prompt for AI Agents
In rules/output/annonars/helix.smk around lines 19 to 23, the memory allocation
was changed from mem_mb_per_cpu=2000 to a fixed mem_mb=MEMORY, which risks
under-provisioning memory for large annonars imports. To fix this, revert to
using mem_mb_per_cpu scaled by THREADS or ensure MEMORY is dynamically
calculated as THREADS multiplied by the per-CPU memory requirement to provide
sufficient total memory for the job.

Comment on lines 40 to 44

     threads: THREADS
     resources:
         runtime=os.environ.get("RUNTIME_ANNONARS_IMPORT", "48h"),
-        mem_mb_per_cpu=2000,
+        mem_mb=MEMORY
     shell:
Contributor

💡 Verification agent

🧩 Analysis chain

Memory now fixed instead of scaling with thread count – risk of under-/overallocation

Previously the rule allocated mem_mb_per_cpu=2000, i.e. memory scaled linearly with the requested CPU threads.
After the refactor the rule requests a fixed mem_mb=MEMORY while still allowing THREADS CPUs.
If THREADS is left at its former default (96) but MEMORY is, for example, 128 000 MB, each core effectively only gets ~1.3 GB instead of the former 2 GB. Conversely, small thread counts might now over-allocate RAM.

Please ensure that

MEMORY ≈ THREADS * former_per_cpu_mem

or introduce a MEMORY_PER_THREAD constant and compute mem_mb=THREADS * MEMORY_PER_THREAD to preserve the previous behaviour.


I’d like to inspect how THREADS and MEMORY are defined (and whether any mem_mb_per_cpu was used before) to confirm if memory really isn’t scaling anymore. Let me pull the relevant parts of mehari.smk:


🏁 Script executed:

#!/bin/bash
set -eux

# Check if mem_mb_per_cpu is still used anywhere
rg -n "mem_mb_per_cpu"

# Show the resource block in mehari.smk
rg -n "mem_mb=" -A2 rules/reduced/mehari.smk
rg -n "threads:" -A2 rules/reduced/mehari.smk

# Locate where THREADS and MEMORY are defined
rg -n "THREADS" -A2 rules/reduced/mehari.smk
rg -n "MEMORY" -A2 rules/reduced/mehari.smk

# Show the first 200 lines to catch default assignments
head -n 200 rules/reduced/mehari.smk

Length of output: 33


I’ll first locate the Snakemake file and inspect both its resource block and any prior use of mem_mb_per_cpu without exiting on no-matches:


🏁 Script executed:

#!/bin/bash
set -uxo pipefail

# 1. Find the mehari.smk file
FILE=$(find . -type f -name mehari.smk | head -n1)
echo "Found mehari.smk at: $FILE"

# 2. Check for any legacy mem_mb_per_cpu usage
rg -n "mem_mb_per_cpu" . || echo "→ no mem_mb_per_cpu found"

# 3. Show the resource stanza around threads/resources/mem_mb
rg -n "threads:" -A3 "$FILE" || true
rg -n "resources:" -A5 "$FILE" || true
rg -n "mem_mb=" -A3 "$FILE" || true

# 4. Locate where THREADS and MEMORY are defined or defaulted
rg -n "THREADS" -A3 "$FILE" || true
rg -n "MEMORY" -A3 "$FILE" || true

# 5. Show the first 200 lines to catch any defaults or comments
head -n 200 "$FILE"

Length of output: 3404


Ensure memory scales with thread count

We’ve confirmed that mem_mb_per_cpu is no longer used in rules/reduced/mehari.smk (lines 40–44) and the rule now requests a fixed mem_mb=MEMORY regardless of THREADS. This change decouples RAM from CPU count and can lead to under- or over-allocation.

Please restore the previous scaling behavior (2 GB per CPU) by choosing one of the following:

  • Introduce a MEMORY_PER_THREAD = 2000 constant and update the resource block to:

        threads: THREADS
        resources:
    -       mem_mb=MEMORY
    +       mem_mb=THREADS * MEMORY_PER_THREAD

  • Or ensure the default MEMORY value is defined as THREADS * 2000.

• File: rules/reduced/mehari.smk (lines 40–44)

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In rules/reduced/mehari.smk around lines 40 to 44, the memory allocation no
longer scales with the number of CPU threads, causing potential under- or
over-allocation. Fix this by either defining a constant MEMORY_PER_THREAD set to
2000 and updating the resource block to mem_mb=THREADS * MEMORY_PER_THREAD, or
ensure that the MEMORY variable is assigned as THREADS multiplied by 2000 before
being used. This will restore the previous behavior of allocating 2 GB of memory
per CPU thread.

Comment on lines 114 to 129
#if wildcards.version == DV.gnomad_v4:
tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
return [tpl.format(chrom=chrom, **wildcards) for chrom in chroms]
# else:
# if wildcards.kind == "exomes":
# tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.r{version}.sites.{chrom}.liftover_grch38.vcf.bgz"
# return [
# tpl.format(kind=wildcards.kind, version=DV.gnomad_v2, chrom=chrom)
# for chrom in chroms
# ]
# else:
# tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
# return [
# tpl.format(kind=wildcards.kind, version=DV.gnomad_v3, chrom=chrom)
# for chrom in chroms
# ]
Contributor

⚠️ Potential issue

Input path logic for gnomAD-v2 GRCh38 now broken

The simplified implementation unconditionally builds paths of the form

.../vcf/{kind}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz

For GRCh38 v2 exomes the files live under
liftover_grch38/vcf/{kind}/gnomad.{kind}.r{version}.sites.{chrom}.liftover_grch38.vcf.bgz
as still reflected by the dedicated download rule annos_gnomad_nuclear_download_grch38_liftover_v2 above.

Thus the workflow will raise “missing input file” errors for any target requiring gnomAD-v2 exomes/genomes on GRCh38.

 def input_annos_gnomad_nuclear_grch38(wildcards):
@@
-    #if wildcards.version == DV.gnomad_v4:
-    tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
-    return [tpl.format(chrom=chrom, **wildcards) for chrom in chroms]
+    # Handle the three different layouts explicitly
+    if wildcards.version == DV.gnomad_v4:
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
+        )
+    elif wildcards.version == DV.gnomad_v3:
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
+        )
+    else:  # v2 exomes/genomes – liftover files
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.r{version}.sites.{chrom}.liftover_grch38.vcf.bgz"
+        )
+    return [tpl.format(chrom=chrom, **wildcards) for chrom in chroms]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    #if wildcards.version == DV.gnomad_v4:
-    tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
-    return [tpl.format(chrom=chrom, **wildcards) for chrom in chroms]
-    # else:
-    #     if wildcards.kind == "exomes":
-    #         tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.r{version}.sites.{chrom}.liftover_grch38.vcf.bgz"
-    #         return [
-    #             tpl.format(kind=wildcards.kind, version=DV.gnomad_v2, chrom=chrom)
-    #             for chrom in chroms
-    #         ]
-    #     else:
-    #         tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
-    #         return [
-    #             tpl.format(kind=wildcards.kind, version=DV.gnomad_v3, chrom=chrom)
-    #             for chrom in chroms
-    #         ]
+    # Handle the three different layouts explicitly
+    if wildcards.version == DV.gnomad_v4:
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
+        )
+    elif wildcards.version == DV.gnomad_v3:
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
+        )
+    else:  # v2 exomes/genomes – liftover files
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.r{version}.sites.{chrom}.liftover_grch38.vcf.bgz"
+        )
+    return [tpl.format(chrom=chrom, **wildcards) for chrom in chroms]
🤖 Prompt for AI Agents
In rules/work/annos/seqvars/gnomad_nuclear.smk around lines 114 to 129, the
current path logic for gnomAD v2 exomes on GRCh38 is incorrect and causes
missing input file errors. Restore the conditional logic to differentiate v2
exomes by checking if wildcards.kind is "exomes" and use the liftover_grch38
path template with the correct filename pattern including "r{version}" and
"liftover_grch38" suffix. For other cases, use the appropriate templates for v3
or v4 versions as before. This will ensure the correct input paths are generated
for all gnomAD versions and kinds.

Comment on lines 15 to 18
             https://may2024.archive.ensembl.org/pub/current_README
         grep "The current release is Ensembl {DV.ensembl}" $TMPDIR/current_README \
-            || (echo "Ensembl version is not {DV.ensembl}." && exit 1)
+            || (echo "WARNING: Ensembl version is not {DV.ensembl}.")

Contributor

⚠️ Potential issue

Brace expansion may not interpolate DV.ensembl here

shell: strings are not prefixed with f, so Python itself will not substitute
{DV.ensembl}. Snakemake’s shell() formatter only expands {wildcards}, {input},
{output}, {threads}, {resources}, and {params} by default.
DV is a plain global variable and will therefore stay literally {DV.ensembl} inside the
command, resulting in the grep always failing.

Two quick fixes (choose one):

-        grep "The current release is Ensembl {DV.ensembl}" $TMPDIR/current_README \
+        grep "The current release is Ensembl {params.ensembl_ver}" $TMPDIR/current_README \

and then pass the version via params: ensembl_ver=DV.ensembl

or simply turn the whole block into an f-string:

-    shell:
-        r"""
+    shell:
+        fr"""

so normal Python formatting fills in DV.ensembl.
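A minimal sketch of the params-based fix, assuming the surrounding rule in rules/work/genes/ensembl.smk; the rule name and shell context are illustrative:

rule genes_ensembl_check_version:  # hypothetical rule name
    params:
        ensembl_ver=DV.ensembl,  # expanded by Snakemake's shell() formatter
    shell:
        r"""
        grep "The current release is Ensembl {params.ensembl_ver}" $TMPDIR/current_README \
            || echo "WARNING: Ensembl version is not {params.ensembl_ver}."
        """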

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-            https://may2024.archive.ensembl.org/pub/current_README
-        grep "The current release is Ensembl {DV.ensembl}" $TMPDIR/current_README \
-            || (echo "WARNING: Ensembl version is not {DV.ensembl}.")
+    shell:
+        fr"""
+            https://may2024.archive.ensembl.org/pub/current_README
+        grep "The current release is Ensembl {DV.ensembl}" $TMPDIR/current_README \
+            || (echo "WARNING: Ensembl version is not {DV.ensembl}.")
+        """
🤖 Prompt for AI Agents
In rules/work/genes/ensembl.smk around lines 15 to 18, the shell command uses
{DV.ensembl} which is not expanded because Snakemake's shell formatter does not
substitute plain global variables. To fix this, either pass DV.ensembl as a
parameter using params: ensembl_ver=DV.ensembl and reference it as
{params.ensembl_ver} in the shell command, or convert the shell command string
into a Python f-string so that {DV.ensembl} is properly interpolated before
execution.

Comment on lines +14 to 15
CLINVAR_RELEASE = os.environ.get("CLINVAR_RELEASE", "20250410+0.18.5")
#: The ClinVar version to use (part of the tag and release name).
Contributor

⚠️ Potential issue

Pipeline is red – Black re-formatting required

CI failed with “Black formatting check failed”. Please run:

black --line-length 100 varfish_db_downloader/versions.py

and commit the result. No functional issues detected – this is purely styling.

🤖 Prompt for AI Agents
In varfish_db_downloader/versions.py around lines 14 to 15, the code does not
comply with Black formatting standards, causing the CI to fail. Run the command
"black --line-length 100 varfish_db_downloader/versions.py" to automatically
reformat the file according to Black's style guidelines, then commit the updated
file to fix the styling issue.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

♻️ Duplicate comments (1)
Snakefile (1)

90-94: Removal of unused wildcards parameter is spot-on

The function signature is now minimal and the call site was updated accordingly. Nice cleanup.

🧹 Nitpick comments (2)
Snakefile (2)

45-50: Hard-coded resource defaults are overly aggressive

THREADS = 32 and especially MEMORY = 196000 MB (~196 GB) can overwhelm CI runners or laptops when the corresponding env-vars are unset. Consider:

  • Choosing safer fall-backs (e.g. 8 threads, 32000 MB), or
  • Aborting with a descriptive error if the vars are not provided.

This avoids unexpected OOM or oversubscription.


94-97: Normalise genome-build strings to lower-case

If the user exports GENOMEBUILDS="GRCh37", the later checks
"grch37" in genomebuilds will fail. Convert to lower-case once:

-    genomebuilds = GENOMEBUILDS
+    genomebuilds = [gb.lower() for gb in GENOMEBUILDS]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 761e48f and 131cd2a.

📒 Files selected for processing (16)
  • Snakefile (3 hunks)
  • download_urls.yml (1 hunks)
  • rules/output/annonars/cadd.smk (1 hunks)
  • rules/output/annonars/cons.smk (1 hunks)
  • rules/output/annonars/dbnsfp.smk (1 hunks)
  • rules/output/annonars/dbscsnv.smk (1 hunks)
  • rules/output/annonars/gnomad_exomes.smk (1 hunks)
  • rules/output/annonars/gnomad_genomes.smk (1 hunks)
  • rules/output/annonars/gnomad_mtdna.smk (1 hunks)
  • rules/output/annonars/gnomad_sv.smk (4 hunks)
  • rules/output/mehari/freqs.smk (2 hunks)
  • rules/output/viguno/hpo.smk (1 hunks)
  • rules/reduced/annonars.smk (1 hunks)
  • rules/reduced/mehari.smk (1 hunks)
  • rules/work/annos/seqvars/gnomad_nuclear.smk (1 hunks)
  • varfish_db_downloader/versions.py (3 hunks)
✅ Files skipped from review due to trivial changes (2)
  • download_urls.yml
  • rules/work/annos/seqvars/gnomad_nuclear.smk
🚧 Files skipped from review as they are similar to previous changes (13)
  • rules/output/annonars/gnomad_mtdna.smk
  • rules/reduced/mehari.smk
  • rules/reduced/annonars.smk
  • rules/output/annonars/cons.smk
  • rules/output/mehari/freqs.smk
  • rules/output/annonars/cadd.smk
  • rules/output/viguno/hpo.smk
  • rules/output/annonars/gnomad_genomes.smk
  • rules/output/annonars/dbnsfp.smk
  • rules/output/annonars/gnomad_exomes.smk
  • rules/output/annonars/dbscsnv.smk
  • varfish_db_downloader/versions.py
  • rules/output/annonars/gnomad_sv.smk
🔇 Additional comments (1)
Snakefile (1)

180-181: Validate GENOMEBUILDS early to avoid KeyError

Look-ups such as gnomad_versions[genomebuild] assume only grch37 or grch38.
If someone passes another build, the workflow crashes deep inside generate_input_files.
Add an upfront check and fail fast with a clear message.

unknown = set(genomebuilds) - {"grch37", "grch38"}
if unknown:
    raise ValueError(f"Unsupported genome build(s): {', '.join(unknown)}")

Comment on lines +43 to +44
GENOMEBUILDS = os.environ.get("GENOMEBUILDS", "grch37,grch38").split(",")

Contributor

⚠️ Potential issue

Missing import os causes runtime failure

os.environ is referenced here but os is never imported in this file (it’s also used later at line 56). This will raise NameError when the Snakefile is parsed.

-from varfish_db_downloader.versions import (
+import os
+from varfish_db_downloader.versions import (
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-GENOMEBUILDS = os.environ.get("GENOMEBUILDS", "grch37,grch38").split(",")
+import os
+from varfish_db_downloader.versions import (
+    VERSION,
+)
+GENOMEBUILDS = os.environ.get("GENOMEBUILDS", "grch37,grch38").split(",")
🤖 Prompt for AI Agents
In Snakefile around lines 43 to 44, the code uses os.environ but the os module
is not imported, causing a NameError at runtime. Add an import statement for the
os module at the top of the Snakefile to fix this issue.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
excerpt-data/e8a29ff831bf8ee3/hgnc_complete_set.json (1)

1-3: Ignore LFS pointer file in JSON linting
This file is a Git LFS pointer, not actual JSON; Biome's errors here are false positives. Please update your lint configuration (e.g., add excerpt-data/**/*.json to .biomeignore or adjust .gitattributes) so these pointer files are excluded from JSON parsing.
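For reference, a Git LFS pointer file is a three-line plain-text stub of the following form (digest and size shown as placeholders), which is why a JSON parser rejects it:

version https://git-lfs.github.com/spec/v1
oid sha256:<64-hex-digest>
size <byte-count>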

varfish_db_downloader/wget.py (1)

62-64: Use logger instead of raw print for error reporting.

print() bypasses the project’s structured logging and can be swallowed in daemonised contexts.

-        print(f"Error fetching URL: {e}")
+        logger.error("Error fetching URL {}: {}", url, e)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5c819a and 9cefbe3.

⛔ Files ignored due to path filters (12)
  • excerpt-data/0ccc4915e7ecfd38/mehari-data-txs-GRCh37-refseq-0.10.3.bin.zst is excluded by !**/*.zst
  • excerpt-data/18c67f3b60f24667/mehari-data-txs-GRCh38-refseq-0.10.3.bin.zst is excluded by !**/*.zst
  • excerpt-data/1963f3c58ea066be/omim_unmapped_terms.tsv is excluded by !**/*.tsv
  • excerpt-data/5bcb7090fe0e881b/mehari-data-txs-GRCh37-ensembl-0.10.3.bin.zst is excluded by !**/*.zst
  • excerpt-data/615312ce3f5fc1bf/OMIMinDO.tsv is excluded by !**/*.tsv
  • excerpt-data/617bebe58c82f24e/CTD_diseases.tsv.gz is excluded by !**/*.gz
  • excerpt-data/649dbe83da5f50d9/mehari-data-txs-GRCh38-ensembl-0.10.3.bin.zst is excluded by !**/*.zst
  • excerpt-data/b31e9b26b7aeae3f/dbNSFP4.5c.zip is excluded by !**/*.zip
  • excerpt-data/bdc69c1e4cafdfaa/clinvar-data-extract-vars-20240612+0.17.0.tar.gz is excluded by !**/*.gz
  • excerpt-data/c243f918f5fc10d2/dbNSFP4.5a.zip is excluded by !**/*.zip
  • excerpt-data/caba5539b2dac784/clinvar-data-extract-vars-20250410+0.18.5.tar.gz is excluded by !**/*.gz
  • excerpt-data/cbc00fe1a5b22fe8/dbscSNV1.1.zip is excluded by !**/*.zip
📒 Files selected for processing (20)
  • download_urls.yml (3 hunks)
  • excerpt-data/0ccc4915e7ecfd38/url.txt (1 hunks)
  • excerpt-data/111d8c6e08038f62/20 (1 hunks)
  • excerpt-data/18c67f3b60f24667/url.txt (1 hunks)
  • excerpt-data/5bcb7090fe0e881b/url.txt (1 hunks)
  • excerpt-data/649dbe83da5f50d9/url.txt (1 hunks)
  • excerpt-data/652646c24140df2a/mondo.obo (1 hunks)
  • excerpt-data/6f378db589a4bbb9/orphacodes (1 hunks)
  • excerpt-data/91f964d1aa8367a5/20 (1 hunks)
  • excerpt-data/b31e9b26b7aeae3f/url.txt (1 hunks)
  • excerpt-data/bdc69c1e4cafdfaa/url.txt (0 hunks)
  • excerpt-data/c243f918f5fc10d2/url.txt (1 hunks)
  • excerpt-data/c9c7d6df0e24b954/__index__ (1 hunks)
  • excerpt-data/caba5539b2dac784/url.txt (1 hunks)
  • excerpt-data/cbc00fe1a5b22fe8/url.txt (1 hunks)
  • excerpt-data/cdaaf7a3f7595d3d/__index__ (1 hunks)
  • excerpt-data/e8a29ff831bf8ee3/hgnc_complete_set.json (1 hunks)
  • excerpt-data/e8a29ff831bf8ee3/url.txt (1 hunks)
  • excerpt-data/ebc07f725c64907d/__index__ (1 hunks)
  • varfish_db_downloader/wget.py (4 hunks)
💤 Files with no reviewable changes (1)
  • excerpt-data/bdc69c1e4cafdfaa/url.txt
✅ Files skipped from review due to trivial changes (16)
  • excerpt-data/649dbe83da5f50d9/url.txt
  • excerpt-data/91f964d1aa8367a5/20
  • excerpt-data/c9c7d6df0e24b954/__index__
  • excerpt-data/e8a29ff831bf8ee3/url.txt
  • excerpt-data/cbc00fe1a5b22fe8/url.txt
  • excerpt-data/0ccc4915e7ecfd38/url.txt
  • excerpt-data/cdaaf7a3f7595d3d/__index__
  • excerpt-data/b31e9b26b7aeae3f/url.txt
  • excerpt-data/5bcb7090fe0e881b/url.txt
  • excerpt-data/6f378db589a4bbb9/orphacodes
  • excerpt-data/c243f918f5fc10d2/url.txt
  • excerpt-data/18c67f3b60f24667/url.txt
  • excerpt-data/caba5539b2dac784/url.txt
  • excerpt-data/111d8c6e08038f62/20
  • excerpt-data/ebc07f725c64907d/__index__
  • excerpt-data/652646c24140df2a/mondo.obo
🚧 Files skipped from review as they are similar to previous changes (1)
  • download_urls.yml
🧰 Additional context used
🪛 Ruff (0.11.9)
varfish_db_downloader/wget.py

15-15: urllib.parse.urlparse imported but unused

Remove unused import: urllib.parse.urlparse

(F401)

🪛 GitHub Actions: CI
varfish_db_downloader/wget.py

[error] 15-15: flake8: 'urllib.parse.urlparse' imported but unused (F401)

🪛 Biome (1.9.4)
excerpt-data/e8a29ff831bf8ee3/hgnc_complete_set.json

[error] 1-1: String values must be double quoted.

(parse)


[error] 1-1: String values must be double quoted.

(parse)


[error] 1-3: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)


[error] 3-3: End of file expected

Use an array for a sequence of values: [1, 2]

(parse)

⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Check-Remote-URLs

Comment on lines +26 to +68
def get_filename_from_url(url: str) -> str | None:
    """
    Extracts the filename from a URL.

    First, it tries to get the filename from the 'Content-Disposition' header.
    If not found, it falls back to parsing the URL path.

    Args:
        url: The URL of the file to inspect.

    Returns:
        The detected filename as a string, or None if an error occurs
        or no filename can be found.
    """
    try:
        # We use stream=True to avoid downloading the file content
        with requests.get(url, allow_redirects=True, stream=True) as response:
            # Raise an exception for bad status codes (4xx or 5xx)
            response.raise_for_status()

            # 1. Check for Content-Disposition header
            content_disposition = response.headers.get("content-disposition")
            if content_disposition:
                # Parse the header to extract the filename
                _, params = cgi.parse_header(content_disposition)
                if "filename" in params:
                    return params["filename"]

            # 2. If no header, fall back to the URL path
            path = urllib.parse.urlparse(
                response.url
            ).path  # Use response.url to get the final URL after redirects
            filename = os.path.basename(path)
            if filename:
                return filename

    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

    # 3. If all else fails
    return None

Contributor

🛠️ Refactor suggestion

Modernise get_filename_from_url—avoid deprecated cgi, add timeout & cheaper HEAD probe.

  1. cgi.parse_header has been deprecated since 3.11 and will be removed in 3.13 (PEP 594).
  2. A plain GET (even with stream=True) is still heavier than a HEAD; many servers provide the same headers with no body.
  3. No timeout ⇒ risk of hanging download workers.
  4. Handling of RFC 5987 filename*= parameters is missing.
-import cgi
+from email.message import Message
-        with requests.get(url, allow_redirects=True, stream=True) as response:
+        # Cheap probe: try HEAD first, fall back to GET if server forbids it.
+        try:
+            response = requests.head(url, allow_redirects=True, timeout=10)
+            if response.status_code == 405:  # Method Not Allowed
+                response = requests.get(url, allow_redirects=True, stream=True, timeout=10)
+        except requests.exceptions.RequestException:
+            # second attempt with GET already failed (or HEAD failed hard)
+            raise
+        # Use context-manager semantics
+        with response:
-            content_disposition = response.headers.get("content-disposition")
-            if content_disposition:
-                # Parse the header to extract the filename
-                _, params = cgi.parse_header(content_disposition)
-                if "filename" in params:
-                    return params["filename"]
+            cd_header = response.headers.get("content-disposition")
+            if cd_header:
+                msg = Message()
+                msg["content-disposition"] = cd_header
+                params = msg.get_params(header="content-disposition", failobj=[])
+                for key, val in params:
+                    if key in ("filename", "filename*"):
+                        # RFC 5987 handling for filename*=UTF-8''name.ext
+                        return urllib.parse.unquote(val.split("''")[-1])

These tweaks keep us future-proof and more efficient.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

# --- at the top of the file, replace the old import ---
-from cgi import parse_header
+from email.message import Message

def get_filename_from_url(url: str) -> str | None:
    """
    Extracts the filename from a URL.

    First, it tries to get the filename from the 'Content-Disposition' header.
    If not found, it falls back to parsing the URL path.

    Args:
        url: The URL of the file to inspect.

    Returns:
        The detected filename as a string, or None if an error occurs
        or no filename can be found.
    """
    try:
        # Cheap probe: try HEAD first, fall back to GET if server forbids it.
        try:
            response = requests.head(url, allow_redirects=True, timeout=10)
            if response.status_code == 405:  # Method Not Allowed
                response = requests.get(url, allow_redirects=True, stream=True, timeout=10)
        except requests.exceptions.RequestException:
            # second attempt (or HEAD itself) failed hard
            raise

        # Ensure the connection is closed when we're done
        with response:
            # Raise for 4xx/5xx
            response.raise_for_status()

            # 1. Check for Content-Disposition header (with RFC 5987 support)
            cd_header = response.headers.get("content-disposition")
            if cd_header:
                msg = Message()
                msg["content-disposition"] = cd_header
                params = msg.get_params(header="content-disposition", failobj=[])
                for key, val in params:
                    if key in ("filename", "filename*"):
                        # RFC 5987: filename*=UTF-8''name.ext
                        return urllib.parse.unquote(val.split("''")[-1])

            # 2. If no header, fall back to the URL path
            path = urllib.parse.urlparse(response.url).path
            filename = os.path.basename(path)
            if filename:
                return filename

    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

    # 3. If all else fails
    return None
🤖 Prompt for AI Agents
In varfish_db_downloader/wget.py from lines 26 to 68, update
get_filename_from_url to replace the deprecated cgi.parse_header with a modern
alternative like email.message.Message for parsing headers, add a timeout
parameter to the requests call to prevent hanging, switch from using a GET
request to a HEAD request to reduce bandwidth usage while still retrieving
headers, and enhance filename extraction by handling RFC 5987 encoded filename*
parameters in the Content-Disposition header for better compatibility.
