feat: make release grch37 optional #105
Conversation
**Important: Review skipped.** More than 25% of the files were skipped due to the max-files limit; the review is being skipped to prevent a low-quality review (149 of 257 files are above the max-files limit of 100).

**Walkthrough:** This update introduces centralized resource management in Snakemake rules by replacing per-rule environment variable lookups with global constants.
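For orientation, the new global block in the top-level Snakefile, reconstructed from the snippets quoted later in this review (the exact surrounding code is not part of this page), looks roughly like:

```python
import os

# Global resource constants, overridable via environment variables.
THREADS = int(os.environ.get("THREADS", "32"))
MEMORY = int(os.environ.get("MEMORY", "196000"))  # total megabytes per job
GENOMEBUILDS = os.environ.get("GENOMEBUILDS", "grch37,grch38").split(",")
```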
Please check for formatting and linting issues.
Actionable comments posted: 10
🔭 Outside diff range comments (2)
rules/output/mehari/freqs.smk (1)
Lines 81-84: ⚠️ Potential issue – Variable typo will break template rendering

`--value v_annovars={wildcards.v_annonars}` mixes two spellings. Unless the template expects the misspelled key, this will leave the field empty.

```diff
- --value v_annovars={wildcards.v_annonars} \
+ --value v_annonars={wildcards.v_annonars} \
```

rules/reduced/annonars.smk (1)
Lines 33-41: ⚠️ Potential issue – Wildcard name mismatch blocks rule execution

Output patterns use `{version_multi}` but `wildcard_constraints` declares `versions`. Snakemake will fail to resolve the rule.

```diff
- versions=RE_VERSION_MULTI,
+ version_multi=RE_VERSION_MULTI,
```
♻️ Duplicate comments (5)
rules/output/annonars/gnomad_sv.smk (3)
Lines 73-77: Duplicate of the comment above – the same consideration applies here.
Lines 126-130: Duplicate of the comment above – the same consideration applies here.
Lines 178-182: Duplicate of the comment above – the same consideration applies here.

rules/output/annonars/gnomad_exomes.smk (1)
Lines 19-22: Same resource sanity check as for the genomes rule

Please verify that the global `THREADS`/`MEMORY` constants are initialised to integer values before rule collection starts. Otherwise the scheduler might treat them as strings and ignore the settings.

rules/output/annonars/cadd.smk (1)
Lines 57-60: Global resource constants – double-check numeric types

The switch to the central `THREADS`/`MEMORY` variables is welcome. Just make sure they are integers (see the earlier comment) so that the cluster executor picks them up.
🧹 Nitpick comments (11)
scripts/genes-orpha-diseases.py (1)
Lines 39-41: Extract the 60 s timeout into a named constant

The 60-second literal now appears twice (here and at line 43). Pull it up to a module-level constant such as `HTTP_TIMEOUT = 60` so that future adjustments only touch one place.

```diff
- await client.get(URL_ORPHACODE_GET.format(orpha_id), timeout=60)
+ await client.get(
+     URL_ORPHACODE_GET.format(orpha_id),
+     timeout=HTTP_TIMEOUT,
+ )
```

rules/work/genes/shet.smk (1)
Lines 12-18: Re-evaluate the necessity of `QSV_SKIP_FORMAT_CHECK=1`

The entire Weghorn table is joined against a small x-link TSV. Skipping format validation might mask column drift in future releases. If the header mismatch is known and acceptable, add a short comment explaining why the check is disabled, to aid future maintainers (a sketch follows).
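A sketch of what such a comment could look like inside the rule's shell block; the file and column names in the `qsv join` call are placeholders, not the actual rule body:

```python
shell:
    r"""
    # NB: the upstream table's header trips qsv's strict format check;
    # the data rows themselves are well-formed, so the check is skipped.
    QSV_SKIP_FORMAT_CHECK=1 \
    qsv join symbol weghorn.tsv symbol xlink.tsv > joined.tsv
    """
```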
rules/work/genes/gnomad.smk (1)
Lines 164-170: Inconsistent use of `QSV_SKIP_FORMAT_CHECK` within the same pipeline

The flag is set for `qsv select` but not for the subsequent `qsv join` (lines 173-177). If the input really violates the CSV spec, the join will fail later anyway. Either:

- remove the flag here, or
- export it once at the top of the shell block so that all `qsv` invocations behave consistently.

```diff
- QSV_SKIP_FORMAT_CHECK=1 \
- qsv select {columns_src_str} $TMPDIR/tmp.txt \
+ qsv select {columns_src_str} $TMPDIR/tmp.txt \
```

rules/work/genes/hgnc.smk (1)
Lines 15-18: New HGNC URL works, but add a fallback?

Moving to the GCS mirror is fine; however, the EBI FTP site has been historically stable. Consider keeping it as a secondary mirror, or at least documenting the alternative, to avoid future outages. A possible fallback pattern is sketched below.
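One way to express such a fallback in a download rule; the rule name, output path, and both URLs below are placeholders, since the real ones are not quoted in this review:

```python
rule genes_hgnc_download:  # hypothetical rule name
    output:
        json="work/download/genes/hgnc/hgnc_complete_set.json",
    shell:
        r"""
        # Try the GCS mirror first; fall back to the EBI mirror on failure.
        wget -O {output.json} https://storage.googleapis.com/EXAMPLE/hgnc_complete_set.json \
            || wget -O {output.json} https://ftp.ebi.ac.uk/EXAMPLE/hgnc_complete_set.json
        """
```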
rules/output/annonars/alphamissense.smk (1)
Lines 26-30: Threads constant may be disproportionate for a single-file TSV import

`THREADS` is now global, but the AlphaMissense TSV import is single-threaded in annonars (and is bounded by I/O). Requesting, say, 96 cores wastes resources and can impact overall throughput when sharing a cluster. Either:

- keep `threads` hard-coded at `1` for this rule, or
- introduce a second constant (`IO_THREADS`/`LIGHT_THREADS`) with a small default (see the sketch after this list).
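If the second option is preferred, a minimal sketch of such a constant in the top-level Snakefile (the name and default are suggestions, not part of the PR):

```python
# Small default for I/O-bound, effectively single-threaded rules.
LIGHT_THREADS = int(os.environ.get("LIGHT_THREADS", "2"))
```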
rules/output/viguno/hpo.smk (1)
Lines 19-22: Flat `THREADS`/`MEMORY` may yield extremely over-allocated jobs

The rule only performs a `cp` and a small `awk`, yet it can now request the same slots and RAM as heavyweight processing steps. This may clog scheduling queues. Recommend lowering the footprint, e.g.:

```diff
-threads: THREADS
-resources:
-    mem_mb=MEMORY,
+threads: 1
+resources:
+    mem_mb=1024,
```
rules/output/annonars/cons.smk (1)

Lines 15-18: Same memory-scaling concern as in other rules

Switching from `mem_mb_per_cpu` to a fixed `mem_mb=MEMORY` alters the effective per-CPU memory. Please confirm that the new constant is sized appropriately, or compute it from `THREADS`.
rules/output/annonars/gnomad_sv.smk (1)
Lines 18-22: Global constants are sound – double-check actual cluster submission

Good move to centralise `THREADS`/`MEMORY`, but make sure the cluster profile you submit to interprets `mem_mb` as total memory rather than `mem_mb_per_cpu`. Otherwise the requested RAM will differ from before.
rules/output/annonars/gnomad_mtdna.smk (1)
Lines 19-22: Fixed memory amount may under- or over-estimate requirements

As with the other annonars rules, replacing `mem_mb_per_cpu` with a fixed `mem_mb` changes the total RAM allocation. Please verify that `MEMORY` is tuned to typical input sizes and thread counts.
varfish_db_downloader/versions.py (1)
Lines 130-134: Inline comments push lines over 100 characters

The added end-of-line comments on `ensembl_38`, `ensembl`, and `dbnsfp` are the usual culprits for Black violations. Move them above the assignment:

```diff
- ensembl_38="112",  # keep at 112 for consistency with mehari
- ensembl="112",  # keep at 112 for consistency with mehari
+ # keep at 112 for consistency with Mehari
+ ensembl_38="112",
+ ensembl="112",
```

Snakefile (1)
Lines 45-49: Consider lowering the default memory value

The default memory of 196 GB seems quite high and may not be available on many systems. Consider a more conservative default like 32 GB (32000 MB).

```diff
-MEMORY = int(os.environ.get("MEMORY", "196000"))
+MEMORY = int(os.environ.get("MEMORY", "32000"))
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (33)
- Snakefile (3 hunks)
- environment.yml (1 hunks)
- rules/output/annonars/alphamissense.smk (1 hunks)
- rules/output/annonars/cadd.smk (1 hunks)
- rules/output/annonars/cons.smk (1 hunks)
- rules/output/annonars/dbnsfp.smk (1 hunks)
- rules/output/annonars/dbscsnv.smk (1 hunks)
- rules/output/annonars/dbsnp.smk (1 hunks)
- rules/output/annonars/gnomad_exomes.smk (1 hunks)
- rules/output/annonars/gnomad_genomes.smk (1 hunks)
- rules/output/annonars/gnomad_mtdna.smk (1 hunks)
- rules/output/annonars/gnomad_sv.smk (4 hunks)
- rules/output/annonars/helix.smk (1 hunks)
- rules/output/mehari/freqs.smk (2 hunks)
- rules/output/viguno/hpo.smk (1 hunks)
- rules/reduced/annonars.smk (1 hunks)
- rules/reduced/hpo.smk (1 hunks)
- rules/reduced/mehari.smk (1 hunks)
- rules/work/annos/features/tads.smk (1 hunks)
- rules/work/annos/seqvars/dbnsfp.smk (1 hunks)
- rules/work/annos/seqvars/dbscsnv.smk (1 hunks)
- rules/work/annos/seqvars/gnomad_nuclear.smk (1 hunks)
- rules/work/genes/ensembl.smk (1 hunks)
- rules/work/genes/gnomad.smk (1 hunks)
- rules/work/genes/hgnc.smk (1 hunks)
- rules/work/genes/mehari_data_tx.smk (1 hunks)
- rules/work/genes/omim.smk (1 hunks)
- rules/work/genes/rcnv.smk (1 hunks)
- rules/work/genes/shet.smk (1 hunks)
- scripts/genes-integrate-diseases.py (1 hunks)
- scripts/genes-orpha-diseases.py (1 hunks)
- utils/install-annonars.sh (0 hunks)
- varfish_db_downloader/versions.py (3 hunks)
💤 Files with no reviewable changes (1)
- utils/install-annonars.sh
🧰 Additional context used
🪛 GitHub Actions: CI
varfish_db_downloader/versions.py
[error] 1-1: Black formatting check failed. File would be reformatted. Run 'black --line-length 100' to fix code style issues.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Check-Remote-URLs
🔇 Additional comments (17)
rules/work/genes/omim.smk (1)
Lines 44-48: Skipping qsv format checks may hide malformed input

`QSV_SKIP_FORMAT_CHECK=1` disables CSV/TSV validation only for this join, but the earlier join at lines 32-39 still runs with checks enabled. Confirm that the upstream files really contain broken rows that can be ignored; otherwise this setting could silently propagate bad data.

Would you verify that all rows in `$TMPDIR/phenotype.hpoa` have the same delimiter count? If the issue is limited to a few outliers, pre-cleaning them is safer than suppressing checks globally (one way to run that check is sketched below).
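One quick way to run that check, sketched in Python under the assumption that the HPOA file is tab-separated:

```python
from collections import Counter

with open("phenotype.hpoa") as fh:
    # Count tab-delimited fields per non-comment row.
    counts = Counter(line.count("\t") for line in fh if not line.startswith("#"))

# A single distinct key means every data row has the same delimiter count.
print(counts)
```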
scripts/genes-integrate-diseases.py (1)
Lines 147-150: Confirm downstream handling of the new `Aspect.PAST_MEDICAL_HISTORY` value

The new enum member extends the valid aspect set, but downstream code that switches/filters on `Aspect` (e.g., analytics, JSON serialisation in other scripts, database schemas) might still assume the old closed set. Please verify that:

- all persisted data / DB schemas accept `"H"` values;
- any `match`/`if` statements on `Aspect` now have a default/else branch or include the new value (see the sketch after this list);
- unit tests cover parsing HPOA rows with `"H"`.

No changes are required here if the above is already covered.
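A minimal sketch of the kind of exhaustive handling meant in the second bullet; every member except `PAST_MEDICAL_HISTORY` is a placeholder, since the full `Aspect` definition is not quoted here:

```python
import enum


class Aspect(enum.Enum):
    PHENOTYPIC_ABNORMALITY = "P"  # placeholder pre-existing member
    INHERITANCE = "I"  # placeholder pre-existing member
    PAST_MEDICAL_HISTORY = "H"  # the newly added member


def describe(aspect: Aspect) -> str:
    match aspect:
        case Aspect.PAST_MEDICAL_HISTORY:
            return "past medical history"
        case _:
            # Default branch keeps call sites safe if more members appear.
            return aspect.name.replace("_", " ").lower()
```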
rules/work/genes/rcnv.smk (1)
Lines 26-35: Addition of `QSV_SKIP_FORMAT_CHECK=1` looks good

The environment variable is scoped to the pipeline command only and should silence the strict type-checking that previously broke on header mismatches. Nothing else changes; ✔️ from my side.
rules/work/annos/seqvars/dbscsnv.smk (1)
Lines 28-31: Box URL may be ephemeral – verify accessibility or cache

Switching from an FTP mirror to Box improves TLS security, but Box links are often tokenised and can expire without notice. Please run a quick HEAD check (or try downloading in CI) and ensure the file is publicly accessible long-term, or mirror it to an immutable location (e.g., Zenodo, a GCS bucket). A possible probe is sketched below.
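A possible probe using `requests` (already used by `varfish_db_downloader/wget.py`); the URL is a placeholder:

```python
import requests

# Fails loudly in CI if the shared link stops resolving.
resp = requests.head(
    "https://example.box.com/shared/static/PLACEHOLDER.zip",
    allow_redirects=True,
    timeout=30,
)
resp.raise_for_status()
```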
rules/work/annos/seqvars/dbnsfp.smk (1)
Lines 57-65: Box links longevity

As with the dbscSNV rule, ensure both Box URLs are permanent shared links; otherwise CI builds may fail unexpectedly. If permanence cannot be guaranteed, consider hosting the archives on a stable mirror.

rules/reduced/hpo.smk (1)
rules/reduced/hpo.smk (1)
59-62
: Potential semantic change:mem_mb
now denotes total memory, not per-CPU
mem_mb_per_cpu
→mem_mb
changes the meaning from per thread to aggregate.
With the previous default of2GB
×VIGUNO_SIMULATE_THREADS
, the job could rely on ~192 GB whenTHREADS ≈ 96
. Supplying the same literal value via a singleMEMORY
constant will cut the effective allocation dramatically and may trigger OOM on large workers.Consider deriving
MEMORY
fromTHREADS
, or keep the per-CPU semantics:# In Snakefile MEMORY = int(os.getenv("MEMORY_PER_CPU_MB", "2000")) * THREADSor switch back to
mem_mb_per_cpu
.Please double-check cluster profiles / executor configs to ensure this rule still receives the intended memory.
environment.yml (1)
39-48
: Pin versions consistently & add checksum for new critical toolsUpgrading
bcftools/htslib/samtools
to 1.21 is fine, but keep them strictly synchronised:- bcftools =1.21 - htslib =1.21 - samtools=1.21Also, the newly added
qsv
is unpinned. For reproducible builds pin a specific version tag and optionally provide an SHA256 checksum via theurl:
syntax.Please ensure that downstream rules still pass with the 1.21 toolchain (index formats changed in 1.20).
rules/output/mehari/freqs.smk (2)
8-9
: Correct input path swap looks goodThe genomes / exomes targets now point to the proper directories, fixing the earlier inversion bug. 👍
25-29
: ConfirmMEMORY
represents total memory, not per-CPU
mem_mb_per_cpu
was replaced withmem_mb=MEMORY
. Many cluster back-ends interpretmem_mb
as the total job memory.
IfMEMORY
still contains a “per-CPU” value (e.g. 2000 MB), the job will be severely under-provisioned (2 GB total instead of 2 GB × threads).
Please verify and adjust either the constant or switch back tomem_mb_per_cpu
.rules/reduced/annonars.smk (1)
41-45
: Resource constants adoption looks goodUsing the global
THREADS
andMEMORY
keeps resource management consistent across the workflow.rules/output/annonars/dbscsnv.smk (1)
Lines 24-28: Unified resource declaration approved

Switching to `THREADS`/`MEMORY` aligns this rule with the new global convention.

rules/output/annonars/dbsnp.smk (1)
Lines 15-19: Consistent resource constants

The update is straightforward and correct.

rules/output/annonars/dbnsfp.smk (1)
rules/output/annonars/dbnsfp.smk (1)
22-26
: Resource constants update OKNo issues spotted with the transition to
THREADS
/MEMORY
.rules/output/annonars/gnomad_genomes.smk (1)
19-22
: Confirm that globalTHREADS
/MEMORY
resolve to integersSwapping the per-rule environment look-ups for the shared constants is fine, but please double-check that
THREADS
andMEMORY
are defined before this file is parsed and that both end up as plain integers (not strings).
Several cluster back-ends will silently ignore non-numeric values and the job may be scheduled with default resources.If you are unsure, add a quick assertion in the top-level
Snakefile
:assert isinstance(THREADS, int) and isinstance(MEMORY, int)rules/work/genes/ensembl.smk (1)
Lines 23-24: Verify archived URL correctness

`https://may2024.archive.ensembl.org/biomart/martservice?...` looks correct today, but archive sub-domains change over time. Consider capturing the files once and hosting them on a stable bucket, or at least validating reachability at workflow start.

rules/work/genes/mehari_data_tx.smk (2)
Lines 4-12: LGTM!

The addition of the `{refseq_or_ensembl}` wildcard correctly extends the file naming pattern to distinguish between RefSeq and Ensembl annotations.
Lines 15-26: LGTM!

The copy rule correctly mirrors the download rule's wildcard pattern for consistency.
```diff
 wget --no-check-certificate \
     -O {output.zip} \
-    http://3dgenome.fsm.northwestern.edu/downloads/$name.TADs.zip
+    https://3dgenome.fsm.northwestern.edu/downloads/$name.TADs.zip
```
🛠️ Refactor suggestion
`--no-check-certificate` defeats the purpose of switching to HTTPS

Now that the download uses `https://`, leaving `--no-check-certificate` in place disables all certificate validation and yields the same security level as plain HTTP. Unless the upstream host really serves an invalid certificate, drop the flag and fail fast on TLS errors.

```diff
- wget --no-check-certificate \
+ wget \
```
🤖 Prompt for AI Agents
In rules/work/annos/features/tads.smk around lines 15 to 17, the wget command
uses the --no-check-certificate flag which disables TLS certificate validation,
negating the security benefits of HTTPS. Remove the --no-check-certificate flag
from the wget command so that certificate validation is enforced and the
download fails fast on TLS errors.
```diff
 threads: THREADS
 resources:
     runtime=os.environ.get("RUNTIME_ANNONARS_IMPORT", "48h"),
-    mem_mb_per_cpu=2000,
+    mem_mb=MEMORY,
 wildcard_constraints:
```
🛠️ Refactor suggestion
Risk of under-provisioning memory for large annonars imports
Similar to the previous file, replacing `mem_mb_per_cpu=2000` with a flat `mem_mb=MEMORY` changes the contract with the scheduler. Annonars import often requires >100 GB RAM for large datasets; if `MEMORY` is not scaled with `THREADS`, you may see failures late in the run.

```diff
- mem_mb=MEMORY,
+ mem_mb=int(os.getenv("MEMORY_PER_CPU_MB", "2000")) * THREADS,
```
🤖 Prompt for AI Agents
In rules/output/annonars/helix.smk around lines 19 to 23, the memory allocation
was changed from mem_mb_per_cpu=2000 to a fixed mem_mb=MEMORY, which risks
under-provisioning memory for large annonars imports. To fix this, revert to
using mem_mb_per_cpu scaled by THREADS or ensure MEMORY is dynamically
calculated as THREADS multiplied by the per-CPU memory requirement to provide
sufficient total memory for the job.
rules/reduced/mehari.smk (Outdated)
```diff
 threads: THREADS
 resources:
     runtime=os.environ.get("RUNTIME_ANNONARS_IMPORT", "48h"),
-    mem_mb_per_cpu=2000,
+    mem_mb=MEMORY
 shell:
```
💡 Verification agent
🧩 Analysis chain
Memory now fixed instead of scaling with thread count – risk of under-/overallocation
Previously the rule allocated `mem_mb_per_cpu=2000`, i.e. memory scaled linearly with the requested CPU threads. After the refactor the rule requests a fixed `mem_mb=MEMORY` while still allowing `THREADS` CPUs. If `THREADS` is left at its former default (96) but `MEMORY` is, for example, 128 000 MB, each core effectively only gets ~1.3 GB instead of the former 2 GB. Conversely, small thread counts might now over-allocate RAM.

Please ensure that `MEMORY ≈ THREADS * former_per_cpu_mem`, or introduce a `MEMORY_PER_THREAD` constant and compute `mem_mb=THREADS * MEMORY_PER_THREAD` to preserve the previous behaviour.

I’d like to inspect how `THREADS` and `MEMORY` are defined (and whether any `mem_mb_per_cpu` was used before) to confirm whether memory really isn’t scaling anymore. Let me pull the relevant parts of mehari.smk:
🏁 Script executed:
```bash
#!/bin/bash
set -eux
# Check if mem_mb_per_cpu is still used anywhere
rg -n "mem_mb_per_cpu"
# Show the resource block in mehari.smk
rg -n "mem_mb=" -A2 rules/reduced/mehari.smk
rg -n "threads:" -A2 rules/reduced/mehari.smk
# Locate where THREADS and MEMORY are defined
rg -n "THREADS" -A2 rules/reduced/mehari.smk
rg -n "MEMORY" -A2 rules/reduced/mehari.smk
# Show the first 200 lines to catch default assignments
head -n 200 rules/reduced/mehari.smk
```
Length of output: 33
I’ll first locate the Snakemake file and inspect both its resource block and any prior use of `mem_mb_per_cpu`, without exiting on no-matches:
🏁 Script executed:
```bash
#!/bin/bash
set -uxo pipefail
# 1. Find the mehari.smk file
FILE=$(find . -type f -name mehari.smk | head -n1)
echo "Found mehari.smk at: $FILE"
# 2. Check for any legacy mem_mb_per_cpu usage
rg -n "mem_mb_per_cpu" . || echo "→ no mem_mb_per_cpu found"
# 3. Show the resource stanza around threads/resources/mem_mb
rg -n "threads:" -A3 "$FILE" || true
rg -n "resources:" -A5 "$FILE" || true
rg -n "mem_mb=" -A3 "$FILE" || true
# 4. Locate where THREADS and MEMORY are defined or defaulted
rg -n "THREADS" -A3 "$FILE" || true
rg -n "MEMORY" -A3 "$FILE" || true
# 5. Show the first 200 lines to catch any defaults or comments
head -n 200 "$FILE"
```
Length of output: 3404
Ensure memory scales with thread count
We’ve confirmed that `mem_mb_per_cpu` is no longer used in rules/reduced/mehari.smk (lines 40–44) and that the rule now requests a fixed `mem_mb=MEMORY` regardless of `THREADS`. This change decouples RAM from CPU count and can lead to under- or over-allocation. Please restore the previous scaling behaviour (2 GB per CPU) by choosing one of the following:

- Introduce a `MEMORY_PER_THREAD = 2000` constant and update the resource block:

```diff
 threads: THREADS
 resources:
-    mem_mb=MEMORY
+    mem_mb=THREADS * MEMORY_PER_THREAD
```

- Or ensure the default `MEMORY` value is defined as `THREADS * 2000`.
• File: rules/reduced/mehari.smk (lines 40–44)
Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In rules/reduced/mehari.smk around lines 40 to 44, the memory allocation no
longer scales with the number of CPU threads, causing potential under- or
over-allocation. Fix this by either defining a constant MEMORY_PER_THREAD set to
2000 and updating the resource block to mem_mb=THREADS * MEMORY_PER_THREAD, or
ensure that the MEMORY variable is assigned as THREADS multiplied by 2000 before
being used. This will restore the previous behavior of allocating 2 GB of memory
per CPU thread.
```python
    #if wildcards.version == DV.gnomad_v4:
    tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
    return [tpl.format(chrom=chrom, **wildcards) for chrom in chroms]
    # else:
    #     if wildcards.kind == "exomes":
    #         tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.r{version}.sites.{chrom}.liftover_grch38.vcf.bgz"
    #         return [
    #             tpl.format(kind=wildcards.kind, version=DV.gnomad_v2, chrom=chrom)
    #             for chrom in chroms
    #         ]
    #     else:
    #         tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
    #         return [
    #             tpl.format(kind=wildcards.kind, version=DV.gnomad_v3, chrom=chrom)
    #             for chrom in chroms
    #         ]
```
Input path logic for gnomAD-v2 GRCh38 now broken
The simplified implementation unconditionally builds paths of the form `.../vcf/{kind}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz`. For GRCh38 v2 exomes the files live under `liftover_grch38/vcf/{kind}/gnomad.{kind}.r{version}.sites.{chrom}.liftover_grch38.vcf.bgz`, as still reflected by the dedicated download rule `annos_gnomad_nuclear_download_grch38_liftover_v2` above. Thus the workflow will raise “missing input file” errors for any target requiring gnomAD-v2 exomes/genomes on GRCh38.
```diff
 def input_annos_gnomad_nuclear_grch38(wildcards):
 @@
-    #if wildcards.version == DV.gnomad_v4:
-    tpl = "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
-    return [tpl.format(chrom=chrom, **wildcards) for chrom in chroms]
+    # Handle the three different layouts explicitly
+    if wildcards.version == DV.gnomad_v4:
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
+        )
+    elif wildcards.version == DV.gnomad_v3:
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.v{version}.sites.chr{chrom}.vcf.bgz"
+        )
+    else:  # v2 exomes/genomes – liftover files
+        tpl = (
+            "work/download/annos/grch38/seqvars/gnomad_{kind}/{version}/"
+            "gnomad.{kind}.r{version}.sites.{chrom}.liftover_grch38.vcf.bgz"
+        )
+    return [tpl.format(chrom=chrom, **wildcards) for chrom in chroms]
```
🤖 Prompt for AI Agents
In rules/work/annos/seqvars/gnomad_nuclear.smk around lines 114 to 129, the
current path logic for gnomAD v2 exomes on GRCh38 is incorrect and causes
missing input file errors. Restore the conditional logic to differentiate v2
exomes by checking if wildcards.kind is "exomes" and use the liftover_grch38
path template with the correct filename pattern including "r{version}" and
"liftover_grch38" suffix. For other cases, use the appropriate templates for v3
or v4 versions as before. This will ensure the correct input paths are generated
for all gnomAD versions and kinds.
rules/work/genes/ensembl.smk (Outdated)
```diff
     https://may2024.archive.ensembl.org/pub/current_README
     grep "The current release is Ensembl {DV.ensembl}" $TMPDIR/current_README \
-        || (echo "Ensembl version is not {DV.ensembl}." && exit 1)
+        || (echo "WARNING: Ensembl version is not {DV.ensembl}.")
```
Brace expansion may not interpolate `DV.ensembl` here

`shell:` strings are not prefixed with `f`, so Python itself will not substitute `{DV.ensembl}`. Snakemake’s `shell()` formatter only expands `{wildcards}`, `{input}`, `{output}`, `{threads}`, `{resources}`, and `{params}` by default. `DV` is a plain global variable and will therefore stay literally `{DV.ensembl}` inside the command, resulting in the grep always failing.

Two quick fixes (choose one):

```diff
- grep "The current release is Ensembl {DV.ensembl}" $TMPDIR/current_README \
+ grep "The current release is Ensembl {params.ensembl_ver}" $TMPDIR/current_README \
```

and then pass the version via `params: ensembl_ver=DV.ensembl`, or simply turn the whole block into an f-string:

```diff
- shell:
-     r"""
+ shell:
+     fr"""
```

so normal Python formatting fills in `DV.ensembl`.
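Putting the first fix together, a sketch of the `params:` variant; the rule name is hypothetical and the body is abridged to the quoted lines:

```python
rule genes_ensembl_check_version:  # hypothetical rule name
    params:
        ensembl_ver=DV.ensembl,
    shell:
        r"""
        grep "The current release is Ensembl {params.ensembl_ver}" $TMPDIR/current_README \
            || (echo "WARNING: Ensembl version is not {params.ensembl_ver}.")
        """
```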
🤖 Prompt for AI Agents
In rules/work/genes/ensembl.smk around lines 15 to 18, the shell command uses
{DV.ensembl} which is not expanded because Snakemake's shell formatter does not
substitute plain global variables. To fix this, either pass DV.ensembl as a
parameter using params: ensembl_ver=DV.ensembl and reference it as
{params.ensembl_ver} in the shell command, or convert the shell command string
into a Python f-string so that {DV.ensembl} is properly interpolated before
execution.
```python
CLINVAR_RELEASE = os.environ.get("CLINVAR_RELEASE", "20250410+0.18.5")
#: The ClinVar version to use (part of the tag and release name).
```
Pipeline is red – Black re-formatting required
CI failed with “Black formatting check failed”. Please run:

```bash
black --line-length 100 varfish_db_downloader/versions.py
```

and commit the result. No functional issues detected – this is purely styling.
🤖 Prompt for AI Agents
In varfish_db_downloader/versions.py around lines 14 to 15, the code does not
comply with Black formatting standards, causing the CI to fail. Run the command
"black --line-length 100 varfish_db_downloader/versions.py" to automatically
reformat the file according to Black's style guidelines, then commit the updated
file to fix the styling issue.
Actionable comments posted: 1
♻️ Duplicate comments (1)
Snakefile (1)
Lines 90-94: Removal of the unused `wildcards` parameter is spot-on

The function signature is now minimal and the call site was updated accordingly. Nice cleanup.
🧹 Nitpick comments (2)
Snakefile (2)
Lines 45-50: Hard-coded resource defaults are overly aggressive

`THREADS = 32` and especially `MEMORY = 196000` MB (~196 GB) can overwhelm CI runners or laptops when the corresponding env-vars are unset. Consider:

- choosing safer fall-backs (e.g. 8 threads, 32000 MB), or
- aborting with a descriptive error if the vars are not provided (see the sketch below).

This avoids unexpected OOM or oversubscription.
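A sketch of the fail-fast variant in the Snakefile; the error type and message are suggestions:

```python
# Fail fast instead of silently assuming aggressive defaults.
for var in ("THREADS", "MEMORY"):
    if var not in os.environ:
        raise ValueError(f"Please set {var} before running the workflow.")

THREADS = int(os.environ["THREADS"])
MEMORY = int(os.environ["MEMORY"])
```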
Lines 94-97: Normalise genome-build strings to lower-case

If the user exports `GENOMEBUILDS="GRCh37"`, later checks such as `"grch37" in genomebuilds` will fail. Convert to lower-case once:

```diff
- genomebuilds = GENOMEBUILDS
+ genomebuilds = [gb.lower() for gb in GENOMEBUILDS]
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
- Snakefile (3 hunks)
- download_urls.yml (1 hunks)
- rules/output/annonars/cadd.smk (1 hunks)
- rules/output/annonars/cons.smk (1 hunks)
- rules/output/annonars/dbnsfp.smk (1 hunks)
- rules/output/annonars/dbscsnv.smk (1 hunks)
- rules/output/annonars/gnomad_exomes.smk (1 hunks)
- rules/output/annonars/gnomad_genomes.smk (1 hunks)
- rules/output/annonars/gnomad_mtdna.smk (1 hunks)
- rules/output/annonars/gnomad_sv.smk (4 hunks)
- rules/output/mehari/freqs.smk (2 hunks)
- rules/output/viguno/hpo.smk (1 hunks)
- rules/reduced/annonars.smk (1 hunks)
- rules/reduced/mehari.smk (1 hunks)
- rules/work/annos/seqvars/gnomad_nuclear.smk (1 hunks)
- varfish_db_downloader/versions.py (3 hunks)
✅ Files skipped from review due to trivial changes (2)
- download_urls.yml
- rules/work/annos/seqvars/gnomad_nuclear.smk
🚧 Files skipped from review as they are similar to previous changes (13)
- rules/output/annonars/gnomad_mtdna.smk
- rules/reduced/mehari.smk
- rules/reduced/annonars.smk
- rules/output/annonars/cons.smk
- rules/output/mehari/freqs.smk
- rules/output/annonars/cadd.smk
- rules/output/viguno/hpo.smk
- rules/output/annonars/gnomad_genomes.smk
- rules/output/annonars/dbnsfp.smk
- rules/output/annonars/gnomad_exomes.smk
- rules/output/annonars/dbscsnv.smk
- varfish_db_downloader/versions.py
- rules/output/annonars/gnomad_sv.smk
🔇 Additional comments (1)
Snakefile (1)
Lines 180-181: Validate `GENOMEBUILDS` early to avoid `KeyError`

Look-ups such as `gnomad_versions[genomebuild]` assume only `grch37` or `grch38`. If someone passes another build, the workflow crashes deep inside `generate_input_files`. Add an upfront check and fail fast with a clear message:

```python
unknown = set(genomebuilds) - {"grch37", "grch38"}
if unknown:
    raise ValueError(f"Unsupported genome build(s): {', '.join(unknown)}")
```
```python
GENOMEBUILDS = os.environ.get("GENOMEBUILDS", "grch37,grch38").split(",")
```
Missing `import os` causes runtime failure

`os.environ` is referenced here but `os` is never imported in this file (it’s also used later at line 56). This will raise `NameError` when the Snakefile is parsed.

```diff
-from varfish_db_downloader.versions import (
+import os
+from varfish_db_downloader.versions import (
```
🤖 Prompt for AI Agents
In Snakefile around lines 43 to 44, the code uses os.environ but the os module
is not imported, causing a NameError at runtime. Add an import statement for the
os module at the top of the Snakefile to fix this issue.
Actionable comments posted: 3
🧹 Nitpick comments (2)
excerpt-data/e8a29ff831bf8ee3/hgnc_complete_set.json (1)
Lines 1-3: Ignore LFS pointer file in JSON linting

This file is a Git LFS pointer, not actual JSON—Biome’s errors here are false positives. Please update your lint configuration (e.g., add `excerpt-data/**/*.json` to `.biomeignore`, or adjust `.gitattributes`) so these pointer files are excluded from JSON parsing.

varfish_db_downloader/wget.py (1)
Lines 62-64: Use the logger instead of raw `print()`

`print()` bypasses the project’s structured logging and can be swallowed in daemonised contexts.

```diff
- print(f"Error fetching URL: {e}")
+ logger.error("Error fetching URL {}: {}", url, e)
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (12)
- excerpt-data/0ccc4915e7ecfd38/mehari-data-txs-GRCh37-refseq-0.10.3.bin.zst (excluded by `!**/*.zst`)
- excerpt-data/18c67f3b60f24667/mehari-data-txs-GRCh38-refseq-0.10.3.bin.zst (excluded by `!**/*.zst`)
- excerpt-data/1963f3c58ea066be/omim_unmapped_terms.tsv (excluded by `!**/*.tsv`)
- excerpt-data/5bcb7090fe0e881b/mehari-data-txs-GRCh37-ensembl-0.10.3.bin.zst (excluded by `!**/*.zst`)
- excerpt-data/615312ce3f5fc1bf/OMIMinDO.tsv (excluded by `!**/*.tsv`)
- excerpt-data/617bebe58c82f24e/CTD_diseases.tsv.gz (excluded by `!**/*.gz`)
- excerpt-data/649dbe83da5f50d9/mehari-data-txs-GRCh38-ensembl-0.10.3.bin.zst (excluded by `!**/*.zst`)
- excerpt-data/b31e9b26b7aeae3f/dbNSFP4.5c.zip (excluded by `!**/*.zip`)
- excerpt-data/bdc69c1e4cafdfaa/clinvar-data-extract-vars-20240612+0.17.0.tar.gz (excluded by `!**/*.gz`)
- excerpt-data/c243f918f5fc10d2/dbNSFP4.5a.zip (excluded by `!**/*.zip`)
- excerpt-data/caba5539b2dac784/clinvar-data-extract-vars-20250410+0.18.5.tar.gz (excluded by `!**/*.gz`)
- excerpt-data/cbc00fe1a5b22fe8/dbscSNV1.1.zip (excluded by `!**/*.zip`)
📒 Files selected for processing (20)
- download_urls.yml (3 hunks)
- excerpt-data/0ccc4915e7ecfd38/url.txt (1 hunks)
- excerpt-data/111d8c6e08038f62/20 (1 hunks)
- excerpt-data/18c67f3b60f24667/url.txt (1 hunks)
- excerpt-data/5bcb7090fe0e881b/url.txt (1 hunks)
- excerpt-data/649dbe83da5f50d9/url.txt (1 hunks)
- excerpt-data/652646c24140df2a/mondo.obo (1 hunks)
- excerpt-data/6f378db589a4bbb9/orphacodes (1 hunks)
- excerpt-data/91f964d1aa8367a5/20 (1 hunks)
- excerpt-data/b31e9b26b7aeae3f/url.txt (1 hunks)
- excerpt-data/bdc69c1e4cafdfaa/url.txt (0 hunks)
- excerpt-data/c243f918f5fc10d2/url.txt (1 hunks)
- excerpt-data/c9c7d6df0e24b954/__index__ (1 hunks)
- excerpt-data/caba5539b2dac784/url.txt (1 hunks)
- excerpt-data/cbc00fe1a5b22fe8/url.txt (1 hunks)
- excerpt-data/cdaaf7a3f7595d3d/__index__ (1 hunks)
- excerpt-data/e8a29ff831bf8ee3/hgnc_complete_set.json (1 hunks)
- excerpt-data/e8a29ff831bf8ee3/url.txt (1 hunks)
- excerpt-data/ebc07f725c64907d/__index__ (1 hunks)
- varfish_db_downloader/wget.py (4 hunks)
💤 Files with no reviewable changes (1)
- excerpt-data/bdc69c1e4cafdfaa/url.txt
✅ Files skipped from review due to trivial changes (16)
- excerpt-data/649dbe83da5f50d9/url.txt
- excerpt-data/91f964d1aa8367a5/20
- excerpt-data/c9c7d6df0e24b954/__index__
- excerpt-data/e8a29ff831bf8ee3/url.txt
- excerpt-data/cbc00fe1a5b22fe8/url.txt
- excerpt-data/0ccc4915e7ecfd38/url.txt
- excerpt-data/cdaaf7a3f7595d3d/__index__
- excerpt-data/b31e9b26b7aeae3f/url.txt
- excerpt-data/5bcb7090fe0e881b/url.txt
- excerpt-data/6f378db589a4bbb9/orphacodes
- excerpt-data/c243f918f5fc10d2/url.txt
- excerpt-data/18c67f3b60f24667/url.txt
- excerpt-data/caba5539b2dac784/url.txt
- excerpt-data/111d8c6e08038f62/20
- excerpt-data/ebc07f725c64907d/__index__
- excerpt-data/652646c24140df2a/mondo.obo
🚧 Files skipped from review as they are similar to previous changes (1)
- download_urls.yml
🧰 Additional context used
🪛 Ruff (0.11.9)
varfish_db_downloader/wget.py
15-15: urllib.parse.urlparse
imported but unused
Remove unused import: urllib.parse.urlparse
(F401)
🪛 GitHub Actions: CI
varfish_db_downloader/wget.py
[error] 15-15: flake8: 'urllib.parse.urlparse' imported but unused (F401)
🪛 Biome (1.9.4)
excerpt-data/e8a29ff831bf8ee3/hgnc_complete_set.json
[error] 1-1: String values must be double quoted.
(parse)
[error] 1-1: String values must be double quoted.
(parse)
[error] 1-3: End of file expected
Use an array for a sequence of values: [1, 2]
(parse)
[error] 3-3: End of file expected
Use an array for a sequence of values: [1, 2]
(parse)
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: Check-Remote-URLs
```python
def get_filename_from_url(url: str) -> str | None:
    """
    Extracts the filename from a URL.

    First, it tries to get the filename from the 'Content-Disposition' header.
    If not found, it falls back to parsing the URL path.

    Args:
        url: The URL of the file to inspect.

    Returns:
        The detected filename as a string, or None if an error occurs
        or no filename can be found.
    """
    try:
        # We use stream=True to avoid downloading the file content
        with requests.get(url, allow_redirects=True, stream=True) as response:
            # Raise an exception for bad status codes (4xx or 5xx)
            response.raise_for_status()

            # 1. Check for Content-Disposition header
            content_disposition = response.headers.get("content-disposition")
            if content_disposition:
                # Parse the header to extract the filename
                _, params = cgi.parse_header(content_disposition)
                if "filename" in params:
                    return params["filename"]

            # 2. If no header, fall back to the URL path
            path = urllib.parse.urlparse(
                response.url
            ).path  # Use response.url to get the final URL after redirects
            filename = os.path.basename(path)
            if filename:
                return filename

    except requests.exceptions.RequestException as e:
        print(f"Error fetching URL: {e}")
        return None

    # 3. If all else fails
    return None
```
🛠️ Refactor suggestion
Modernise `get_filename_from_url`: avoid the deprecated `cgi` module, add a timeout, and use a cheaper `HEAD` probe

- `cgi.parse_header` has been deprecated since 3.11 and will be removed in 3.13 (PEP 594).
- A plain `GET` (even with `stream=True`) is still heavier than a `HEAD`; many servers provide the same headers with no body.
- No timeout ⇒ risk of hanging download workers.
- Handling of RFC 5987 `filename*=` parameters is missing.

```diff
-import cgi
+from email.message import Message

-        with requests.get(url, allow_redirects=True, stream=True) as response:
+        # Cheap probe: try HEAD first, fall back to GET if the server forbids it.
+        try:
+            response = requests.head(url, allow_redirects=True, timeout=10)
+            if response.status_code == 405:  # Method Not Allowed
+                response = requests.get(url, allow_redirects=True, stream=True, timeout=10)
+        except requests.exceptions.RequestException:
+            # second attempt with GET already failed (or HEAD failed hard)
+            raise
+        # Use context-manager semantics
+        with response:

-            content_disposition = response.headers.get("content-disposition")
-            if content_disposition:
-                # Parse the header to extract the filename
-                _, params = cgi.parse_header(content_disposition)
-                if "filename" in params:
-                    return params["filename"]
+            cd_header = response.headers.get("content-disposition")
+            if cd_header:
+                msg = Message()
+                msg["content-disposition"] = cd_header
+                params = msg.get_params(header="content-disposition", failobj=[])
+                for key, val in params:
+                    if key in ("filename", "filename*"):
+                        # RFC 5987 handling for filename*=UTF-8''name.ext
+                        return urllib.parse.unquote(val.split("''")[-1])
```
These tweaks keep us future-proof and more efficient.
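For reference, the `email.message` parsing trick in isolation (standard library only):

```python
from email.message import Message

msg = Message()
msg["content-disposition"] = 'attachment; filename="data.tsv.gz"'
params = msg.get_params(header="content-disposition", failobj=[])
print(dict(params).get("filename"))  # -> data.tsv.gz
```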
🤖 Prompt for AI Agents
In varfish_db_downloader/wget.py from lines 26 to 68, update
get_filename_from_url to replace the deprecated cgi.parse_header with a modern
alternative like email.message.Message for parsing headers, add a timeout
parameter to the requests call to prevent hanging, switch from using a GET
request to a HEAD request to reduce bandwidth usage while still retrieving
headers, and enhance filename extraction by handling RFC 5987 encoded filename*
parameters in the Content-Disposition header for better compatibility.
Summary by CodeRabbit

- New Features
- Bug Fixes
- Improvements
- Chores