Conversation
|
Is the config file really necessary? |
dlaehnemann
left a comment
There was a problem hiding this comment.
This looks good already. But of course, I have a bunch of suggestions. Maybe we can work on those together during the symposium or even the workshop.
|
|
||
| rule get_inphareddb: | ||
| output: | ||
| expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) |
There was a problem hiding this comment.
One of the things we try with these wrappers, is to have them work with arbitrary file names. So the config[] entries should not be used in this example Snakefile, but rather in the wrapper.py file (via snakemake.params.date, for example).
| expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) | |
| "resources/inphared.fasta" |
To add the values in the config variables back into the file name, the users of the wrapper should then add python code around it. But I also like the idea of directly showcasing how to do that here. So maybe we could have two versions of calling the wrapper here, one with a fixed file name (like i suggest here), and one that contains the config[] entries.
| @@ -0,0 +1,4 @@ | |||
| name: inphared-db | |||
| description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file. | |||
There was a problem hiding this comment.
| description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file. | |
| description: Download sequence file from the [inphared database](https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the above link for available database version and adjust the config file. |
| output: | ||
| expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) | ||
| params: | ||
| prefix = config["prefix"], |
There was a problem hiding this comment.
Maybe rename this to url:, as that might make its purpose a bit clearer?
| prefix = config["prefix"], | |
| url = config["url"], |
Obviously, the same rename then applies in the config.yaml file.
| @@ -0,0 +1,29 @@ | |||
| rule get_genome: | |||
There was a problem hiding this comment.
This file can be deleted here, right? It's simply a copy-paste leftover, if I understand this correctly.
| @@ -0,0 +1,80 @@ | |||
| __author__ = "Johannes Köster" | |||
There was a problem hiding this comment.
This file can simply be deleted here, right?
| @@ -0,0 +1,30 @@ | |||
| rule get_genome: | |||
There was a problem hiding this comment.
This file can simply be deleted here, right?
| shell: | ||
| "curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}" |
There was a problem hiding this comment.
Some little things here:
- You have to reference anything from the snakemake rule via the
snakemakedict. So for examplesnakemake.params.date. - We should only use the full
outputfile name here, so that users can put in there whatever they want. - We have to call the
shell()function here, as this is not the body of an actual rule, but rather a plain python script. - We have to use the
f""construct here to use format strings to fill in the variables in this context. Here, we are not dealing with snakemake wildcards, but rather python variables. The{}syntax is the same, so this is very confusing... - I would manually put in the separator between URL and file name here, just for a slightly clearer structure of the download link. Then, we can remove the trailing
/in the url inconfig.yaml.
| shell: | |
| "curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}" | |
| shell(f"curl {snakemake.params.url}/{snakemake.params.date}{snakemake.params.suffix} -o {snakemake.output}" |
| prefix: | ||
| "https://millardlab-inphared.s3.climb.ac.uk/" |
There was a problem hiding this comment.
To keep this in sync with the suggestions elsewhere, we would change this to:
| prefix: | |
| "https://millardlab-inphared.s3.climb.ac.uk/" | |
| url: | |
| "https://millardlab-inphared.s3.climb.ac.uk" |
| "2Jul2023" | ||
|
|
||
| suffix: | ||
| "_refseq_genomes.fa" |
There was a problem hiding this comment.
Is there a smaller reference fasta file (of some small subset of viruses, e.g.), that could be used in the example? This should optimally get executed in the CI tests regularly, so shouldn't download much, if possible.
Also, we should still add an actual test run for this wrapper.
|
This PR was marked as stale because it has been open for 6 months with no activity. |
📝 WalkthroughWalkthroughSeveral new files were added under Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant Snakemake
participant Wrapper (wrapper.py)
participant Remote Server
User->>Snakemake: Launch workflow (Snakefile)
Snakemake->>Wrapper (wrapper.py): Run shell command with params (prefix, date, suffix)
Wrapper (wrapper.py)->>Remote Server: curl {prefix}{date}{suffix}
Remote Server-->>Wrapper (wrapper.py): Send file data
Wrapper (wrapper.py)-->>Snakemake: File saved locally
Snakemake-->>User: Workflow completes, output available
sequenceDiagram
participant User
participant Snakemake
participant Old Wrapper (old_wrapper.py)
participant Ensembl FTP
User->>Snakemake: Launch workflow (old_snakefile.smk/old_release.smk)
Snakemake->>Old Wrapper (old_wrapper.py): Pass parameters (species, release, build, etc.)
Old Wrapper (old_wrapper.py)->>Ensembl FTP: Check for file existence (curl)
alt File exists
Ensembl FTP-->>Old Wrapper (old_wrapper.py): File found
Old Wrapper (old_wrapper.py)->>Ensembl FTP: Download and decompress file
Ensembl FTP-->>Old Wrapper (old_wrapper.py): Send file data
Old Wrapper (old_wrapper.py)-->>Snakemake: File saved locally
else File not found
Old Wrapper (old_wrapper.py)-->>Snakemake: Print error, exit with failure
end
Snakemake-->>User: Workflow completes or fails with error
Tip ⚡️ Faster reviews with caching
Enjoy the performance boost—your workflow just got faster. ✨ Finishing Touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
SupportNeed help? Create a ticket on our support page for assistance with any issues or questions. Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Actionable comments posted: 4
♻️ Duplicate comments (9)
bio/reference/inphared-db/meta.yaml (1)
2-2: Fix trailing spaces and improve description format.The description has trailing spaces at the end of the line, which is flagged by the YAML linter. Also, consider using the Markdown link format as previously suggested.
-description: Download sequence file from the Inphared database (https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the current database available at the above link and adjust the config file. +description: Download sequence file from the [inphared database](https://github.com/RyanCook94/inphared/blob/main/README.md), and store them in a single .fasta file. Please check the above link for available database version and adjust the config file.🧰 Tools
🪛 YAMLlint (1.35.1)
[error] 2-2: trailing spaces
(trailing-spaces)
bio/reference/inphared-db/wrapper.py (1)
8-9:⚠️ Potential issueFix syntax errors in shell command implementation.
The current implementation has syntax errors and doesn't properly use the Snakemake API. The shell command should use Python syntax, not Snakemake rule syntax.
Based on previous feedback, implement these changes:
- Use the
shell()function directly- Access parameters through the
snakemakeobject- Use f-strings for interpolation
- Use the output file specified by the Snakemake rule
- Add separator between URL components for clarity
- shell: - "curl {params.prefix}{params.date}{params.suffix} -o {params.date}{params.suffix}" +shell(f"curl {snakemake.params.url}/{snakemake.params.date}{snakemake.params.suffix} -o {snakemake.output}")Note: This assumes that the config.yaml has been updated to use
urlinstead ofprefix(without trailing slash) as suggested in previous reviews.🧰 Tools
🪛 Ruff (0.8.2)
8-8: SyntaxError: Unexpected indentation
8-9: SyntaxError: Expected an expression
9-9: SyntaxError: Unexpected indentation
🪛 GitHub Actions: Code quality
[error] 8-8: Black formatting check failed: Cannot parse file at line 8. File could not be formatted.
bio/reference/inphared-db/test/config.yaml (2)
8-9: Update prefix to url without trailing slash.As suggested in previous reviews, rename the prefix key to url and remove the trailing slash.
-prefix: - "https://millardlab-inphared.s3.climb.ac.uk/" +url: + "https://millardlab-inphared.s3.climb.ac.uk"
4-6: Consider using a smaller reference file for testing.The current reference file (_refseq_genomes.fa) may be large, which could be problematic for CI testing. As suggested in a previous review, find a smaller subset of viruses to use as a test reference.
Is there a smaller test file available from the inphared database that could be used for CI testing purposes? This would improve test execution times and reduce resource usage.
bio/reference/inphared-db/test/Snakefile (2)
5-5: Use arbitrary file names instead of direct config referencesOne of the best practices for these wrappers is to work with arbitrary file names. The
config[]entries should not be used directly in the output path in this example Snakefile, but rather in thewrapper.pyfile.Consider showing both approaches - one with a fixed filename and one with config values:
- output: - expand("{date}{suffix}", date=config["date"], suffix=config["suffix"]) + output: + "resources/inphared.fasta"Users can add Python code to incorporate config variables in their own workflow if needed.
7-7: Rename parameter from 'prefix' to 'url' for clarityRenaming this parameter would make its purpose clearer to users of the wrapper.
- prefix = config["prefix"], + url = config["url"],Remember to update the corresponding entry in the config.yaml file as well.
bio/reference/inphared-db/test/old_release.smk (1)
1-14: Remove this file if it's not needed for the inphared-db wrapperThis file appears to be unrelated to the inphared-db wrapper functionality and might be a copy-paste leftover as noted in previous review comments.
If this file is not needed for the inphared-db wrapper, it should be removed to prevent confusion.
bio/reference/inphared-db/test/old_snakefile.smk (1)
1-14: Remove this file if it's not needed for the inphared-db wrapperThis file appears to be unrelated to the inphared-db wrapper functionality and might be a leftover file as noted in previous review comments.
If this file is not needed for the inphared-db wrapper, it should be removed to prevent confusion.
bio/reference/inphared-db/old_wrapper.py (1)
1-5: Remove this file if it's not needed for the inphared-db wrapperThis file appears to be handling Ensembl downloads rather than inphared database downloads, which seems inconsistent with the PR title. Previous review comments suggested this file might be unnecessary.
If this file is not needed for the inphared-db wrapper, it should be removed to prevent confusion.
🧹 Nitpick comments (2)
bio/reference/inphared-db/old_wrapper.py (2)
8-8: Remove unused importThe
itertools.productimport is never used in this script.-from itertools import product🧰 Tools
🪛 Ruff (0.8.2)
8-8:
itertools.productimported but unusedRemove unused import:
itertools.product(F401)
47-52: Simplify nested if statementsThe nested if statements can be combined for better readability.
-if chromosome: - if not datatype == "dna": - raise ValueError( - "invalid datatype, to select a single chromosome the datatype must be dna" - ) +if chromosome and datatype != "dna": + raise ValueError( + "invalid datatype, to select a single chromosome the datatype must be dna" + )🧰 Tools
🪛 Ruff (0.8.2)
47-48: Use a single
ifstatement instead of nestedifstatementsCombine
ifstatements usingand(SIM102)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (8)
bio/reference/inphared-db/environment.yaml(1 hunks)bio/reference/inphared-db/meta.yaml(1 hunks)bio/reference/inphared-db/old_wrapper.py(1 hunks)bio/reference/inphared-db/test/Snakefile(1 hunks)bio/reference/inphared-db/test/config.yaml(1 hunks)bio/reference/inphared-db/test/old_release.smk(1 hunks)bio/reference/inphared-db/test/old_snakefile.smk(1 hunks)bio/reference/inphared-db/wrapper.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
`**/*.py`: Do not try to improve formatting. Do not suggest type annotations for functions that are defined inside of functions or methods. Do not suggest type annotation of the `s...
**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theselfargument of methods.
Do not suggest type annotation of theclsargument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturnstatement.
bio/reference/inphared-db/wrapper.pybio/reference/inphared-db/old_wrapper.py
`**/wrapper.py`: Do not complain about use of undefined variable called `snakemake`.
**/wrapper.py: Do not complain about use of undefined variable calledsnakemake.
bio/reference/inphared-db/wrapper.py
🪛 YAMLlint (1.35.1)
bio/reference/inphared-db/meta.yaml
[error] 2-2: trailing spaces
(trailing-spaces)
🪛 Ruff (0.8.2)
bio/reference/inphared-db/wrapper.py
8-8: SyntaxError: Unexpected indentation
8-9: SyntaxError: Expected an expression
9-9: SyntaxError: Unexpected indentation
bio/reference/inphared-db/old_wrapper.py
8-8: itertools.product imported but unused
Remove unused import: itertools.product
(F401)
11-11: Undefined name snakemake
(F821)
12-12: Undefined name snakemake
(F821)
13-13: Undefined name snakemake
(F821)
19-19: Undefined name snakemake
(F821)
20-20: Undefined name snakemake
(F821)
22-22: Undefined name snakemake
(F821)
29-29: Undefined name snakemake
(F821)
30-30: Undefined name snakemake
(F821)
47-48: Use a single if statement instead of nested if statements
Combine if statements using and
(SIM102)
🪛 GitHub Actions: Code quality
bio/reference/inphared-db/wrapper.py
[error] 8-8: Black formatting check failed: Cannot parse file at line 8. File could not be formatted.
bio/reference/inphared-db/test/Snakefile
[error] 1-1: WorkflowError: configfile 'config.yaml' is defined but not present or accessible at /home/runner/work/snakemake-wrappers/snakemake-wrappers/config.yaml.
⏰ Context from checks skipped due to timeout of 90000ms (1)
- GitHub Check: testing
🔇 Additional comments (2)
bio/reference/inphared-db/environment.yaml (1)
1-5: Environment configuration looks appropriate.The environment configuration correctly specifies conda-forge with nodefaults and includes the curl dependency, which is necessary for downloading from remote URLs.
bio/reference/inphared-db/test/config.yaml (1)
1-9: Verify if this configuration file is necessary.One of the PR comments questioned if this configuration file is truly necessary. Consider whether these parameters could be directly specified in the Snakefile rules or if this level of configuration provides valuable flexibility to users.
Could you clarify the purpose of having a separate config file rather than hardcoding these parameters in the wrapper or allowing them to be passed directly as rule parameters?
| @@ -0,0 +1,12 @@ | |||
| configfile: "config.yaml" | |||
There was a problem hiding this comment.
Fix the pipeline failure by ensuring config.yaml exists
The pipeline is failing because the config.yaml file is missing or inaccessible. Make sure the config file exists in the same directory as this Snakefile and has the correct permissions.
#!/bin/bash
# Check if config.yaml exists in the same directory as this Snakefile
if [ -f "bio/reference/inphared-db/test/config.yaml" ]; then
echo "config.yaml exists"
else
echo "config.yaml does not exist in the expected location"
fi🧰 Tools
🪛 GitHub Actions: Code quality
[error] 1-1: WorkflowError: configfile 'config.yaml' is defined but not present or accessible at /home/runner/work/snakemake-wrappers/snakemake-wrappers/config.yaml.
| release="75", | ||
| chromosome="I", | ||
| log: | ||
| "logs/get_genome.log", |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Use separate log files for different rules
Both get_genome and get_chromosome rules use the same log file, which could cause conflicts if both rules run concurrently.
- log:
- "logs/get_genome.log",
+ log:
+ "logs/get_chromosome.log",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "logs/get_genome.log", | |
| log: | |
| "logs/get_chromosome.log", |
| release="98", | ||
| log: | ||
| "logs/get_genome.log", | ||
| cache: "mit-software" # save space and time with between workflow caching (see docs) |
There was a problem hiding this comment.
Fix typo in cache parameter
There appears to be a typo in the cache parameter. This rule uses "mit-software" while the other rule uses "omit-software".
- cache: "mit-software" # save space and time with between workflow caching (see docs)
+ cache: "omit-software" # save space and time with between workflow caching (see docs)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| cache: "mit-software" # save space and time with between workflow caching (see docs) | |
| cache: "omit-software" # save space and time with between workflow caching (see docs) |
| chromosome="I", # optional: restrict to chromosome | ||
| # branch="plants", # optional: specify branch | ||
| log: | ||
| "logs/get_genome.log", |
There was a problem hiding this comment.
🛠️ Refactor suggestion
Use separate log files for different rules
Both get_genome and get_chromosome rules use the same log file, which could cause conflicts if both rules run concurrently.
- log:
- "logs/get_genome.log",
+ log:
+ "logs/get_chromosome.log",📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "logs/get_genome.log", | |
| log: | |
| "logs/get_chromosome.log", |
|
This PR was marked as stale because it has been open for 6 months with no activity. |
Description
Adding a new wrapper to download the current inphared database.
QC
For all wrappers added by this PR,
input:andoutput:file paths in the resulting rule can be changed arbitrarily,threads: xstatement withxbeing a reasonable default,map_readsfor a step that maps reads),environment.yamlspecifications follow the respective best practices,input:oroutput:),Snakefiles and their entries are explained via comments (input:/output:/params:etc.),stderrand/orstdoutare logged correctly (log:), depending on the wrapped tool,tempfile.gettempdir()points to (see here; this also means that using any Pythontempfiledefault behavior works),meta.yamlcontains a link to the documentation of the respective tool or command,Snakefiles pass the linting (snakemake --lint),Snakefiles are formatted with snakefmt,Summary by CodeRabbit