Skip to content

feat: Add mmseqs2 main workflows#3981

Merged
fgvieira merged 31 commits intosnakemake:masterfrom
fgvieira:mmseqs2
Feb 27, 2026
Merged

feat: Add mmseqs2 main workflows#3981
fgvieira merged 31 commits intosnakemake:masterfrom
fgvieira:mmseqs2

Conversation

@fgvieira
Copy link
Copy Markdown
Collaborator

@fgvieira fgvieira commented Apr 4, 2025

QC

While the contributions guidelines are more extensive, please particularly ensure that:

  • test.py was updated to call any added or updated example rules in a Snakefile
  • input: and output: file paths in the rules can be chosen arbitrarily
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:)
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to
  • the meta.yaml contains a link to the documentation of the respective tool or command under url:
  • conda environments use a minimal amount of channels and packages, in recommended ordering

Summary by CodeRabbit

  • New Features

    • Added MMseqs2 wrappers, workflow rules, and DB creation support for search, clustering, linclust, taxonomy, and RBH; workflow metadata and parameters.
  • Tests

    • Added end-to-end tests, many expected outputs, and database creation fixtures.
  • Documentation

    • Added SILVA export README and workflow metadata for discoverability.
  • Chores

    • Added pinned and YAML Conda environment specs for reproducible installs.

@fgvieira fgvieira marked this pull request as draft April 4, 2025 13:38
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 4, 2025

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds MMseqs2 Snakemake wrappers (workflow + DB), metadata and Conda environment specs (YAML + linux-64 pin), test Snakefiles, a test runner, and many static test fixtures/expected outputs for DB creation, search, clustering, linclust, taxonomy, and RBH.

Changes

Cohort / File(s) Summary
Conda environments
bio/mmseqs2/db/environment.yaml, bio/mmseqs2/db/environment.linux-64.pin.txt, bio/mmseqs2/workflows/environment.yaml, bio/mmseqs2/workflows/environment.linux-64.pin.txt
Add environment YAMLs and explicit linux-64 pin files listing exact package URLs and pinned versions (mmseqs2, snakemake-wrapper-utils, runtime libs) for reproducible environments.
Metadata / Manifests
bio/mmseqs2/db/meta.yaml, bio/mmseqs2/workflows/meta.yaml
Add workflow and DB metadata declaring name, url, description, authors, I/O schema and params (module, extra).
Wrappers
bio/mmseqs2/db/wrapper.py, bio/mmseqs2/workflows/wrapper.py
New Snakemake wrappers that normalize inputs/outputs, assemble command-line args (module, extra, threads, tmpdir), special-case DB modules, and execute mmseqs2 via shell; workflow wrapper includes module-level metadata.
DB rules & inputs
bio/mmseqs2/db/test/Snakefile, bio/mmseqs2/db/test/seqs/a.fasta
Add test Snakefile with mmseqs2_databases and mmseqs2_createdb rules and small FASTA test input.
DB expected fixtures
bio/mmseqs2/db/test/expected/createdb/*, bio/mmseqs2/db/test/expected/databases/*
Add static expected outputs for createdb/databases (index, lookup, source, version README, _h.* files).
Workflow rules (tests)
bio/mmseqs2/workflows/test/Snakefile
Add test rules for search, cluster, linclust, taxonomy, and rbh with multiext outputs, logs, params and wrapper references.
Workflow DB fixtures
bio/mmseqs2/workflows/test/db/*
Add static DB fixture files (a.index, a.lookup, a.source, a_h.index, a_mapping).
Workflow expected outputs
bio/mmseqs2/workflows/test/expected/cluster/*, bio/mmseqs2/workflows/test/expected/linclust/*, bio/mmseqs2/workflows/test/expected/search/a.tab, bio/mmseqs2/workflows/test/expected/rbh/a.tab, bio/mmseqs2/workflows/test/expected/taxonomy/a_report
Add expected FASTA files, representative sequences, alignment/tab outputs and taxonomy report used by workflow tests.
Top-level tests
test_wrappers.py
Add test function test_mmseqs2(run) that executes both workflow and DB test suites and compares outputs against expected fixtures.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Rule as Snakemake Rule
  participant Wrapper as MMseqs2 Wrapper
  participant MM as mmseqs2 CLI
  participant FS as Filesystem

  Rule->>Wrapper: invoke(inputs, params(module, extra), threads, log)
  Note over Wrapper: normalize inputs/outputs\nresolve common prefixes\nconfigure tmpdir/threads/extra
  Wrapper->>MM: mmseqs2 <module> <query?> <target?> <output> --threads N <extra> (uses tmpdir)
  MM->>FS: read input files
  MM-->>FS: write outputs (DBs, tabs, FASTA, reports)
  MM-->>Wrapper: exit status
  Wrapper-->>Rule: write log, expose outputs
Loading
sequenceDiagram
  autonumber
  participant RuleDB as Snakemake DB Rule
  participant DBWrapper as MMseqs2 DB Wrapper
  participant MM as mmseqs2 CLI
  participant FS as Filesystem

  RuleDB->>DBWrapper: invoke(seqs input, params, threads, log)
  Note over DBWrapper: special-case modules:\n- databases: append thread flags\n- createdb: disable tmpdir
  DBWrapper->>MM: mmseqs2 <module> <in> <out> [--threads N] <extra>
  MM->>FS: read seqs
  MM-->>FS: emit DB artifacts (.index, .lookup, .source, _h.*)
  MM-->>DBWrapper: exit status
  DBWrapper-->>RuleDB: log and outputs
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • johanneskoester
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically summarizes the main change: adding MMseqs2 main workflows to the repository, following conventional commit format.
Description check ✅ Passed The PR description is complete and follows the template, with all QC checklist items checked and confirmed by the contributor as completed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@fgvieira fgvieira marked this pull request as ready for review September 24, 2025 15:43
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (10)
bio/mmseqs2/db/test/seqs/a.fasta (1)

1-8: Confirm intentional duplicate FASTA IDs.

Headers “>1” appear twice. If this is deliberate to exercise lookup behavior, all good; otherwise consider making IDs unique to avoid tool-dependent collisions.

bio/mmseqs2/workflows/environment.linux-64.pin.txt (1)

1-44: Explicit pin looks consistent; verify CI can consume it as-is.

mmseqs2 and wrapper-utils are pinned; good for reproducibility. Please ensure CI uses conda/mamba versions that support @explicit with these URLs. Also note OpenSSL/libstdc++ versions differ from the db pin; that’s acceptable but consider aligning if unnecessary drift causes solver differences.

Note: Based on prior repo conventions I’ve learned, ensure environment.yaml (separate file) uses spaces around “=” in pins if updated.

bio/mmseqs2/db/test/expected/databases/a.version (1)

1-165: Verify licensing for bundled SILVA text; consider trimming the fixture.

This embeds substantial third‑party documentation text. Please confirm redistribution is permitted under its license. If the test doesn’t require full content, prefer a minimal synthetic placeholder or assert only on sentinel lines to avoid bundling large copyrighted text.

bio/mmseqs2/workflows/test/db/a.index (1)

1-4: Optional: de-duplicate identical fixtures.

Consider referencing a single canonical a.index fixture (or symlink) to avoid drift between workflow and db tests.

bio/mmseqs2/db/meta.yaml (1)

3-13: Polish description and I/O/param wording for clarity and consistency.

Capitalize the description; prefer “FASTA” over “FAS”; clarify “module” to reflect the MMseqs2 component being run.

 description: |
-  ultra fast and sensitive sequence search and clustering suite
+  Ultra-fast and sensitive sequence search and clustering suite
 input:
-  - input FAS file
+  - input FASTA file
 output:
-  - output: DB files
+  - output: MMseqs2 DB files
 params:
-  - module: workflow to use
+  - module: MMseqs2 module to run (e.g., createdb)
   - extra: additional program arguments
bio/mmseqs2/workflows/meta.yaml (1)

3-14: Clarify description and I/O wording; name “FASTA” explicitly; refine param help.

Minor copyedits for consistency with repo conventions.

 description: |
-  ultra fast and sensitive sequence search and clustering suite
+  Ultra-fast and sensitive sequence search and clustering suite
 input:
-  - query: input query FAS file(s)
-  - target: input target FAS file(s) or DB
+  - query: input query FASTA file(s)
+  - target: input target FASTA file(s) or MMseqs2 DB
 output:
-  - output: FAS, cluster or DB file(s)
+  - output: FASTA, cluster, or DB file(s)
 params:
-  - module: workflow to use
+  - module: MMseqs2 workflow to run (e.g., cluster, linclust, search)
   - extra: additional program arguments
bio/mmseqs2/workflows/environment.yaml (1)

5-7: Unpin snakemake-wrapper-utils; mmseqs2 17.b804f confirmed on Bioconda

Per repo convention pin only the main tool — keep mmseqs2 pinned, remove the snakemake-wrapper-utils pin.

 dependencies:
   - mmseqs2 =17.b804f
-  - snakemake-wrapper-utils =0.8.0
+  - snakemake-wrapper-utils
test_wrappers.py (1)

153-166: Remove duplicate key in compare_results_with_expected.

The key "out/cluster/a_b_cluster.tsv" appears twice; the second overwrites the first.

Apply this diff:

         compare_results_with_expected={
             "out/search/a.fas": "expected/search/a.fas",
             "out/cluster/a_b_cluster.tsv": "expected/cluster/a_b_cluster.tsv",
             "out/cluster/a_b_rep_seq.fasta": "expected/cluster/a_b_rep_seq.fasta",
             "out/cluster/a_b_all_seqs.fasta": "expected/cluster/a_b_all_seqs.fasta",
-            "out/cluster/a_b_cluster.tsv": "expected/cluster/a_b_cluster.tsv",
             "out/linclust/a_b_rep_seq.fasta": "expected/linclust/a_b_rep_seq.fasta",
             "out/linclust/a_b_all_seqs.fasta": "expected/linclust/a_b_all_seqs.fasta",
             "out/linclust/a_b_cluster.tsv": "expected/linclust/a_b_cluster.tsv",
             "out/taxonomy/a_tophit_report": "expected/taxonomy/a_tophit_report",
             "out/taxonomy/a_tophit_aln": "expected/taxonomy/a_tophit_aln",
             "out/taxonomy/a_report": "expected/taxonomy/a_report",
             "out/taxonomy/a_lca.tsv": "expected/taxonomy/a_lca.tsv",
             "out/rbh/a.fas": "expected/rbh/a.fas",
         },
bio/mmseqs2/db/wrapper.py (1)

23-31: Make module checks robust when params.module includes arguments.

params.module can contain arguments (e.g., "databases SILVA"), so equality checks miss "databases". Split before checking; also avoid passing a dummy tmpdir for modules not requiring it.

Apply this diff:

 with tempfile.TemporaryDirectory() as tmpdir:
-    # Modules with threads
-    if snakemake.params.module in ["databases"]:
-        extra = f"--threads {snakemake.threads} {extra}"
-    # Modules with no temp folder
-    if snakemake.params.module in ["createdb"]:
-        tmpdir = ""
-
-    shell("mmseqs {snakemake.params.module} {input} {out} {tmpdir} {extra} {log}")
+    module = snakemake.params.module
+    module_name = module.split()[0]
+    # Modules with threads
+    if module_name == "databases":
+        extra = f"--threads {snakemake.threads} {extra}"
+    # Modules with no temp folder
+    tmp_arg = "" if module_name == "createdb" else tmpdir
+
+    shell("mmseqs {module} {input} {out} {tmp_arg} {extra} {log}")
bio/mmseqs2/workflows/wrapper.py (1)

14-17: Use the local variable for target prefixing.

Minor clarity fix: use the already-resolved target when getting the common prefix.

Apply this diff:

 target = snakemake.input.get("target", "")
 if isinstance(target, list):
-    target = os.path.commonprefix(snakemake.input.target)
+    target = os.path.commonprefix(target)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 168b48b and 187892a.

⛔ Files ignored due to path filters (5)
  • bio/mmseqs2/workflows/test/expected/cluster/a_b_cluster.tsv is excluded by !**/*.tsv
  • bio/mmseqs2/workflows/test/expected/linclust/a_b_cluster.tsv is excluded by !**/*.tsv
  • bio/mmseqs2/workflows/test/expected/taxonomy/a_lca.tsv is excluded by !**/*.tsv
  • bio/mmseqs2/workflows/test/seqs/nucl.a.fas.gz is excluded by !**/*.gz
  • bio/mmseqs2/workflows/test/seqs/nucl.b.fas.gz is excluded by !**/*.gz
📒 Files selected for processing (30)
  • bio/mmseqs2/db/environment.linux-64.pin.txt (1 hunks)
  • bio/mmseqs2/db/environment.yaml (1 hunks)
  • bio/mmseqs2/db/meta.yaml (1 hunks)
  • bio/mmseqs2/db/test/Snakefile (1 hunks)
  • bio/mmseqs2/db/test/expected/createdb/a.index (1 hunks)
  • bio/mmseqs2/db/test/expected/createdb/a.lookup (1 hunks)
  • bio/mmseqs2/db/test/expected/createdb/a.source (1 hunks)
  • bio/mmseqs2/db/test/expected/createdb/a_h.index (1 hunks)
  • bio/mmseqs2/db/test/expected/databases/a.source (1 hunks)
  • bio/mmseqs2/db/test/expected/databases/a.version (1 hunks)
  • bio/mmseqs2/db/test/seqs/a.fasta (1 hunks)
  • bio/mmseqs2/db/wrapper.py (1 hunks)
  • bio/mmseqs2/workflows/environment.linux-64.pin.txt (1 hunks)
  • bio/mmseqs2/workflows/environment.yaml (1 hunks)
  • bio/mmseqs2/workflows/meta.yaml (1 hunks)
  • bio/mmseqs2/workflows/test/Snakefile (1 hunks)
  • bio/mmseqs2/workflows/test/db/a.index (1 hunks)
  • bio/mmseqs2/workflows/test/db/a.lookup (1 hunks)
  • bio/mmseqs2/workflows/test/db/a.source (1 hunks)
  • bio/mmseqs2/workflows/test/db/a_h.index (1 hunks)
  • bio/mmseqs2/workflows/test/db/a_mapping (1 hunks)
  • bio/mmseqs2/workflows/test/expected/cluster/a_b_all_seqs.fasta (1 hunks)
  • bio/mmseqs2/workflows/test/expected/cluster/a_b_rep_seq.fasta (1 hunks)
  • bio/mmseqs2/workflows/test/expected/linclust/a_b_all_seqs.fasta (1 hunks)
  • bio/mmseqs2/workflows/test/expected/linclust/a_b_rep_seq.fasta (1 hunks)
  • bio/mmseqs2/workflows/test/expected/rbh/a.tab (1 hunks)
  • bio/mmseqs2/workflows/test/expected/search/a.tab (1 hunks)
  • bio/mmseqs2/workflows/test/expected/taxonomy/a_report (1 hunks)
  • bio/mmseqs2/workflows/wrapper.py (1 hunks)
  • test_wrappers.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • test_wrappers.py
  • bio/mmseqs2/db/wrapper.py
  • bio/mmseqs2/workflows/wrapper.py
**/wrapper.py

⚙️ CodeRabbit configuration file

Do not complain about use of undefined variable called snakemake.

Files:

  • bio/mmseqs2/db/wrapper.py
  • bio/mmseqs2/workflows/wrapper.py
🧠 Learnings (8)
📚 Learning: 2025-04-17T09:24:51.738Z
Learnt from: dlaehnemann
PR: snakemake/snakemake-wrappers#0
File: :0-0
Timestamp: 2025-04-17T09:24:51.738Z
Learning: In snakemake-wrappers repository, environment.yaml files should follow these conventions:
1. Use whitespace before the equal sign in version specifications (e.g., "datavzrd =2.53.1")
2. Only specify the exact version for the main software package
3. Don't add version constraints for dependencies unless absolutely necessary
4. See full guidelines at: https://snakemake-wrappers.readthedocs.io/en/stable/contributing.html#environment-yaml-file

Applied to files:

  • bio/mmseqs2/workflows/environment.yaml
  • bio/mmseqs2/db/environment.yaml
📚 Learning: 2025-04-17T09:24:51.738Z
Learnt from: dlaehnemann
PR: snakemake/snakemake-wrappers#0
File: :0-0
Timestamp: 2025-04-17T09:24:51.738Z
Learning: In snakemake-wrappers repository, environment.yaml files should follow these conventions:
1. Use whitespace before the equal sign in version specifications (e.g., "datavzrd =2.53.1")
2. Only specify the exact version for the main software package
3. Don't add version constraints for dependencies unless absolutely necessary
4. See guidelines at: https://snakemake-wrappers.readthedocs.io/en/stable/contributing.html#environment-yaml-file

Applied to files:

  • bio/mmseqs2/workflows/environment.yaml
📚 Learning: 2025-02-11T12:24:22.592Z
Learnt from: dlaehnemann
PR: snakemake/snakemake-wrappers#3648
File: bio/nanosim/simulator/environment.yaml:6-6
Timestamp: 2025-02-11T12:24:22.592Z
Learning: In the nanosim bioconda recipe, dependencies are carefully managed with specific version pins (e.g., scikit-learn ~=0.23.2) to ensure compatibility with pre-trained models. These dependencies don't need to be explicitly added to environment.yaml files when the main package is listed as a dependency, as they are handled through the bioconda recipe system.

Applied to files:

  • bio/mmseqs2/workflows/environment.yaml
  • bio/mmseqs2/db/environment.linux-64.pin.txt
  • bio/mmseqs2/workflows/environment.linux-64.pin.txt
📚 Learning: 2025-06-02T07:56:35.854Z
Learnt from: tdayris
PR: snakemake/snakemake-wrappers#4159
File: bio/pyfaidx/environment.yaml:6-6
Timestamp: 2025-06-02T07:56:35.854Z
Learning: In the Snakemake-wrapper repository, conda dependency version pinning in environment.yaml files uses spaces around the equals sign (e.g., `- pyfaidx =0.8.1.4`) as the established coding standard, even though conda itself doesn't require the spaces.

Applied to files:

  • bio/mmseqs2/workflows/environment.linux-64.pin.txt
📚 Learning: 2024-11-26T08:31:00.099Z
Learnt from: tdayris
PR: snakemake/snakemake-wrappers#3496
File: bio/mtnucratio/test/Snakefile:2-6
Timestamp: 2024-11-26T08:31:00.099Z
Learning: In test files for Snakemake wrappers, such as `bio/mtnucratio/test/Snakefile`, hard-coded input and output paths are acceptable as examples and do not need to use wildcards to make paths flexible.

Applied to files:

  • test_wrappers.py
📚 Learning: 2024-08-21T08:30:42.757Z
Learnt from: johanneskoester
PR: snakemake/snakemake-wrappers#3123
File: utils/datavzrd/wrapper.py:31-32
Timestamp: 2024-08-21T08:30:42.757Z
Learning: In `wrapper.py` scripts, do not flag the use of an undefined variable called `snakemake`.

Applied to files:

  • bio/mmseqs2/db/wrapper.py
  • bio/mmseqs2/workflows/wrapper.py
📚 Learning: 2024-11-21T10:23:03.427Z
Learnt from: johanneskoester
PR: snakemake/snakemake-wrappers#3478
File: bio/varlociraptor/estimate-alignment-properties/wrapper.py:5-12
Timestamp: 2024-11-21T10:23:03.427Z
Learning: In the Snakemake wrappers project, avoid suggesting extensive error handling or temporary file management in simple wrapper scripts when it may be unnecessary, to prevent overcomplicating the code.

Applied to files:

  • bio/mmseqs2/db/wrapper.py
  • bio/mmseqs2/workflows/wrapper.py
📚 Learning: 2024-10-08T17:41:54.542Z
Learnt from: johanneskoester
PR: snakemake/snakemake-wrappers#3123
File: utils/datavzrd/wrapper.py:31-32
Timestamp: 2024-10-08T17:41:54.542Z
Learning: The `snakemake` variable is inserted via a preamble during execution in `wrapper.py` scripts, so it doesn't need to be explicitly defined.

Applied to files:

  • bio/mmseqs2/workflows/wrapper.py
🪛 Ruff (0.13.1)
test_wrappers.py

157-157: Dictionary key literal "out/cluster/a_b_cluster.tsv" repeated

Remove repeated key literal "out/cluster/a_b_cluster.tsv"

(F601)

bio/mmseqs2/db/wrapper.py

9-9: Undefined name snakemake

(F821)


10-10: Undefined name snakemake

(F821)


13-13: Undefined name snakemake

(F821)


18-18: Undefined name snakemake

(F821)


25-25: Undefined name snakemake

(F821)


26-26: Undefined name snakemake

(F821)


28-28: Undefined name snakemake

(F821)

bio/mmseqs2/workflows/wrapper.py

10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


14-14: Undefined name snakemake

(F821)


16-16: Undefined name snakemake

(F821)


19-19: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: testing
🔇 Additional comments (31)
bio/mmseqs2/db/test/expected/createdb/a.lookup (1)

1-4: LGTM; please confirm column semantics are stable across OS/versions.

The mapping aligns with the duplicated IDs in the input FASTA. Please confirm the 3rd column (“0”) is the expected source/file index from mmseqs2 createdb and is stable across platforms.

bio/mmseqs2/workflows/test/expected/linclust/a_b_rep_seq.fasta (1)

1-4: LGTM.

Representative sequences look fine for the linclust expected output.

bio/mmseqs2/db/environment.linux-64.pin.txt (1)

1-33: LGTM; reproducible DB env pin.

Pins are minimal and include only what’s needed (mmseqs2, gawk, basics). Confirm CI path uses conda create --file to avoid accidental channel mixing.

bio/mmseqs2/workflows/test/expected/taxonomy/a_report (1)

1-1: LGTM.

Report format/fields look consistent for the taxonomy workflow expected output.

bio/mmseqs2/db/test/expected/createdb/a.source (1)

1-1: LGTM; confirm path semantics.

The source mapping “0a.fasta” matches common mmseqs2 conventions. Please confirm tests assume only basenames (not full paths) to keep runs stable across machines.

bio/mmseqs2/workflows/test/expected/rbh/a.tab (1)

1-1: Confirm column order vs mmseqs2 rbh --format-output.

RBH fields can differ by version/flags; please verify the wrapper’s format-output matches this expectation to avoid brittle tests.

bio/mmseqs2/db/test/expected/databases/a.source (1)

1-1: LGTM: expected database source mapping.

Content and tab delimiter look correct.

bio/mmseqs2/workflows/test/db/a.index (1)

1-4: LGTM: index fixture matches createdb expectation.

Looks consistent with the createdb/a.index expectation.

bio/mmseqs2/workflows/test/db/a_mapping (1)

1-5: LGTM: taxonomy mapping fixture.

Values and formatting look fine.

bio/mmseqs2/db/environment.yaml (1)

1-6: LGTM: environment.yaml follows repo conventions.

Channels ordering, nodefaults, and single pinned main dependency all align with guidelines.

If mmseqs2 version changes later, remember to refresh the corresponding pinned lockfile(s) to keep CI deterministic.

bio/mmseqs2/db/test/expected/createdb/a.index (1)

1-4: LGTM: createdb index expectation.

Matches the workflow fixture; tab-separated triplets are consistent.

bio/mmseqs2/workflows/test/db/a_h.index (1)

1-4: LGTM: header index fixture.

Consistent increments; formatting OK.

bio/mmseqs2/workflows/test/db/a.source (1)

1-1: LGTM: source mapping.

Tab-delimited mapping looks correct.

bio/mmseqs2/workflows/test/db/a.lookup (1)

1-4: LGTM: fixture content looks consistent.

Tab-delimited triplets align with the corresponding createdb expectations.

bio/mmseqs2/db/test/expected/createdb/a_h.index (1)

1-4: LGTM: expected index triplets look plausible.

Monotonic offsets with fixed block size are consistent.

bio/mmseqs2/workflows/test/expected/search/a.tab (1)

1-37: LGTM: expected TSV layout is coherent.

Columns/values look consistent with typical MMseqs2 search output.

bio/mmseqs2/workflows/test/expected/linclust/a_b_all_seqs.fasta (1)

1-6: Incorrect — no consecutive FASTA headers found. Repository check under bio/mmseqs2/workflows/test/expected found no consecutive '>' lines; the duplicate header-only records reported are not present.

Likely an incorrect or invalid review comment.

test_wrappers.py (3)

137-151: LGTM: adds end-to-end tests for mmseqs2 workflows.


153-166: Sanity-check side-effect outputs used for comparisons.

"out/search/a.fas" and "out/rbh/a.fas" are not declared outputs; they rely on mmseqs2 side effects. Please confirm these filenames are stable across mmseqs2 versions (to avoid flaky tests).


170-194: LGTM: db tests cover both 'databases' and 'createdb' artifacts.

bio/mmseqs2/db/wrapper.py (1)

13-21: Prefix derivation looks good.

Using commonprefix and rstrip("_") to compute database/output base paths matches the multiext patterns used in tests.

bio/mmseqs2/db/test/Snakefile (2)

1-25: LGTM: sensible outputs and logging for 'databases'.


27-50: LGTM: createdb rule aligns with wrapper contract.

bio/mmseqs2/workflows/wrapper.py (2)

19-30: LGTM: output base handling and format-mode toggles.


31-34: LGTM: command assembly with threads and tmpdir.

bio/mmseqs2/workflows/test/expected/cluster/a_b_rep_seq.fasta (1)

1-4: LGTM: expected FASTA fixture added.

bio/mmseqs2/workflows/test/Snakefile (5)

1-15: LGTM: easy-search rule wiring.


17-32: LGTM: easy-cluster outputs via multiext with stable base.


34-49: LGTM: easy-linclust mirrors cluster interface.


51-81: LGTM: taxonomy target DB multiext matches db wrapper outputs.


83-97: LGTM: RBH rule consistent with search interface.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Nitpick comments (2)
bio/mmseqs2/workflows/wrapper.py (2)

34-37: Quote paths to avoid word splitting; keep optional target unquoted.

Prevents breakage when paths contain spaces, while not forcing an empty target argument.

Apply:

-    shell(
-        "mmseqs {snakemake.params.module} {snakemake.input.query} {target} {out} {tmpdir} --threads {snakemake.threads} {extra} {log}"
-    )
+    shell(
+        'mmseqs {snakemake.params.module} "{snakemake.input.query}" {target} "{out}" "{tmpdir}" --threads {snakemake.threads} {extra} {log}'
+    )

18-18: Track the TODO for arbitrary output names.

If needed, I can help implement support and adjust tests.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 187892a and 2a4ab88.

📒 Files selected for processing (1)
  • bio/mmseqs2/workflows/wrapper.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • bio/mmseqs2/workflows/wrapper.py
**/wrapper.py

⚙️ CodeRabbit configuration file

Do not complain about use of undefined variable called snakemake.

Files:

  • bio/mmseqs2/workflows/wrapper.py
🧠 Learnings (2)
📚 Learning: 2024-11-21T10:23:03.427Z
Learnt from: johanneskoester
PR: snakemake/snakemake-wrappers#3478
File: bio/varlociraptor/estimate-alignment-properties/wrapper.py:5-12
Timestamp: 2024-11-21T10:23:03.427Z
Learning: In the Snakemake wrappers project, avoid suggesting extensive error handling or temporary file management in simple wrapper scripts when it may be unnecessary, to prevent overcomplicating the code.

Applied to files:

  • bio/mmseqs2/workflows/wrapper.py
📚 Learning: 2024-08-21T08:30:42.757Z
Learnt from: johanneskoester
PR: snakemake/snakemake-wrappers#3123
File: utils/datavzrd/wrapper.py:31-32
Timestamp: 2024-08-21T08:30:42.757Z
Learning: In `wrapper.py` scripts, do not flag the use of an undefined variable called `snakemake`.

Applied to files:

  • bio/mmseqs2/workflows/wrapper.py
🪛 Ruff (0.13.1)
bio/mmseqs2/workflows/wrapper.py

10-10: Undefined name snakemake

(F821)


11-11: Undefined name snakemake

(F821)


14-14: Undefined name snakemake

(F821)


16-16: Undefined name snakemake

(F821)


19-19: Undefined name snakemake

(F821)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: testing
  • GitHub Check: docs
🔇 Additional comments (1)
bio/mmseqs2/workflows/wrapper.py (1)

1-3: Module metadata looks good.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
test_wrappers.py (2)

152-160: Remove duplicate dict key in compare_results_with_expected (ruff F601).

The key "out/cluster/a_b_cluster.tsv" appears twice; the latter silently overrides the former. Drop the duplicate to satisfy linting and avoid confusion.

Apply this diff:

             "out/cluster/a_b_rep_seq.fasta": "expected/cluster/a_b_rep_seq.fasta",
             "out/cluster/a_b_all_seqs.fasta": "expected/cluster/a_b_all_seqs.fasta",
-            "out/cluster/a_b_cluster.tsv": "expected/cluster/a_b_cluster.tsv",
             "out/linclust/a_b_rep_seq.fasta": "expected/linclust/a_b_rep_seq.fasta",

137-196: Consider splitting into two tests for clearer failures.

Having workflows and db assertions in one test makes triage harder when one half fails. Suggest test_mmseqs2_workflows() and test_mmseqs2_db().

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2a4ab88 and 11e8805.

📒 Files selected for processing (1)
  • test_wrappers.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • test_wrappers.py
🪛 Ruff (0.13.1)
test_wrappers.py

157-157: Dictionary key literal "out/cluster/a_b_cluster.tsv" repeated

Remove repeated key literal "out/cluster/a_b_cluster.tsv"

(F601)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: testing
  • GitHub Check: docs
🔇 Additional comments (1)
test_wrappers.py (1)

152-166: Verify expected fixture files exist
In test_wrappers.py (around lines 152–166 and 181–193), before comparing outputs, iterate over compare_results_with_expected.values() and assert each expected file exists to avoid brittle CI failures. For example:

for expected_path in compare_results_with_expected.values():
    assert os.path.isfile(expected_path), f"Missing fixture: {expected_path}"

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 11e8805 and 949d65d.

📒 Files selected for processing (1)
  • test_wrappers.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

⚙️ CodeRabbit configuration file

**/*.py: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of the self argument of methods.
Do not suggest type annotation of the cls argument of classmethods.
Do not suggest return type annotation if a function or method does not contain a return statement.

Files:

  • test_wrappers.py
🪛 Ruff (0.14.1)
test_wrappers.py

168-168: Dictionary key literal "out/cluster/a_b_cluster.tsv" repeated

Remove repeated key literal "out/cluster/a_b_cluster.tsv"

(F601)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: docs
  • GitHub Check: testing

fgvieira and others added 4 commits October 27, 2025 13:06
Copy link
Copy Markdown
Member

@dlaehnemann dlaehnemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, this is a big one. Nice job in getting all of the test cases set up, even including expected files.

I have some things I would suggest now, and would then have another look once they are addressed. The most important points are:

  1. Actually implement the TODO for arbitrary output files. I think this is feasible and sooo much better for using the wrappers in workflows. I made a suggestion in the db wrapper.
  2. Actually check that the specified params: module: is a valid choice and throw a useful error otherwise. I made a suggestion in the db wrapper, but the the error message could maybe be even more detailed, maybe with a linkout to the docs.
  3. The meta.yaml files should list all of the available modules and link out to respective sections in the documentation. This makes it so much easier to find the right place to look for how to use a module.
  4. Maybe directly update to the latest mmseqs2 release available on bioconda.

@fgvieira fgvieira requested a review from dlaehnemann February 25, 2026 17:42
Copy link
Copy Markdown
Member

@dlaehnemann dlaehnemann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor points. So feel free to merge without another review, once you have gone through them.

fgvieira and others added 5 commits February 27, 2026 13:46
@fgvieira fgvieira enabled auto-merge (squash) February 27, 2026 14:06
@fgvieira fgvieira merged commit d97a2a8 into snakemake:master Feb 27, 2026
8 checks passed
johanneskoester pushed a commit that referenced this pull request Mar 3, 2026
🤖 I have created a release *beep* *boop*
---


##
[9.3.0](v9.2.0...v9.3.0)
(2026-02-27)


### Features

* Add mmseqs2 main workflows
([#3981](#3981))
([d97a2a8](d97a2a8))


### Performance Improvements

* autobump bio/alignoth/environment.yaml
([#4993](#4993))
([6894aa1](6894aa1))
* autobump bio/bbtools/environment.yaml
([#4995](#4995))
([04129d2](04129d2))
* autobump bio/bismark/bam2nuc/environment.yaml
([#4996](#4996))
([2f846f7](2f846f7))
* autobump bio/bismark/bismark_genome_preparation/environment.yaml
([#5004](#5004))
([693441d](693441d))
* autobump bio/bismark/bismark_methylation_extractor/environment.yaml
([#5000](#5000))
([b213a05](b213a05))
* autobump bio/bismark/bismark/environment.yaml
([#4997](#4997))
([e3c8b09](e3c8b09))
* autobump bio/bismark/bismark2bedGraph/environment.yaml
([#5001](#5001))
([117fe2b](117fe2b))
* autobump bio/bismark/bismark2report/environment.yaml
([#4999](#4999))
([4f17be2](4f17be2))
* autobump bio/bismark/bismark2summary/environment.yaml
([#5005](#5005))
([3f95f0c](3f95f0c))
* autobump bio/bismark/deduplicate_bismark/environment.yaml
([#4998](#4998))
([ad4d0af](ad4d0af))
* autobump bio/bowtie2/align/environment.yaml
([#5003](#5003))
([47cd070](47cd070))
* autobump bio/bowtie2/build/environment.yaml
([#5002](#5002))
([1ef441f](1ef441f))
* autobump bio/deseq2/deseqdataset/environment.yaml
([#5006](#5006))
([9c7740c](9c7740c))
* autobump bio/mosdepth/environment.yaml
([#5009](#5009))
([eb9b820](eb9b820))
* autobump bio/trim_galore/pe/environment.yaml
([#5014](#5014))
([1747f10](1747f10))
* autobump bio/trim_galore/se/environment.yaml
([#5016](#5016))
([e8e2ff9](e8e2ff9))
* autobump bio/tximport/environment.yaml
([#5015](#5015))
([4432c52](4432c52))
* autobump phys/root/define_columns/environment.yaml
([#5013](#5013))
([0c34711](0c34711))
* autobump phys/root/filter/environment.yaml
([#5011](#5011))
([b271a89](b271a89))
* autobump phys/root/hadd/environment.yaml
([#5010](#5010))
([68c03a1](68c03a1))
* autobump phys/root/rootcp/environment.yaml
([#5012](#5012))
([1893e02](1893e02))
* autobump utils/datavzrd/environment.yaml
([#5007](#5007))
([50d0e18](50d0e18))
* autobump utils/miller/environment.yaml
([#5008](#5008))
([d0e29f7](d0e29f7))

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).
@fgvieira fgvieira deleted the mmseqs2 branch March 18, 2026 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants