Add propagate to predict prophage activity #110

CarsonJM · 2024-04-18T20:46:19Z

PR checklist

… propagate

CarsonJM · 2024-04-22T22:29:48Z

Move the test data to nf-core/test-datasets

adamrtalbot

Nothing deal breaking:

Avoid building pandas DataFrames by concatenating inside a loop: each concat copies everything accumulated so far, so time and memory use grow quadratically for bigger samples.
There are a few Groovy constructs inside map statements that I think can be done better. They might cause problems with state shared between closures.
The Nextflow itself is good.

I think these are improvements but none are blocking.

adamrtalbot · 2024-04-26T13:18:52Z

bin/derep_coordinates.py

This feels like a big overhead for what it's doing. Is it just checking the FASTA header is in the TSV? We should be able to streamline this...

adamrtalbot · 2024-04-26T13:36:31Z

bin/extract_proviruses.py

+    # identify checkv provirus coordinates
+    checkv_coords = pd.DataFrame()
+    for index, row in checkv_proviruses.iterrows():
+        contig_id = row['contig_id']
+        region_types = row['region_types'].split(',')
+        provirus_count = 0
+        # parse though regions for each contig
+        for i in range(len(region_types)):
+            if region_types[i] == 'viral':
+                # if a region is viral, extract contig name, assign a provirus id, and add checkv start/end coords
+                provirus_info = pd.DataFrame()
+                provirus_count += 1
+                provirus_info['seq_name'] = [contig_id]
+                provirus_info['provirus_id'] = [contig_id + '|checkv_provirus_' + str(provirus_count)]
+                provirus_info[['provirus_start', 'provirus_stop']] = [row['region_coords_bp'].split(',')[i].split('-')]
+                checkv_coords = pd.concat([checkv_coords, provirus_info], axis=0)


This is a fairly inefficient way of creating a DataFrame, because every pd.concat inside the loop copies the rows accumulated so far. Better to create a list of DataFrames, then concat once:

pd.concat([checkProvirusCoords(row) for _, row in checkv_proviruses.iterrows()])

Not a blocker because I doubt this is slow - but worth noting.
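
A minimal sketch of the "collect first, build once" pattern for the hunk above. The helper names `checkv_provirus_records` and `build_checkv_coords` are hypothetical and not part of the PR; the column names are taken from the quoted diff. Coordinates are kept as strings, as in the original code.

```python
def checkv_provirus_records(contig_id, region_types, region_coords_bp):
    """Return one dict per 'viral' region of a single CheckV contig row.

    Hypothetical helper: the argument names mirror the columns used in the
    quoted diff (contig_id, region_types, region_coords_bp).
    """
    records = []
    provirus_count = 0
    types = region_types.split(',')
    coords = region_coords_bp.split(',')
    for region_type, coord in zip(types, coords):
        if region_type != 'viral':
            continue
        provirus_count += 1
        start, stop = coord.split('-')
        records.append({
            'seq_name': contig_id,
            'provirus_id': f'{contig_id}|checkv_provirus_{provirus_count}',
            'provirus_start': start,
            'provirus_stop': stop,
        })
    return records


def build_checkv_coords(checkv_proviruses):
    """Build the coordinates DataFrame with a single constructor call."""
    # Imported here so the parsing helper above stays usable without pandas.
    import pandas as pd

    records = []
    for row in checkv_proviruses.itertuples(index=False):
        records.extend(
            checkv_provirus_records(row.contig_id, row.region_types, row.region_coords_bp)
        )
    columns = ['seq_name', 'provirus_id', 'provirus_start', 'provirus_stop']
    return pd.DataFrame(records, columns=columns)
```

Collecting plain dicts and constructing the DataFrame once avoids the repeated copying that `pd.concat` performs on every loop iteration.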

adamrtalbot · 2024-04-26T13:51:53Z

bin/extract_proviruses.py

+            else:
+                provirus_coords['fragment'] = row['seq_name'] + '|provirus_' + str(provirus_coords['start'][0]) + '_' + str(provirus_coords['stop'][0])
+            # concatenate all provirus coordinates
+            provirus_combined_coords = pd.concat([provirus_combined_coords, provirus_coords], axis=0)


adamrtalbot · 2024-04-26T13:54:13Z

subworkflows/local/fasta_cluster_blast/tests/main.nf.test



-    test("fasta.gz") {
+    test("fasta.gz + 95 + 0 + 85") {


Could you use a clearer name here, e.g. one that spells out the params?

Suggested change

test("fasta.gz + 95 + 0 + 85") {

test("fasta.gz, anicluster_min_ani = 95, anicluster_min_qcov = 0, anicluster_min_tcov = 85") {

subworkflows/local/fastq_fasta_provirus_activity_propagate/main.nf

adamrtalbot · 2024-04-26T14:33:53Z

bin/propagate.py

+        except IndexError:
+            pass
+
+        sys.stderr.write("\nError: -v coordinates file is formatted incorrectly. See README for details. Exiting.\n")


Using the logging module would be more appropriate than writing to sys.stderr directly.

Or, if it's an error, raise YourException("message").
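
A rough sketch of what that could look like. The exception class `CoordinatesFormatError`, the function `split_coordinate_fields`, and the `min_fields=3` default are all assumptions made for illustration; the actual layout of the -v coordinates file is not shown in this hunk.

```python
import logging

logger = logging.getLogger("propagate")


class CoordinatesFormatError(ValueError):
    """Raised when the -v coordinates file cannot be parsed (hypothetical name)."""


def split_coordinate_fields(line, min_fields=3):
    """Split one tab-separated line, raising instead of writing to stderr.

    min_fields is an assumed value for illustration only.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < min_fields:
        message = "-v coordinates file is formatted incorrectly. See README for details."
        logger.error(message)
        raise CoordinatesFormatError(message)
    return fields
```

With this shape the caller decides whether to recover, re-raise, or exit, and the message goes through whatever handlers the logging configuration defines.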

adamrtalbot · 2024-04-26T14:37:13Z

bin/propagate.py

+if any(exist):
+    exit(1)
+
+vibe_header = prophages_check(vibe)
+if not vibe_header:
+    exit(1)
+
+# verify inputs
+check = [samfile, bamfile, forward, interleaved, unpaired]
+check = [c for c in check if c != '']
+if len(check) > 1 or not check:
+    sys.stderr.write(f"\nOnly one input file (-s, -b, -r, -i, -u) is allowed. {len(check)} provided. Exiting.\n")
+    exit(1)
+
+if forward and reverse:
+    if not forward.endswith('.fastq') and not forward.endswith('.fastq.gz'):
+        sys.stderr.write("\nError: Provided paired reads files must both have the extension .fastq or .fastq.gz. Exiting.\n")
+        sys.stderr.write(f"{forward}\n")
+        exit(1)
+    if not reverse.endswith('.fastq') and not reverse.endswith('.fastq.gz'):
+        sys.stderr.write("\nError: Provided paired reads files must both have the extension .fastq or .fastq.gz. Exiting.\n")
+        sys.stderr.write(f"{reverse}\n\n")
+        exit(1)
+if interleaved:
+    if not interleaved.endswith('.fastq') and not interleaved.endswith('.fastq.gz'):
+        sys.stderr.write("\nError: Provided interleaved reads file must have the extension .fastq or .fastq.gz. Exiting.\n")
+        sys.stderr.write(f"{interleaved}\n\n")
+        exit(1)
+if unpaired:
+    if not unpaired.endswith('.fastq') and not unpaired.endswith('.fastq.gz'):
+        sys.stderr.write("\nError: Provided unpaired reads file must have the extension .fastq or .fastq.gz. Exiting.\n")
+        sys.stderr.write(f"{unpaired}\n\n")
+        exit(1)
+
+if samfile:
+    if not_exist(samfile, 'sam file'):
+        exit(1)
+    if not bamfile.endswith('.sam'):
+        sys.stderr.write("\nError: Provided sam file must have the extension .sam. Exiting.\n")
+        exit(1)
+if bamfile:
+    if not_exist(bamfile, 'bam file'):
+        exit(1)
+    if not bamfile.endswith('.bam'):
+        sys.stderr.write("\nError: Provided bam file must have the extension .bam. Exiting.\n")
+        exit(1)
+
+if effect < 0.6:
+    sys.stderr.write("\nError: Cohen's d effect size (-e) should not be set below 0.6. Exiting.\n")
+    exit(1)
+if min_breadth > 1:
+    sys.stderr.write("\nError: breadth (--breadth) should be a decimal value <= 1. Exiting.\n")
+    exit(1)
+if ratio_cutoff < 1.5:
+    sys.stderr.write("\nError: ratio cutoff (-c) should not be set below 1.5. Exiting.\n")
+    exit(1)
+if read_id > 1:
+    sys.stderr.write("\nError: percent identity (-p) should be a decimal value <= 1. Exiting.\n")
+    exit(1)
+read_id = 1.0 - read_id


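Better to raise an error instead of calling exit(): an exception carries its message, can be caught or tested by the caller, and still produces a non-zero exit status if left unhandled. exit() terminates the interpreter immediately, and the bare exit builtin is intended for interactive sessions rather than scripts.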
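
One possible shape for the checks above, as a sketch only. `InputError`, `check_fastq_extension`, and `check_thresholds` are hypothetical names; the limits and messages are copied from the quoted diff.

```python
FASTQ_EXTENSIONS = ('.fastq', '.fastq.gz')


class InputError(ValueError):
    """Raised for invalid command-line inputs (hypothetical name)."""


def check_fastq_extension(path, label):
    """Raise InputError unless path ends with .fastq or .fastq.gz."""
    # str.endswith accepts a tuple of suffixes, so one call covers both cases.
    if not path.endswith(FASTQ_EXTENSIONS):
        raise InputError(
            f"Provided {label} reads file must have the extension "
            f".fastq or .fastq.gz: {path}"
        )


def check_thresholds(effect, min_breadth, ratio_cutoff, read_id):
    """Validate numeric options and return the mismatch fraction (1 - read_id)."""
    if effect < 0.6:
        raise InputError("Cohen's d effect size (-e) should not be set below 0.6.")
    if min_breadth > 1:
        raise InputError("breadth (--breadth) should be a decimal value <= 1.")
    if ratio_cutoff < 1.5:
        raise InputError("ratio cutoff (-c) should not be set below 1.5.")
    if read_id > 1:
        raise InputError("percent identity (-p) should be a decimal value <= 1.")
    return 1.0 - read_id
```

A single `try`/`except InputError` around these calls at the script's entry point can then log the message and call `sys.exit(1)` once, rather than exiting from many places.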

CarsonJM and others added 12 commits March 22, 2024 21:55

Started implementing propagate

6bd49dc

Merge branch 'dev' of https://github.com/CarsonJM/phageannotator into…

e0fe7a7

… propagate

Propagate subworkflow running

9c90153

Merge branch 'dev' into propagate

7c0c30f

Started clean up of workflow and propagate files

853d590

all propagate module tests completed

705e975

Propagate subworkflow test implemented

d1284a1

Updated snapshots to include propagate output

c8f2123

Removed https proxy code

63338b9

Updated schema to include propagate params

b329576

Updated tests for changed clustering modules/subworkflows

2de7602

Fixed propagate log file snapshot

ba008c4

CarsonJM added 2 commits April 24, 2024 16:43

Replaced local test datasets with nf-core/test-datasets

59e53ff

Fixed path to test data

9b5f6cb

adamrtalbot approved these changes Apr 26, 2024
