Tileseq Best Practices
This document describes best practices and example workflows for tileseq. For setup instructions regarding the tileseq pipeline, see here.
The tileseq pipeline uses the presence of technical replicates to calculate the amount of experimental error around a given measurement. By "technical replicates" we mean replicates that were split after the transformation/transfection stage but underwent separate tiling PCRs.
Without at least two technical replicates for each condition (including the WT-control!), the tileseq pipeline will be unable to calculate error and most filters will not work correctly. Ideally, we suggest using 3 or 4 technical replicates, as this unlocks the ability to use quorum filtering, i.e. the pipeline's ability to identify and disregard broken or bottlenecked replicates.
If you also have biological replicates (i.e. replicates that underwent separate transformations/transfections), you will want to treat these as separate sets of conditions in the parameter sheet. This results in separately scaled maps, which can then be merged at the very end using the mergeMaps script.
Each tile in your run should come with two or more WT control replicate samples, which serve to quantify sequencing error. If your screen is split across multiple sequencing runs, a WT control for each separate run should be included, as it is meant to control for systematic error in a given run.
Without multiple WT control replicates, systematic PCR/sequencing errors cannot be filtered out correctly!
Typically, in PopCode libraries with lambdas (average number of mutations per clone) around 2, variants usually occur at marginal frequencies of
With the results of the library QC in hand it is worth examining the distribution of marginal counts in the library. A visualization of the distribution can be found in the legend of the coverage map on the QC report. It typically resembles a bell curve. We want to pick a sequencing depth that can still capture the lower tail of the distribution. For example, let's assume we want to capture variants as rare as
Remember to factor in Illumina's quality filter, which can substantially reduce the number of usable reads. For low-complexity libraries on the NextSeq 500, this can sometimes be up to 50%, so it is good practice to aim for twice the depth you're hoping for.
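The depth calculation above can be sketched as a quick back-of-the-envelope computation. All numbers here are illustrative placeholders (target frequency, desired read support, and quality-filter loss), not recommendations:

```shell
# Back-of-the-envelope sequencing depth calculation.
# All numbers below are illustrative placeholders, not recommendations.
awk 'BEGIN {
  freq = 1/200000          # rarest marginal frequency we want to capture
  reads_per_variant = 100  # desired read support at that frequency
  pass_rate = 0.5          # fraction of reads surviving the quality filter
  printf "%.0f reads\n", reads_per_variant / freq / pass_rate
}'
# prints: 40000000 reads
```

Dividing by the pass rate implements the "aim for twice the depth" advice for a 50% quality-filter loss.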
Libraries with low complexity can result in higher error rates on Illumina machines; we thus recommend spiking in a higher amount of $\Phi$X DNA, using a staggered primer design, or both.
When designing your sample sheet, it is recommended that you use descriptive names for your samples instead of just giving them numbers. For example, name your sample Tile01NonselectRep1 instead of S102. This will make it a lot easier in the future for other people to go back to your data without having to decipher your numbering system.
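A consistent naming scheme can also be generated rather than typed by hand; the tiles, conditions, and replicate counts below are placeholders for your own design:

```shell
# Emit descriptive sample names following the Tile<NN><Condition>Rep<N>
# pattern; tiles, conditions, and replicates here are placeholders.
for tile in 01 02; do
  for cond in Nonselect Select WT; do
    for rep in 1 2; do
      echo "Tile${tile}${cond}Rep${rep}"
    done
  done
done
# first line printed: Tile01NonselectRep1
```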
After your run has completed, please store the run folder on rothseq1 at /export/rawseq/nextseq/. In that same folder you should also find a README file. Please add the name and description of your run to the file, so we can identify it later.
When running bcl2fastq use the --no-lane-splitting flag to ensure that only one FASTQ file per sample (and read direction) will be generated.
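For example, a minimal bcl2fastq invocation might look like the following; the run folder and output paths are placeholders only:

```shell
# Example bcl2fastq invocation with lane splitting disabled.
# Run folder and output directory are illustrative placeholders.
bcl2fastq \
    --runfolder-dir /export/rawseq/nextseq/210517_NB501234_0099_AHxxxxx \
    --output-dir ~/tileseq/SUMO1/FASTQ \
    --no-lane-splitting
```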
Please archive your fastq files in the lab's common folder at /home/rothlab/common/seq_data/tileseq/
The parameter sheet defines the complete set of parameters needed by the computational workflow and also serves as documentation of the analysis. Inspired by Illumina's sample sheets, the parameter sheet is designed as a spreadsheet following a specific template. A full description of the parameter sheet can be found in the tileseqMave manual and the up-to-date template can be found here.
If you followed the advice about giving descriptive names to your sequencing samples earlier, this will make the job of editing the sample section in your parameter sheet a lot easier.
Export your spreadsheet to CSV format and then use the pipeline's csv2json converter to convert it into a JSON file and validate it. For example, if your csv file is called SUMO1_parameters.csv :
tsm csv2json SUMO1_parameters.csv -o SUMO_parameters.json
If there are any problems with your parameter sheet, they should be flagged here. The converted JSON file will be written to whichever filename you provide with the -o argument.
I recommend creating a workspace folder in your home directory, for example tileseq/. Inside of this folder you may want to place subdirectories for each of your tileseq projects. As an example, let's assume we're processing a tileseq run for the gene SUMO1. Let's create a project folder.
cd ~/tileseq/
mkdir SUMO1
cd SUMO1/
Inside this folder we can now gather all the resources we need: the parameter sheet JSON file and the FASTQ files. As the FASTQ files take a lot of disk space, I recommend placing a symbolic link instead of copying them.
#create symbolic link to FASTQ folder
ln -s /home/rothlab/common/seq_data/tileseq/SUMO1/SUMO1_FASTQ_joint FASTQ
#copy parameter sheet
cp ~/SUMO1_parameters.json .
mutcount is the first part of the tileseq workflow. It aligns the tileseq reads against the template and then performs variant calling and counting steps on those alignments.
mutcount should be submitted as an HPC job to the cluster, as it has long runtimes. If you have not yet familiarized yourself with the cluster submission guidelines, please read them now.
Instead of directly running the submission command, it is useful to create a short wrapper script. This serves as convenient documentation of exactly what was run. You can use an editor such as nano, vim, or emacs. If you are not familiar with any of these editors, I suggest nano, as it is the most beginner-friendly. Here's a quick tutorial video on using nano.
Here's an example submission script:
#!/bin/bash
#this requests 2 cores, 4GB of RAM and 4 days of runtime
#redirect standard out and standard error streams to the file mutcount.log
submitjob.sh -n mutcount -c 2 -m 4G -t 4-00:00:00 -l mutcount.log -e mutcount.log -- \
tsm mutcount -p ~/tileseq/SUMO1/SUMO1_parameters.json \
-o ~/tileseq/SUMO1/ -f ~/tileseq/SUMO1/SUMO1_FASTQ_joint
As you may have noticed, this script instructs the job to write outputs and errors directly to a log file, e.g. mutcount.log.
Let's save the file as runSUMO.sh. You can now execute it with bash runSUMO.sh. If your submission was successful, the job should now be listed in the Slurm queue: squeue -u $USER
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1855082 guest mutcount jweile R 00:00:01 1 galen61
You will also see that output is being written to the log file we specified. You can thus follow the progress live using
tail -f mutcount.log
(To stop following the file, you can press CTRL-C.)
The execution will proceed in multiple phases: (1) the FASTQ validation phase; (2) the alignment phase; (3) the alignment validation phase; (4) the Q-score calibration phase; (5) the variant calling phase; and (6) the cleanup phase. During phases 2 and 5, mutcount will submit worker jobs to the cluster and then collate the results. Each worker job has its own submission script and log file.
You will also see that an output directory has been created, named according to the experiment title in the parameter sheet plus a timestamp. (For example SUMO1_2021-05-17-16-54-02). The directory may have the following contents:
calibrations/ -- A temporary directory containing intermediate data for the calibration step. Will be deleted during the cleanup step.
sam_files/ -- A directory containing the alignment outputs and log files. The contents will be compressed into tarballs during the cleanup step.
ref/ -- A directory containing alignment reference libraries
<name>_mutcount/ -- The output directory for phase 2 (i.e. variant calling)
Inside the <name>_mutcount/ directory you should find the outputs of the variant calling phase:
*_varcall.log -- Log files for the variant caller jobs
counts_sample_*.csv -- Variant counts for a given sample
coverage_*.csv -- Sequencing coverage information for a given sample
*_calibrate_phred.csv -- PHRED score calibration output for a given WT sample
This part of the analysis is performed using TileseqMave. Overall we will be performing the following steps:
- joinCounts : Compile the individual count files from all samples, normalize them, and calculate marginal frequencies.
- libraryQC : Generate PDF plots for QC report on the mutagenized library
- calcEnrichment: Calculate enrichment ratios (logPhi) across conditions for all variants, calculate measurement error and apply filters.
- selectionQC: Generate PDF plots for QC report on the selection assay.
- scaleScores: Re-scale enrichment ratios based on pivot points.
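The core idea behind the calcEnrichment step above can be sketched as a small calculation: WT-control frequencies (representing sequencing error) are subtracted from the marginal frequencies of each condition before the log ratio is taken. This is a simplified illustration with made-up numbers; the real calcEnrichment also propagates measurement error across replicates and applies filters.

```shell
# Simplified sketch of an enrichment ratio (logPhi). All frequencies
# below are toy numbers; variable names are illustrative, not the
# pipeline's actual code.
awk 'BEGIN {
  f_select    = 1e-5   # variant frequency under selection
  f_nonselect = 1e-4   # variant frequency without selection
  f_wt_sel    = 1e-6   # WT-control frequency (sequencing error), select
  f_wt_non    = 1e-6   # WT-control frequency, non-select
  phi = (f_select - f_wt_sel) / (f_nonselect - f_wt_non)
  printf "logPhi = %.2f\n", log(phi)/log(10)
}'
# prints: logPhi = -1.04
```

A negative logPhi, as in this toy example, indicates a variant depleted by the selection.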
Detailed instructions on how to run all of these components can be found in the tileseqMave readme file.
Some tips to make running tileseqMave easier: Each module/script step in tileseqMave requires the parameter sheet and the workspace directory as input. We can either supply the paths to these via the -p and -w parameters, or alternatively, we can make use of the built-in default assumptions. The default workspace is assumed to be the current working directory of the terminal (wherever we cd to), and the default parameter sheet is assumed to be called parameters.json inside that working directory. So instead of specifying tsm joinCounts -c 8 -p ~/tileseq/SUMO1/SUMO1_parameters.json -w ~/tileseq/SUMO1/SUMO1_2021-05-17-16-54-02/, we could make our life easier by running:
cd ~/tileseq/SUMO1/SUMO1_2021-05-17-16-54-02/
cp ../SUMO1_parameters.json parameters.json
tsm joinCounts -c 8
After running runLibraryQC or runSelectionQC, a QC output folder will have been generated that contains a number of PDF files with plot visualizations. To consolidate all of these plots into fewer multi-page PDFs, we can use the condenseQC script.
If the dataset you're analyzing is only a Library QC run, you will only need to run the first two components: joinCounts and runLibraryQC (and optionally condenseQC)
#!/bin/bash
cd ~/tileseq/SUMO1/SUMO1_2021-05-17-16-54-02/
cp ../SUMO1_parameters.json parameters.json
submitjob.sh -n tileseqMave -c 8 -m 16G -t 4-00:00:00 -l tsm.log -e tsm.log -- \
'tsm joinCounts -c 8 && tsm runLibraryQC -c 8 && tsm condenseQC'
If the dataset is a full screen, then we'll need to run all steps up to the selection QC, after which we can inspect the QC results to select useful pivot points.
#!/bin/bash
cd ~/tileseq/SUMO1/SUMO1_2021-05-17-16-54-02/
cp ../SUMO1_parameters.json parameters.json
submitjob.sh -n tileseqMave -c 8 -m 16G -t 4-00:00:00 -l tsm.log -e tsm.log -- \
'tsm joinCounts -c 8 && tsm runLibraryQC -c 8 && tsm calcEnrichment -c 8 && tsm runSelectionQC && tsm condenseQC'
Just like the current version, legacy TileseqMut should be submitted as an HPC job to the cluster, as it has long runtimes. If you have not yet familiarized yourself with the cluster submission guidelines, please read them now.
Instead of directly running the submission command, it is useful to create a short wrapper script. This serves as convenient documentation of exactly what was run. You can use an editor such as nano, vim, or emacs. If you are not familiar with any of these editors, I suggest nano, as it is the most beginner-friendly. Here's a quick tutorial video on using nano.
Here's an example submission script:
#!/bin/bash
#this requests 8 cores, 16GB of RAM and 4 days of runtime
#redirect standard out and standard error streams to the file tileseq.log
submitjob.sh -n tileseqMut -c 8 -m 16G -t 4-00:00:00 -l tileseq.log -e tileseq.log -- \
tileseq_mut -p ~/tileseq/SUMO1/SUMO1_parameters.csv \
-o ~/tileseq/SUMO1/ -f ~/tileseq/SUMO1/SUMO1_FASTQ_joint \
--name SUMO1 --environment galen -c 6 --calibratePhredWT
As you may have noticed, this script instructs the job to write outputs and errors directly to a log file, e.g. tileseq.log.
Let's save the file as runSUMO.sh. You can now execute it with bash runSUMO.sh. If your submission was successful, the job should now be listed in the Slurm queue: squeue -u $USER
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1855082 guest tileseqMut jweile R 00:00:01 1 galen61
You will also see that output is being written to the log file we specified. You can thus follow the progress live using
tail -f tileseq.log
(To stop following the file, you can press CTRL-C.)
The execution will proceed in two phases: (1) The alignment phase; and (2) the variant calling phase. During each phase, tileseqMut will submit worker jobs to the cluster and then collate the results. Each worker job has its own submission script and log file.
You will also see that an output folder has been created, named according to the --name argument of our submission command and a timestamp. For example: SUMO1_2021-05-17-16-54-02. It will have the following contents:
args.log -- a log file that documents the submission parameters
GALEN_jobs -- A directory containing the alignment worker job scripts
main.log -- log messages from the phase 1 master process
mutcount_main.log -- log messages from the phase 2 master process
ref -- A directory containing alignment reference libraries
sam_files -- A directory containing the alignment output and aligner log files.
<name>_mutcount/ -- The output directory for phase 2 (i.e. variant calling)
Inside the <name>_mutcount/ directory you will find the outputs of phase 2:
GALEN_jobs/ -- A directory containing the variant calling worker job scripts and log files
counts_sample_*.csv -- Variant counts for a given sample
coverage_*.csv -- Sequencing coverage information for a given sample
*_phred.log -- log file for PHRED score calibration on a given WT sample
*_calibrate_phred.csv -- PHRED score calibration output for a given WT sample
Sometimes, individual worker jobs fail. The most common reason is insufficient memory. You can diagnose this by inspecting the worker job log files.
If this happens for many jobs, it may be worth re-running the entire run with increased memory requests for worker jobs, using the -mm parameter. Otherwise, you can just re-run the failed jobs individually with more memory by editing their submission scripts (in the appropriate GALEN_jobs folder).
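A quick way to spot memory-related failures is to search the worker logs for out-of-memory messages. The log location and message wording below are assumptions (Slurm typically reports oom-kill events); the demo creates a stand-in log file so the commands are self-contained:

```shell
# Demo: scan worker-job logs for out-of-memory kills.
# Log location and message wording are assumptions; adjust to your run.
mkdir -p GALEN_jobs
echo 'slurmstepd: error: Detected 1 oom-kill event(s)' > GALEN_jobs/Mut_count_18.log  # stand-in log for the demo

grep -ril -e 'out of memory' -e 'oom-kill' GALEN_jobs || echo 'no OOM messages found'
# prints: GALEN_jobs/Mut_count_18.log
```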
To edit the file you can use nano again. Look for the line that says
#SBATCH --mem=25G
Increase the value (for example to 35G) and save the file. Keep in mind that not all our slurm nodes have that much memory. Here's an overview of how many of our nodes currently have how much RAM:
RAM(GB) count
24 15
56 22
64 10
128 6
256 2
386 8
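Instead of editing the script interactively in nano, the memory request can also be bumped with sed. The script path below is an example only (pick the job that actually failed); the demo creates a stand-in script so the commands are self-contained:

```shell
# Demo: raise the memory request in a worker-job script from 25G to 35G.
# The script path is an example; in practice, edit the job that failed.
mkdir -p GALEN_jobs
printf '#!/bin/bash\n#SBATCH --mem=25G\n' > GALEN_jobs/Mut_count_18.sh  # stand-in script for the demo

sed -i 's/--mem=25G/--mem=35G/' GALEN_jobs/Mut_count_18.sh
grep -- '--mem' GALEN_jobs/Mut_count_18.sh
# prints: #SBATCH --mem=35G
```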
After editing the script, you can resubmit the job using sbatch. For example:
sbatch SUMO1_2021-05-17-16-54-02/SUMO1_2021-05-20-13-17-07_mut_count/GALEN_jobs/Mut_count_18.sh