[Barseq] Analyzing Barseq data
Even though we use PacBio Hifi runs, we ideally request the raw CLR (continuous-long-read) data for each run, as this enables us to run DeepConsensus for improved circular consensus read quality. If you don't have the raw CLR data, you'll have to contend with the default CCS (circular consensus sequence) result, which is slightly more prone to false positive indel errors.
- You will receive an email from TCAG notifying you that your data transfer is ready. Write down the data transfer ID from the email. It should be a short alphanumeric code such as `VIOZRM9`. Also write down your TCAG username and password, as we will need them later.
- Log into rothseq1 (`ssh [email protected]`). If you're located at the Donnelly Center or working from home, you'll need to turn on the LTRI VPN first.
- If you haven't installed the TCAG client software on rothseq1 yet, you can copy the binary from Jochen's bin directory: `cp /home/rothlab/jweile/bin/tcag-client-1.4.2 ~/bin/`. (This assumes you already have a `bin/` directory and that it's on your `$PATH`; if not, revisit the Galen setup.)
- Go to the pacbio directory in the rothlab common folder: `cd /home/rothlab/common/seq_data/pacbio/`
- In that directory you will find gene-specific subdirectories. Go to the one applicable to your project (`cd LDLR/`) or create a new one if necessary (`mkdir LDLR && cd LDLR`).
- We will now download the dataset using the TCAG client. However, since this can potentially take days, we want to dissociate this process from the terminal so that it continues running even after we log out. Normally, we would use `nohup` for such a situation. However, the TCAG client still requires some manual keyboard input for certain prompts early on, so we'll have to use a workaround: we start the process with an output redirection to a log file, use a second ssh connection to monitor the contents of the log file to see the prompts we need to answer, and only after all the prompts have been answered do we dissociate the process from the terminal.
  - Start a second ssh session from a secondary terminal (`ssh [email protected]`) and go to the same directory: `cd /home/rothlab/common/seq_data/pacbio/LDLR/`.
  - Back in the first window, start the TCAG client download with an output redirection to a file we will call `tcag.log`. In the example here, I'm using the transfer ID VIOZRM9, but you'll need to replace this with your own transfer ID: `tcag-client-1.4.2 download -p VIOZRM9:/ > tcag.log 2>&1`.
  - Switch back to the second terminal to display the prompts from the client that are being redirected into the log file: `tail -f tcag.log`
  - Switch back to the first terminal again and answer the prompts for your username and password there (the text will appear in the second window even as you type in the first).
  - After all the prompts have been answered, the download will start. Now we can move the process to the background and dissociate it from the terminal. To do so, first suspend the process using `CTRL-Z`. At the shell prompt that appears, type `bg` to resume the process in the background. You will see the download spring back to life in the second terminal. Now all that is left to do is the terminal dissociation: `disown -h %1`.
  - You can now log out of the first terminal (`exit`), while the process continues running, as evidenced by the continued output appearing in the log file in the second window.
  - You can either stop following the log file now (`CTRL-C`) and log out (`exit`), or continue following the download if you like.
- After a few days, the download should be complete. You can check on it periodically by looking at the log file (via `tail tcag.log`).
- Inspect the files. The folder structure will be a bit convoluted: the top folder should be named according to the transfer number; inside that will be another folder named after the TCAG order ID; within that will be the different sub-projects (e.g. for different genes); and finally within those should be folders for CLR and CCS results (depending on what was ordered). Each CLR folder should contain a file ending in `.subreads.bam`. This is the set of full-length CLR reads, which can be processed with DeepConsensus. If you don't have the raw CLR data, you'll just have a file ending in `reads.bam`, which contains the processed CCS data.
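If you're not sure where the relevant BAM files ended up within the nested transfer folders, a quick search can list them (a minimal sketch; run it from your gene-specific project directory):
#list all raw CLR and processed CCS BAM files under the current directory
find . -name "*.subreads.bam" -o -name "*.reads.bam"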
If you used index tags to multiplex your PacBio library, you'll need to use the `lima` tool to demultiplex your reads. Depending on how the barcoding was done, the demultiplexing will need to be run with different parameters. Consult the Lima Manual and `lima --help` for more details. Most importantly, find out whether each sample is meant to have identical upstream and downstream tags, or whether they are different (which must then be reflected by using the `--same` or `--different` flag, respectively).
- Install `lima` via conda:
conda install -c bioconda lima
- Make a folder in which you would like to store the results of the demultiplexing process:
#this is obviously just an example, feel free to come up with your own file system organization.
mkdir -p ~/data/pacbio/LDLR202301 && cd ~/data/pacbio/LDLR202301
#softlink the CLR subreads BAM file. (you could point to the file directly instead, but this helps understand older results better in the future)
ln -s /home/rothlab/common/seq_data/pacbio/LDLR/6F3KPHW/COT22931.20230213/LDLR-R05/CLR/m64308e_230206_211453.subreads.bam
- Place a FASTA file containing the index tags in the folder. The default indices can be found here.
- Submit your LIMA job to the cluster. Consult the Lima Manual for more details. Here we assume that the default barcodes were used and that the upstream and downstream barcodes must be identical.
submitjob.sh -n lima -c 12 -m 24 -t 1-00:00:00 -l lima.log -e lima.log --
lima m64308e_230206_211453.subreads.bam Sequel_RSII_96_barcodes_v1.fasta m64308e_230206_211453.subreads.demux.bam --same --min-score 80 --num-threads 12 --split-named
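If, instead, each sample carries different upstream and downstream barcodes, the same command can be run with --different in place of --same. This is just a sketch under that assumption; check the Lima Manual for the parameters that match your actual barcoding design:
#hypothetical variant of the above command for asymmetric (different up/down) barcoding
submitjob.sh -n lima -c 12 -m 24 -t 1-00:00:00 -l lima.log -e lima.log -- lima m64308e_230206_211453.subreads.bam Sequel_RSII_96_barcodes_v1.fasta m64308e_230206_211453.subreads.demux.bam --different --min-score 80 --num-threads 12 --split-named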
Even if you only have CCS data to work with, it's still worth installing pacbioTools to make the process of quality filtering easier.
- Log into `galen`.
- If you haven't installed `pacbioTools` yet, do so now:
#download pacbio tools scripts
mkdir -p ~/downloads && cd ~/downloads
wget https://github.com/rothlab/pacbioTools/archive/refs/heads/main.zip
unzip main.zip
#copy scripts to bin folder (assuming that $HOME/bin is on your $PATH)
cp pacbioTools-main/*.sh* ~/bin/
- Use conda to install the `pbccs`, `actc`, and `pbbam` tools:
conda install -c bioconda actc pbccs pbbam
- Install DeepConsensus:
pip install deepconsensus[cpu]==1.2.0
DeepConsensus also needs access to the underlying deep learning model. The files for this model are already downloaded to galen at `/home/rothlab/common/deepconsensus/`.
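Before submitting anything, you can quickly confirm that the model files are present (a minimal check; the checkpoint itself lives in the model/ subdirectory referenced by deepcons.sh below):
#list the shared DeepConsensus model directory
ls /home/rothlab/common/deepconsensus/model/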
If you did not receive raw CLR reads, you won't be able to run DeepConsensus. Skip ahead to the quality filtering section instead. But if you do have CLR data:
- Make a folder in which you would like to store the results of this DeepConsensus run.
#this is obviously just an example, feel free to come up with your own file system organization.
mkdir -p ~/data/pacbio/LDLR202301 && cd ~/data/pacbio/LDLR202301
#softlink the CLR subreads BAM file:
ln -s /home/rothlab/common/seq_data/pacbio/LDLR/6F3KPHW/COT22931.20230213/LDLR-R05/CLR/m64308e_230206_211453.subreads.bam
- Start a DeepConsensus master job on the cluster. This will spawn additional worker jobs. Even the master job does some heavy lifting though and requires at least 12 CPU cores and 24GB of RAM. It will also run for a fairly long time, depending on the size of the dataset, so I'm giving it a deadline of 7 days here:
submitjob.sh -n deepcons -c 12 -m 24 -t 7-00:00:00 -l dc.log -e dc.log -- deepcons.sh m64308e_230206_211453.subreads.bam
- Wait for your job to complete. You can monitor it with `squeue -u $USER` or `waitForJobs.sh -v`.
To see the full list of options for this script, consult the `--help` option:
deepcons.sh --help
deepcons.sh v0.0.2
by Jochen Weile <[email protected]> 2023
Runs parallel CCS and DeepConsensus analysis on a Pacbio BAM file using a HPC cluster.
Recommended number of CPUs to run this script: >12
Usage: deepcons.sh [-j|--jobs <INTEGER>] [-o|--outdir <PATH>]
[-m|--modelpath <PATH>] [-q|--minRQ <FLOAT>] [-b|--blacklist <NODES>]
[--] <BAMFILE>
-j|--jobs : Number of jobs to run in parallel. Defaults to 30
-o|--outdir : Path to output directory. Defaults to ./
-m|--modelpath: Path to DeepConsensus Model. Defaults to /home/rothlab/common/deepconsensus/model/checkpoint
-q|--minRQ : Minimum read accuracy (RQ). Default 0.998
-b|--blacklist: Blacklist of nodes without AVX support. Default: galen1,galen2,galen3,galen4,galen5,galen6,galen7,galen8,galen9,galen10,galen11,galen13,galen14,galen15,galen16,galen17,galen20,galen22,galen23,galen24,galen25,galen26,galen27,galen28,galen30,galen31,galen32,galen34,galen35,galen36,galen37,galen38,galen40,galen41,galen42,galen43,galen44,galen45,galen46,galen68
<BAMFILE> : The Pacbio BAM file to process
The blacklist argument is automatically populated here, but is fairly important, as DeepConsensus only runs on machines with AVX support, which only a minority of galen nodes have. This may change in the future, in which case the blacklist will need to be amended.
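If you ever need to check whether a particular node supports AVX (for example, to update the blacklist), you can inspect its CPU flags directly (a minimal sketch; run it on the node in question, e.g. in an interactive job):
#print the AVX-related CPU feature flags of the current machine (no output means no AVX support)
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u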
While the script runs, it will generate a temporary directory called `shards/` in which intermediate files are stored. After the full process has completed, that directory will be deleted again, and your working directory should look something like this:
68K Feb 18 03:13 dc.log
2.8G m64308e_230206_211453.ccs.bam <- The output of regular Pacbio CCS (in uBAM format)
20M m64308e_230206_211453.ccs.bam.pbi <- An index file for the latter
1.7G m64308e_230206_211453.ccs.fastq.gz <- A FASTQ version of the CCS output
3.6K m64308e_230206_211453_ccs_reports.tgz <- A tarball with QC data on the CCS process
1.4G m64308e_230206_211453.deep.fastq.gz <- The output of DeepConsensus in FASTQ format
1.7M m64308e_230206_211453_logs.tgz <- Log files from the CCS and DeepConsensus process
115 m64308e_230206_211453.subreads.bam <- The original input file.
Both the CCS and DeepConsensus outputs have already been filtered for read quality based on the threshold provided in the `-q` argument (0.998 by default).
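As a quick sanity check, you can count how many reads made it into the final DeepConsensus FASTQ (a minimal sketch; substitute your own file name):
#each FASTQ record spans four lines, so divide the line count by 4
echo $(( $(zcat m64308e_230206_211453.deep.fastq.gz | wc -l) / 4 ))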
If you already ran DeepConsensus, you can skip this section and go straight to running Pacybara.
If you did not receive raw CLR reads from the sequencing provider and were thus unable to run DeepConsensus, you'll only have pre-computed CCS data. In that case you'll want to apply a quality filter to the CCS data directly. The `pacbioTools` script collection we installed earlier makes this easier as well.
- Make a folder in which you would like to store the results of the filter tool.
#this is obviously just an example, feel free to come up with your own file system organization.
mkdir -p ~/data/pacbio/LDLR202301 && cd ~/data/pacbio/LDLR202301
#softlink the CCS reads BAM file:
ln -s /home/rothlab/common/seq_data/pacbio/LDLR/6F3KPHW/COT22931.20230213/LDLR-R05/CCS/m64308e_230206_211453.reads.bam
- Start a job running the `pacbioFilterRQ.sh` script from `pacbioTools`. This will filter reads based on PacBio's internal `rq` quality metric, which estimates a given read's accuracy.
submitjob.sh -n pbfilter -t 24:00:00 -c 2 -m 4G -l pbfilter.log -e pbfilter.log -- pacbioFilterRQ.sh m64308e_230206_211453.reads.bam --qualityCutoff 0.998
- If you're interested in a breakdown of the quality scores in the CCS file, you can also use the `pacbioCheckQuality.sh` script to generate some QC plots:
submitjob.sh -n pbqc -t 24:00:00 -c 2 -m 4G -l pbqc.log -e pbqc.log -- pacbioCheckQuality.sh m64308e_230206_211453.reads.bam
- Wait for your jobs to complete. You can monitor them with `squeue -u $USER` or `waitForJobs.sh -v`.
The installation depends on `conda`. If you don't have `conda`, please revisit the Galen setup page.
- Clone the `pacybara` GitHub repository:
#create a folder in your home directory for the repo
mkdir ~/repos && cd ~/repos
#clone the repo
git clone https://github.com/rothlab/pacybara.git
- Run the installer:
cd pacybara
./install.sh
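To confirm that the installer completed successfully, you can check that the pacybara conda environment was created and that the main script is on your PATH (a minimal sanity check, assuming the installer sets up both):
#the environment list should now include an entry named "pacybara"
conda env list
#the main pipeline script should be findable by name
which pacybara.sh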
- Set up a workspace directory for all the files related to your pacybara run
#This is just a recommendation, feel free to name or organize your files as you wish.
mkdir -p ~/projects/pacybaraWorkspace && cd ~/projects/pacybaraWorkspace
mkdir fastqs parameters outputs
#create a link to the fastq file in the fastqs directory
ln -s -t fastqs/ ~/data/pacbio/LDLR202301/m64308e_230206_211453.deep.fastq.gz
#create an output folder for the run, for example:
mkdir -p outputs/LDLR_20230101/
- Use a text editor like nano to create a parameter sheet TXT file:
#for example:
nano
For an example of the parameter sheet format, refer to this template. Take special care with the following components:
- The `INFASTQ` and `WORKSPACE` parameters in this example should point to `~/projects/pacybaraWorkspace/fastqs/m64308e_230206_211453.deep.fastq.gz` and `~/projects/pacybaraWorkspace/outputs/LDLR_20230101/`, respectively.
- As part of the parameter sheet you will need to define the amplicon sequence and note the start and end positions of the ORF within it. Also take note of the barcode degeneracy code(s) in the reference file. These need to be recorded in the `ORFSTART`, `ORFEND` and `BARCODE` fields of the sheet.
- Carefully choose the values for `MINBCQ` and `MINQUAL`, as they depend on the overall distribution of Q-scores in your input FASTQ, which can differ drastically depending on the number of passes in the PacBio run (and are usually grossly over-inflated). If your input FASTQ file was processed via DeepConsensus you will need to lower these values, as DeepConsensus produces much more reasonable Q-values than CCS. For example, for CCS data filtered at RQ>0.998, `MINQUAL=67` and `MINBCQ=70` is a good starting point, whereas for DeepConsensus, `MINQUAL=27` and `MINBCQ=32` would be better. (Important: Don't just take these examples at face value! Check your actual FASTQ file contents first to get a feel for the Q-score distribution; see the sketch after this list.)
- For the `CLUSTERMODE` entry, consider whether you want to cluster with respect to the uptag, the downtag, or both (i.e. "virtual"). To catch and exclude PCR chimeras, virtual barcodes are best, but since such chimeras will ultimately be filtered out, that may hurt your coverage. So if your coverage is already low, you may need to ignore chimeras and use uptag- or downtag-based clustering instead (depending on which barcode you're planning to short-read sequence for your screen).

To save and close the parameter sheet with nano, use `CTRL-S` and `CTRL-X`, respectively.
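To get a rough feel for the Q-score distribution, you can tabulate the mean per-read Q-scores of the first few thousand reads directly from the FASTQ file. This is just a minimal sketch assuming standard Phred+33 encoding; the pacbioCheckQuality.sh script mentioned earlier gives a more thorough picture:
#rough histogram of mean per-read Q-scores (Phred+33) over the first 1000 reads
zcat fastqs/m64308e_230206_211453.deep.fastq.gz | head -n 4000 | awk '
  BEGIN{ for(i=33;i<127;i++) ord[sprintf("%c",i)]=i }
  NR%4==0{ s=0; for(i=1;i<=length($0);i++) s+=ord[substr($0,i,1)]-33; print int(s/length($0)) }
' | sort -n | uniq -c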
- Activate the pacybara conda environment:
conda activate pacybara
- Submit the pacybara job. Here we assume your parameter sheet is now located at `~/projects/pacybaraWorkspace/parameters/pacyParam_LDLR_20230101.txt`. We will request 12 CPU cores, 24GB of RAM and 36 hours of runtime for the job:
submitjob.sh -n myPacybaraJob -c 12 -m 24G -t 36:00:00 -l pacybara.log -e pacybara.log --conda pacybara -- pacybara.sh ~/projects/pacybaraWorkspace/parameters/pacyParam_LDLR_20230101.txt
- Wait for your job to complete. You can follow the progress via `tail -f pacybara.log`. (To exit `tail`, use `CTRL-C`.)
The output directory will contain multiple items:
- A BAM file of all the reads aligned against the amplicon.
- A tarball containing the log files of all the parallel alignments.
- A directory ending in `_extract`. This contains FASTQ files of the extracted barcodes for each read and a file called `genotypes.csv.gz`, which contains the extracted ORF genotypes for each read.
- A directory ending in `_clustering`. This contains a number of intermediate files, as well as the final cluster tables:
  - `clusters_transl.csv.gz` contains an unfiltered table of all final clusters (i.e. clones) with their barcodes and genotypes.
  - `clusters_transl_filtered.csv.gz` is the same, but filtered for clones supported by >1 CCS read and without barcode collisions.
  - `clusters_transl_softfilter.csv.gz` is filtered less strictly. It still requires >1 CCS read, but allows clones from barcode collisions as long as they dominate that barcode with at least a 2/3 majority.
  - Within the `*_clustering` directory you will also find a `qc/` sub-directory, which contains a number of QC plots.
In the `*_clustering/qc/` folder, there are several plots to visualize aspects of the data:
- `barcodeBias_*.pdf` shows whether there are nucleotide biases in the barcode.
- `clusterSizes.pdf` shows a histogram of reads per clone before and after clustering (i.e. using traditional perfect barcode matching vs. pacybara's clustering method).
- `compromisedBC.pdf` shows the breakdown of the soft filtering method above (i.e. how many clones had unique barcodes, vs. non-unique with a dominant clone and non-unique without a dominant clone).
- `extractionQC.pdf` shows how many reads were successfully processed vs. how many reads were disqualified for various reasons.
- `jackpots.pdf` shows whether there are any suspiciously highly abundant clones.
- `nuclBias.pdf` shows nucleotide biases in SNVs and MNVs found in the library (separated by position within a given codon).
- `pb_census.pdf` shows how many clones had how many mutations.
- `pb_coverage.pdf` shows a spatial coverage heatmap of mutations across the template sequence.
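Once the run is done, a quick way to gauge library complexity is to count the clones in the final cluster tables (a minimal sketch; run it inside the *_clustering directory and subtract one line per file for the header):
#number of lines (clones plus header line) in each final cluster table
for f in clusters_transl*.csv.gz; do echo -n "$f: "; zcat "$f" | wc -l; done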
The pacybartender wrapper requires a library of barcode associations. This table can be made using any of the `clusters_transl*` tables, depending on the desired filtering level. I recommend using the `softfilter` version for best results. To convert it to the required format, you can use the `pacybara_preplib.R` script:
#create a folder where we will store our library file
mkdir -p ~/projects/pacybaraWorkspaces/libraries/
#create the library file
pacybara_preplib.R clusters_transl_softfilter.csv.gz ~/projects/pacybaraWorkspaces/libraries/LDLR_20230101_barcodeLibrary.csv --mode up2up
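You can take a quick look at the resulting library file to confirm it was written as expected (a minimal check; the exact column layout depends on the chosen mode):
#preview the first few lines of the barcode library
head ~/projects/pacybaraWorkspaces/libraries/LDLR_20230101_barcodeLibrary.csv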
Here is the full help output for the script:
pacybara_preplib.R --help
usage: pacybara_preplib.R [--] [--help] [--opts OPTS] [--mode MODE]
clusters outfile
Prep Pacybara libraries for use in barseq
positional arguments:
clusters translated clusters csv.gz file
outfile output file name
flags:
-h, --help show this help message and exit
optional arguments:
-x, --opts RDS file containing argument values
-m, --mode translation mode. Valid arguments: up2up, down2down,
virt2up, virt2down. 'up2up' directly writes uptag-based
clusters to an uptag library. 'down2down' directly writes
downtag-based clusters to a downtag library. 'virt2up'
converts virtual-barcode-based clusters into an uptag
library. 'virt2down' converts virtual-barcode-based
clusters into a downtag library. [default: up2up]
Pacybartender is the new and improved pipeline for processing short-read BarSeq data and matching it against established barcodes from long-read libraries. The previous approach (implemented as barseqPro) matched each individual read against the reference barcodes directly. This method had the downside of incorrectly associating reads with barcodes simply because they were within a certain number of errors, even when they actually represented their own clusters that were not captured in the reference library due to bottlenecking. Pacybartender instead uses bartender (Zhao et al., 2018) to cluster candidate barcodes from the short-read data first, and only then matches the barcodes of the short-read clusters against the reference (long-read) clusters.
While the pipeline itself comes pre-installed with Pacybara, we still need to install the underlying `bartender` program and its dependencies. Unfortunately, bartender is written for the outdated Python version 2.7, which is no longer installed by default on the `galen` cluster, so we first need to install the latter:
conda create -y -n py27 python=2.7
After that is done, we can install bartender:
#we're re-using the "repos" folder we created earlier in this manual, but feel free to use a different location
mkdir -p ~/repos && cd ~/repos
#clone a forked repo that contains some changes adapting it to the environment on galen
git clone https://github.com/jweile/bartender-1.1
#go to the repo folder
cd bartender-1.1
#compile the binaries from source code
make all
#and install them
make install
Give it a test to see whether the installation works in conjunction with the Python 2.7 conda environment:
conda run -n py27 bartender_extractor_com --help
- Set up a workspace directory for all the files related to your pacybartender run
#In this example we're re-using the pacybaraWorkspace folder from earlier, but feel free to name or organize your files as you wish.
cd ~/projects/pacybaraWorkspace
#Technically we already created these folders earlier for pacybara, but I'm listing them here again as a reminder.
mkdir -p parameters outputs
- Use a text editor like nano to create a parameter sheet TXT file:
#for example:
nano
For an example of the parameter sheet format, refer to this template.
- When designing the sample table, make sure that it uses tab-separation and that the sample names correspond directly to the names of the FASTQ files.
- Make sure to set the `REVCOMP` flag correctly, depending on whether your reads are in the same orientation as the ORF.
- The `LIBRARY` variable should point to the location of the barcode library that we generated using `pacybara_preplib.R` earlier. (In the example above, that was `~/projects/pacybaraWorkspaces/libraries/LDLR_20230101_barcodeLibrary.csv`.)
- The flanking sequences should always be given in the same orientation as the original ORF sequence. If the reads are in reverse complement, don't invert the flanking sequences, but use the `REVCOMP` flag instead (as mentioned above).
- Run pacybartender via `submitjob.sh`, supplying the appropriate conda environment. Here we assume that the FASTQ files for this barseq run are located at `~/data/LDLR_20230101_barseq_FASTQ/`.
cd ~/projects/pacybaraWorkspace/outputs
submitjob.sh -n pbt -c 12 -m 24G -t 7-00:00:00 -l pbt.log -e pbt.log -- \
pacybartender.sh --conda py27 ~/data/LDLR_20230101_barseq_FASTQ/ ../parameters/barseqParam_LDLR_20230101.txt
- As before, wait for your jobs to complete. You can follow the progress via `tail -f pbt.log`. (To exit `tail`, use `CTRL-C`.)
After Pacybartender has completed successfully, the output folder should contain the following elements:
- A copy of the parameter sheet and a samples.txt file (which simply contains the sample table from the parameter sheet).
- A `logs` directory, containing log files for each sample-specific job that was run on the cluster.
- An `extract` directory, which contains the barcode sequences that bartender extracted for each read in each sample, as well as a CSV file that lists the cluster memberships inferred for each of those barcodes.
- A `counts` directory, which contains the following:
  - Barcode cluster counts/sizes for each sample.
  - `allCounts.csv`, which maps those cluster sizes to the reference library (from the long-read data).
  - `noMatchCounts.csv`, which lists all clusters (and their sizes) that have no equivalent match in the reference library.
  - `qualityMatrices.tgz`, which contains the cluster quality information for each cluster (see the bartender documentation for more detail).
- A `scores` directory, with the following elements:
  - `allLRs.csv`, which lists the pre-/post-selection log-ratios for each clone (i.e. cluster).
  - `allScores.csv`, which contains all scoring information for clones in all selection conditions.
  - `all_aa_scores_*.csv`, which contain condition-specific scores resolved to the amino-acid-change level (rather than separate scores for each clone).
  - `joint_scores_*.csv`, which contain the simplified (MaveDB-formatted) scores at the amino-acid level for a given selection condition.
- A `qc` directory with the following elements:
  - `freqCV`: A scatterplot of frequency vs. coefficient of variation.
  - `readFates`: A barplot showing a breakdown of the ultimate fates of the reads in each sample, i.e. whether a barcode was successfully extracted, whether it was filtered out, whether it clustered, and whether it successfully matched a reference clone.
  - `barseqDistributions*.pdf`: Histograms of the score distributions for WT, synonymous, nonsense, frameshift and missense variants in a given condition.
  - `bcFreqDistribution.pdf`: A histogram of the distribution of variant frequencies.