
Analyzing BarSeq data

Using Full CLR reads from Hifi data

Even though we use PacBio Hifi runs, we ideally request the raw CLR (continuous-long-read) data for each run, as this enables us to run DeepConsensus for improved circular consensus read quality. If you don't have the raw CLR data, you'll have to contend with the default CCS (circular consensus sequence) result, which is slightly more prone to false positive indel errors.

Downloading Pacbio data from TCAG

  1. You will receive an email from TCAG notifying you that your data transfer is ready. Write down the data transfer ID from the email. It should be a short alphanumeric code such as VIOZRM9. Also write down your TCAG username and password, as we will need them later.
  2. Log into rothseq1 (ssh [email protected]). If you're located at the Donnelly Center or working from home, you'll need to turn on the LTRI VPN first.
  3. If you haven't installed the TCAG client software on rothseq1 yet, you can copy the binary from Jochen's bin directory: cp /home/rothlab/jweile/bin/tcag-client-1.4.2 ~/bin/. (This assumes you already have a bin/ directory and it's on your $PATH; if not, revisit the Galen setup.)
  4. Go to the pacbio directory in the rothlab common folder: cd /home/rothlab/common/seq_data/pacbio/
  5. In that directory you will find gene-specific subdirectories. Go to the one applicable to your project (cd LDLR/) or create a new one if necessary (mkdir LDLR && cd LDLR).
  6. We will now download the dataset using the TCAG client. However, since this can potentially take days, we want to dissociate this process from the terminal so that it continues running even after we log out. Normally, we would use nohup for such a situation. However, the TCAG client still requires some manual keyboard input for certain prompts early on. Therefore, we'll have to use a workaround: We will start the process with an output redirection to a log file. Then we can use a second ssh connection to monitor the contents of the log file to see the prompts we need to answer. After all the prompts have been answered, we can dissociate the process from the terminal. (The full command sequence is recapped after this list.)
    • Start a second ssh session from a secondary terminal (ssh [email protected]) and go to the same directory: cd /home/rothlab/common/seq_data/pacbio/LDLR/.
    • Back in the first window, start the tcag client download with an output redirection to a file we will call tcag.log. In the example here, I'm using the transfer ID VIOZRM9, but you'll need to replace this with your own transfer ID: tcag-client-1.4.2 download -p VIOZRM9:/ > tcag.log 2>&1.
    • Switch back to the second terminal to display the prompts from the client that are being redirected into the log file: tail -f tcag.log
    • Switch back to the first terminal again and answer the prompts for your username and password there (the text will appear in the second window even as you type in the first).
    • After all the prompts have been answered, the download will start. Now we can move the process to the background and dissociate it from the terminal. To do so, first suspend the process using CTRL-Z. In the terminal prompt that appears, type bg to resume the process in the background. You will see the download spring back to life in the second terminal. Now all that is left to do is the terminal dissociation: disown -h %1.
    • You can now log out of the first terminal (exit), while the process continues running, as evidenced by the continued output appearing in the log file in the second window.
    • You can either stop following the log file now (CTRL-C) and log out (exit) or continue following the download if you like.
  7. After a few days, the download should be complete. You can check on it periodically by looking at the log file (via tail tcag.log).
  8. Inspect the files. The folder structure will be a bit convoluted. The top folder should be named according to the transfer number. Inside that will be another folder named after the TCAG order ID. Within that will be the different sub-projects (e.g. for different genes), and finally within that should be folders for CLR and CCS results (depending on what was ordered). Each CLR folder should contain a file ending in .subreads.bam. This is the set of full-length CLR reads, which can be processed with DeepConsensus. If you don't have the raw CLR data, you'll just have a file ending in reads.bam, which contains the processed CCS data.
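
As a quick recap of step 6, the full sequence in the first terminal looks like this (using the example transfer ID from above; CTRL-Z is of course a keystroke, not typed-out text):

#terminal 1: start the download with all output redirected to a log file
tcag-client-1.4.2 download -p VIOZRM9:/ > tcag.log 2>&1
#answer the username/password prompts, then press CTRL-Z to suspend
bg            #resume the suspended download in the background
disown -h %1  #dissociate it from the terminal
#terminal 2 (in the same directory): follow the prompts and progress
tail -f tcag.log

And to locate the BAM files within the convoluted folder structure mentioned in step 8, a simple find command helps:

find . -name "*.bam" -ls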

Demultiplexing Pacbio data

If you used index tags to multiplex your pacbio library, you'll need to use the lima tool to demultiplex your reads. Depending on how the barcoding was done, the demultiplexing will need to be run with different parameters. Consult the Lima Manual and lima --help for more details. Most importantly, find out whether each sample is meant to have identical upstream and downstream tags, or whether they are different (which must then be reflected by using the --same or --different flag).

  1. Install lima via conda:
conda install -c bioconda lima
  2. Make a folder in which you would like to store the results of the demultiplexing process:
#this is obviously just an example, feel free to come up with your own file system organization.
mkdir -p ~/data/pacbio/LDLR202301 && cd ~/data/pacbio/LDLR202301
#softlink the CLR subreads BAM file (you could point to the file directly instead, but the link makes it easier to trace where older results came from)
ln -s /home/rothlab/common/seq_data/pacbio/LDLR/6F3KPHW/COT22931.20230213/LDLR-R05/CLR/m64308e_230206_211453.subreads.bam
  3. Place a FASTA file containing the index tags in the folder. The default indices can be found here. (A hypothetical example of the FASTA format is shown after this list.)
  4. Submit your lima job to the cluster. Consult the Lima Manual for more details. Here we assume that the default barcodes were used and that the upstream and downstream barcodes must be identical.
submitjob.sh -n lima -c 12 -m 24 -t 1-00:00:00 -l lima.log -e lima.log -- \
  lima m64308e_230206_211453.subreads.bam Sequel_RSII_96_barcodes_v1.fasta m64308e_230206_211453.subreads.demux.bam --same --min-score 80 --num-threads 12 --split-named
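
For reference, the barcode FASTA simply lists one named sequence per index tag. Purely as an illustration of the format (the names and sequences below are made-up placeholders, not the actual default barcodes):

>bc01
ACGTACGTACGTACGT
>bc02
TGCATGCATGCATGCA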

Installing DeepConsensus and PacbioTools

Even if you only have CCS data to work with, it's still worth installing pacbioTools to make the process of quality filtering easier.

  1. Log into galen
  2. If you haven't installed pacbioTools yet, do so now:
#download pacbio tools scripts
mkdir -p ~/downloads && cd ~/downloads
wget https://github.com/rothlab/pacbioTools/archive/refs/heads/main.zip
unzip main.zip
#copy scripts to bin folder (assuming that $HOME/bin is on your $PATH)
cp pacbioTools-main/*.sh* ~/bin/
  3. Use conda to install the following tools: pbccs, actc, and pbbam:
conda install -c bioconda actc pbccs pbbam
  4. Install DeepConsensus:
pip install deepconsensus[cpu]==1.2.0
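
To confirm that the installation succeeded, you can ask pip for the installed package info:

pip show deepconsensus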

DeepConsensus also needs access to the underlying deep learning model. The files for this model are already downloaded to galen at /home/rothlab/common/deepconsensus/

Running DeepConsensus

If you did not receive raw CLR reads, you won't be able to run DeepConsensus. Skip ahead to the quality filtering section instead. But if you do have CLR data:

  1. Make a folder in which you would like to store the results of this DeepConsensus run.
#this is obviously just an example, feel free to come up with your own file system organization.
mkdir -p ~/data/pacbio/LDLR202301 && cd ~/data/pacbio/LDLR202301
#softlink the CLR subreads BAM file:
ln -s /home/rothlab/common/seq_data/pacbio/LDLR/6F3KPHW/COT22931.20230213/LDLR-R05/CLR/m64308e_230206_211453.subreads.bam
  2. Start a DeepConsensus master job on the cluster. This will spawn additional worker jobs. Even the master job does some heavy lifting though, and requires at least 12 CPU cores and 24GB of RAM. It will also run for a fairly long time, depending on the size of the dataset, so I'm giving it a deadline of 7 days here:
submitjob.sh -n deepcons -c 12 -m 24 -t 7-00:00:00 -l dc.log -e dc.log -- deepcons.sh m64308e_230206_211453.subreads.bam
  3. Wait for your job to complete. You can monitor it with squeue -u $USER or waitForJobs.sh -v.

To see the full list of options for this script, consult the --help option:

 deepcons.sh --help
deepcons.sh v0.0.2

by Jochen Weile <[email protected]> 2023

Runs parallel CCS and DeepConsensus analysis on a Pacbio BAM file using a HPC cluster.
Recommended number of CPUs to run this script: >12

Usage: deepcons.sh [-j|--jobs <INTEGER>] [-o|--outdir <PATH>] 
    [-m|--modelpath <PATH>] [-q|--minRQ <FLOAT>] [-b|--blacklist <NODES>]
    [--] <BAMFILE>

-j|--jobs     : Number of jobs to run in parallel. Defaults to 30
-o|--outdir   : Path to output directory. Defaults to ./
-m|--modelpath: Path to DeepConsensus Model. Defaults to /home/rothlab/common/deepconsensus/model/checkpoint
-q|--minRQ    : Minimum read accuracy (RQ). Default 0.998
-b|--blacklist: Blacklist of nodes without AVX support. Default: galen1,galen2,galen3,galen4,galen5,galen6,galen7,galen8,galen9,galen10,galen11,galen13,galen14,galen15,galen16,galen17,galen20,galen22,galen23,galen24,galen25,galen26,galen27,galen28,galen30,galen31,galen32,galen34,galen35,galen36,galen37,galen38,galen40,galen41,galen42,galen43,galen44,galen45,galen46,galen68
<BAMFILE>     : The Pacbio BAM file to process

The blacklist argument is automatically populated here, but is fairly important, as DeepConsensus only runs on machines with AVX support, which only a minority of galen nodes have. This may change in the future, in which case the blacklist will need to be amended.
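
If you want to check whether a particular machine supports AVX yourself, a quick way on Linux is to look for avx flags in /proc/cpuinfo:

#prints the AVX-related CPU flags; no output means the node lacks AVX support
grep -o 'avx[^ ]*' /proc/cpuinfo | sort -u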

While the script runs, it will generate a temporary directory called shards/ in which intermediate files are stored. After the full process has completed, that directory will be deleted again, and your working directory should look something like this:

 68K dc.log                                <- Log output from the master job
2.8G m64308e_230206_211453.ccs.bam         <- The output of regular Pacbio CCS (in uBAM format)
 20M m64308e_230206_211453.ccs.bam.pbi     <- An index file for the latter
1.7G m64308e_230206_211453.ccs.fastq.gz    <- A FASTQ version of the CCS output
3.6K m64308e_230206_211453_ccs_reports.tgz <- A tarball with QC data on the CCS process
1.4G m64308e_230206_211453.deep.fastq.gz   <- The output of DeepConsensus in FASTQ format
1.7M m64308e_230206_211453_logs.tgz        <- Log files from the CCS and DeepConsensus process
 115 m64308e_230206_211453.subreads.bam    <- The original input file.

Both the CCS and DeepConsensus output have already been filtered for read quality based on the threshold provided in the -q argument (0.998 by default).
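
As a quick sanity check, you can count how many reads survived the filter (each FASTQ record spans four lines):

#count reads in the DeepConsensus output
echo $(( $(zcat m64308e_230206_211453.deep.fastq.gz | wc -l) / 4 ))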

Quality filtering pre-computed CCS data

If you already ran DeepConsensus, you can skip this section and go straight to running Pacybara.

If you did not receive raw CLR reads from the sequencing provider and were thus unable to run DeepConsensus, you'll only have pre-computed CCS data. Therefore, you'll want to apply a quality filter to the CCS data instead. The pacbioTools script collection we installed earlier will make this easier as well.

  1. Make a folder in which you would like to store the results of the filter tool.
#this is obviously just an example, feel free to come up with your own file system organization.
mkdir -p ~/data/pacbio/LDLR202301 && cd ~/data/pacbio/LDLR202301
#softlink the CCS reads BAM file:
ln -s /home/rothlab/common/seq_data/pacbio/LDLR/6F3KPHW/COT22931.20230213/LDLR-R05/CCS/m64308e_230206_211453.reads.bam
  2. Start a job running the pacbioFilterRQ.sh script from pacbioTools. This will filter reads based on pacbio's internal rq quality metric, which estimates a given read's accuracy.
submitjob.sh -n pbfilter -t 24:00:00 -c 2 -m 4G -l pbfilter.log -e pbfilter.log -- pacbioFilterRQ.sh m64308e_230206_211453.reads.bam --qualityCutoff 0.998
  3. If you're interested in a breakdown of the quality scores in the CCS file, you can also use the pacbioCheckQuality.sh script to generate some QC plots. (A quick command-line alternative is also sketched after this list.) Note that we use a separate job name and log file here, so the output doesn't collide with the filter job above:
submitjob.sh -n pbqc -t 24:00:00 -c 2 -m 4G -l pbqc.log -e pbqc.log -- pacbioCheckQuality.sh m64308e_230206_211453.reads.bam
  4. Wait for your jobs to complete. You can monitor them with squeue -u $USER or waitForJobs.sh -v.
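
If you'd just like a quick impression of the rq values themselves, here is a minimal sketch using samtools (assuming samtools is available in your environment):

#print the 10 lowest rq values among the first 1000 reads
samtools view m64308e_230206_211453.reads.bam | head -n 1000 \
  | grep -o 'rq:f:[0-9.]*' | cut -d: -f3 | sort -n | head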

Installing Pacybara

The installation depends on conda. If you don't have conda, please revisit the Galen setup page.

  1. Clone the pacybara github repository:
#create a folder in your home directory for the repo
mkdir -p ~/repos && cd ~/repos
#clone the repo
git clone https://github.com/rothlab/pacybara.git
  2. Run the installer:
cd pacybara
./install.sh

Running Pacybara

  1. Set up a workspace directory for all the files related to your pacybara run
#This is just a recommendation, feel free to name or organize your files as you wish.
mkdir -p ~/projects/pacybaraWorkspace && cd ~/projects/pacybaraWorkspace
mkdir fastqs parameters outputs
#create a link to the fastq file in the fastqs directory
ln -s -t fastqs/ ~/data/pacbio/LDLR202301/m64308e_230206_211453.deep.fastq.gz 
#create an output folder for the run, for example:
mkdir -p outputs/LDLR_20230101/
  2. Use a text editor like nano to create a parameter sheet TXT file:
#for example:
nano parameters/pacyParam_LDLR_20230101.txt

For an example of the parameter sheet format, refer to this template. Take special care when looking at the following components:

  • The INFASTQ and WORKSPACE parameters in this example should point to ~/projects/pacybaraWorkspace/fastqs/m64308e_230206_211453.deep.fastq.gz and ~/projects/pacybaraWorkspace/outputs/LDLR_20230101/, respectively.
  • As part of the parameter sheet you will need to define the amplicon sequence and note the start and end positions of the ORF within. Also take note of the barcode degeneracy code(s) in the reference file. These will need to be recorded in the ORFSTART, ORFEND and BARCODE fields of the sheet.
  • Carefully choose the parameters for MINBCQ and MINQUAL, as they are dependent on the overall distribution of Q-scores in your input FASTQ, which can drastically differ depending on the number of passes in the Pacbio run (and are usually grossly over-inflated). If your input FASTQ file was processed via DeepConsensus, you will need to lower these values, as DeepConsensus produces much more reasonable Q-values than CCS. For example, for CCS data filtered at RQ>0.998, MINQUAL=67 and MINBCQ=70 is a good starting point, whereas for DeepConsensus, MINQUAL=27 and MINBCQ=32 would be better. (Important: Don't just take these examples at face value! Check your actual FASTQ file contents first to get a feel for the Q-score distribution; a minimal sketch for doing so follows after this list.)
  • For the CLUSTERMODE entry, consider whether you want to cluster with respect to the uptag, downtag or both (i.e. "virtual"). To catch and exclude PCR chimeras, virtual barcodes are best, but since such chimeras will be ultimately filtered out, that may hurt your coverage. So if your coverage is already low, you may just need to ignore chimeras and use uptag- or downtag-based clustering instead (depending on which barcode you're planning to short-read sequence for your screen.) To save and close the parameter sheet with nano, use CTRL-S and CTRL-X respectively.
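
As mentioned above, it's worth eyeballing the actual Q-score distribution before settling on MINQUAL and MINBCQ. Here is a minimal sketch (assuming standard Phred+33 FASTQ encoding) that prints the lowest and highest per-read mean Q-score among the first 1000 reads:

zcat fastqs/m64308e_230206_211453.deep.fastq.gz | head -n 4000 \
  | awk 'BEGIN { for (i=33; i<127; i++) ord[sprintf("%c",i)] = i }
         NR % 4 == 0 { s = 0
           for (i=1; i<=length($0); i++) s += ord[substr($0,i,1)] - 33
           printf "%.1f\n", s/length($0) }' \
  | sort -n | sed -n '1p;$p'   #first line = lowest mean Q, second = highest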
  3. Activate the pacybara conda environment:
conda activate pacybara
  4. Submit the pacybara job. Here we assume your parameter sheet is now located at ~/projects/pacybaraWorkspace/parameters/pacyParam_LDLR_20230101.txt. We will request 12 CPU cores, 24GB of RAM and 36 hours of runtime for the job:
submitjob.sh -n myPacybaraJob -c 12 -m 24G -t 36:00:00 -l pacybara.log -e pacybara.log --conda pacybara -- pacybara.sh ~/projects/pacybaraWorkspace/parameters/pacyParam_LDLR_20230101.txt
  5. Wait for your job to complete. You can follow the progress with tail -f pacybara.log. (To exit tail, use CTRL-C.)

Pacybara Output

The output directory will contain multiple items:

  • A BAM file of all the reads aligned against the amplicon
  • A tarball containing the log files of all the parallel alignments
  • A directory ending in _extract. This contains fastq files of the extracted barcodes for each read and a file called genotypes.csv.gz which contains the extracted ORF genotypes for each read.
  • A directory ending in _clustering. This contains a number of intermediate files, as well as the final cluster tables (a quick way to inspect them is shown after this list):
    • clusters_transl.csv.gz contains an unfiltered table of all final clusters (i.e. clones) and barcodes and genotypes.
    • clusters_transl_filtered.csv.gz is the same, but filtered for clones supported by >1 CCS read and no barcode collisions.
    • clusters_transl_softfilter.csv.gz is filtered less strictly. It still requires >1 CCS read, but allows clones from barcode collisions as long as they dominate that barcode with at least a 2/3 majority.
    • Within the *_clustering directory you will also find a qc/ sub-directory. This contains a number of QC plots.
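
To take a quick look at these cluster tables from the command line (the clone count below assumes the table has a single header line):

#peek at the first few entries
zcat clusters_transl_filtered.csv.gz | head -n 5
#count the clones that passed filtering
zcat clusters_transl_filtered.csv.gz | tail -n +2 | wc -l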

QC plots

In the *_clustering/qc/ folder, there are several plots to visualize aspects of the data:

  • barcodeBias_*.pdf shows whether there are nucleotide biases in the barcode
  • clusterSizes.pdf shows a histogram of reads per clone before and after clustering. (i.e. using traditional perfect barcode matching vs pacybara's clustering method)
  • compromisedBC.pdf shows the breakdown of the soft filtering method above (i.e. how many clones had unique barcodes, vs non-unique with a dominant clone and non-unique without a dominant clone)
  • extractionQC.pdf shows how many reads were successfully processed vs how many reads were disqualified for various reasons
  • jackpots.pdf shows whether there are any suspiciously highly-abundant clones
  • nuclBias.pdf shows nucleotide biases in SNVs and MNVs found in the library (separated by position in a given codon)
  • pb_census.pdf shows how many clones had how many mutations
  • pb_coverage.pdf shows a spatial coverage heatmap of mutations across the template sequence.

Processing short-read BarSeq data

Converting the Pacybara output for use with pacybartender

The pacybartender wrapper requires a library of barcode associations. This table can be made using any of the clusters_transl* tables, depending on the desired filtering level. I recommend using the softfilter version for best results. To convert it to the required format, you can use the pacybara_preplib.R script:

#create a folder where we will store our library file
mkdir -p ~/projects/pacybaraWorkspace/libraries/
#create the library file
pacybara_preplib.R clusters_transl_softfilter.csv.gz ~/projects/pacybaraWorkspace/libraries/LDLR_20230101_barcodeLibrary.csv --mode up2up

Here is the full help output for the script:

pacybara_preplib.R --help
usage: pacybara_preplib.R [--] [--help] [--opts OPTS] [--mode MODE]
       clusters outfile

Prep Pacybara libraries for use in barseq

positional arguments:
  clusters    translated clusters csv.gz file
  outfile     output file name

flags:
  -h, --help  show this help message and exit

optional arguments:
  -x, --opts  RDS file containing argument values
  -m, --mode  translation mode. Valid arguments: up2up, down2down,
              virt2up, virt2down. 'up2up' directly writes uptag-based
              clusters to an uptag library. 'down2down' directly writes
              downtag-based clusters to a downtag library. 'virt2up'
              converts virtual-barcode-based clusters into an uptag
              library. 'virt2down' converts virtual-barcode-based
              clusters into a downtag library. [default: up2up]

Installing Bartender and Pacybartender

Pacybartender is the new and improved pipeline for processing short-read BarSeq data and matching it against established barcodes from long-read libraries. The previous approach (implemented as barseqPro) matched each individual read against the reference barcodes directly. This method had the downside of incorrectly associating reads with barcodes that were only a certain number of errors away, even though those reads actually represented their own clusters that were not captured in the reference library due to bottlenecking. Pacybartender instead uses bartender (Zhao et al., 2018) to cluster candidate barcodes from the short-read data first, and only then matches the barcodes of the short-read clusters against the reference (long-read) clusters.

While the pipeline itself comes pre-installed with Pacybara, we still need to install the underlying bartender program and its dependencies. Unfortunately, bartender is written for the outdated python version 2.7, which is no longer installed by default on the galen cluster, so we first need to create a python 2.7 conda environment:

conda create -y -n py27 python=2.7
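
You can verify that the new environment resolves correctly:

conda run -n py27 python --version   #should print Python 2.7.x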

After that is done, we can install bartender:

#we're re-using the "repos" folder we created earlier in this manual, but feel free to use a different location
mkdir -p ~/repos && cd ~/repos
#clone a forked repo that contains some changes adapting it to the environment on galen
git clone https://github.com/jweile/bartender-1.1
#go to the repo folder
cd bartender-1.1
#compile the binaries from source code
make all
#and install them
make install

Give it a test to see whether the installation works in conjunction with the python2.7 conda environment:

conda run -n py27 bartender_extractor_com --help

Running Pacybartender

  1. Set up a workspace directory for all the files related to your pacybartender run
#In this example we're re-using the pacybaraWorkspace folder from earlier, but feel free to name or organize your files as you wish.
cd ~/projects/pacybaraWorkspace
#Technically we already created these folders earlier for pacybara, but I'm listing them here again as a reminder.
mkdir -p parameters outputs
  2. Use a text editor like nano to create a parameter sheet TXT file:
#for example:
nano parameters/barseqParam_LDLR_20230101.txt

For an example of the parameter sheet format, refer to this template.

  • When designing the sample table, make sure that it uses tab-separation, and that the sample names correspond directly to the names of the FASTQ files. (A purely illustrative snippet follows after this list.)
  • Make sure to set the REVCOMP flag correctly, depending on whether your reads are in the same orientation as the ORF.
  • The LIBRARY variable should point to the location of the barcode library that we generated using pacybara_preplib.R earlier. (In the example above, that was ~/projects/pacybaraWorkspace/libraries/LDLR_20230101_barcodeLibrary.csv)
  • The flanking sequences should always be given in the same orientation as the original ORF sequence. If the reads are in reverse complement, don't invert the flanking sequences, but use the REVCOMP flag instead (as mentioned above).
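
Purely to illustrate the tab-separation point, here is a hypothetical sample table snippet. The actual required columns are defined by the template linked above, so treat these headers and values as placeholders; the sample names would correspond to FASTQ files like LDLR_rep1_pre.fastq.gz:

sample	condition	replicate
LDLR_rep1_pre	pre	1
LDLR_rep1_post	sel	1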
  3. Run pacybartender via submitjob.sh and supply the appropriate conda environment. Here we assume that the FASTQ files for this barseq run are located at ~/data/LDLR_20230101_barseq_FASTQ/.
cd ~/projects/pacybaraWorkspace/outputs
submitjob.sh -n pbt -c 12 -m 24G -t 7-00:00:00 -l pbt.log -e pbt.log -- \
  pacybartender.sh --conda py27 ~/data/LDLR_20230101_barseq_FASTQ/ ../parameters/barseqParam_LDLR_20230101.txt
  4. As before, wait for your job to complete. You can follow the progress with tail -f pbt.log. (To exit tail, use CTRL-C.)

Pacybartender output

After Pacybartender has completed successfully, the output folder should contain the following elements:

  • A copy of the parameter sheet and a samples.txt file (which simply contains the sample table from the parameter sheet)
  • A logs directory, containing log files for each sample-specific job that was run on the cluster.
  • An extract directory, which contains the barcode sequences that bartender extracted for each read in each sample, as well as a CSV file that lists the cluster memberships inferred for each of those barcodes.
  • A counts directory, which contains the following:
    • Barcode cluster counts/sizes for each sample
    • allCounts.csv which maps those cluster sizes to the reference library (from the long-read data)
    • noMatchCounts.csv which lists all clusters (and their sizes) that have no equivalent match in the reference library.
    • qualityMatrices.tgz, which contains the cluster quality information for each cluster. (See bartender documentation for more detail).
  • A scores directory, with the following elements:
    • allLRs.csv which lists the pre-/post-selection log-ratios for each clone (i.e. cluster).
    • allScores.csv which contains all scoring information for clones in all selection conditions.
    • all_aa_scores_*.csv which contain condition-specific scores resolved to the amino-acid-change level (rather than separate scores for each clone).
    • joint_scores_*.csv which contains the simplified (MaveDB-formatted) scores at the amino-acid level for a given selection condition.
  • A qc directory with the following elements:
    • freqCV: A scatterplot of frequency vs coefficient of variation.
    • readFates: A barplot showing a breakdown of the ultimate fates of the reads in each sample. I.e., whether a barcode was successfully extracted, whether it was filtered out, whether it clustered, and whether it successfully matched a reference clone.
    • barseqDistributions*.pdf: Histograms of the score distributions for WT, synonymous, nonsense, frameshift and missense variants in a given condition.
    • bcFreqDistribution.pdf: A histogram of the distribution of variant frequencies.