CCS 6.0.0

armintoepfer · armintoepfer · commit 178f15a099e6 · 2020-12-16T10:12:44.000+01:00
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -6,7 +6,14 @@ nav_order: 99
 
 # Version changelog
 
-**5.0.0**
+**6.0.0**
+   * Increase number of HiFi reads
+   * Increase percentage of barcode yield
+   * Run time, CPU time, and peak RSS improvements
+   * Change main draft algorithm from pbdagcon to sparc
+   * Replace minimap2 with pancake and edlib/KSW2
+
+5.0.0
    * SMRT Link v10.0 release
    * Add `--hifi-kinetics` to average kinetic information for polished reads
    * Add `--all-kinetics` to add kinetic information for all ZMWs, except for unpolished draft consensus
diff --git a/docs/faq/bioconda-binary.md b/docs/faq/bioconda-binary.md
@@ -16,10 +16,10 @@ A modern (post-2008) CPU with support for
 SMRT Link also has this requirement.
 
 **`FATAL: kernel too old`** Your OS or rather your kernel version is not supported.
-Since CCS v4.2 we also ship a second binary via bioconda `ccs-alt`, which does
+Since _ccs_ v4.2 we also ship a second binary via bioconda `ccs-alt`, which does
 not bundle a newer `glibc`. Please use this alternative binary.
 
-For CCS v5.0, we offer two binaries in bioconda:
+For _ccs_, we offer two binaries in bioconda:
 
  * `ccs`, statically links `glibc` v2.32 and `mimalloc` v1.3.0.
  * `ccs-alt`, was build by dynamically linking `glibc` v2.12, but statically links `mimalloc` v1.3.0.
diff --git a/docs/faq/chemistry.md b/docs/faq/chemistry.md
@@ -3,6 +3,32 @@ layout: default
 parent: FAQ
 title: Chemistry
 ---
+## Supported chemistries
+The latest _ccs_ v6 supports the following combinations of binding and
+sequencing kit part numbers:
+
+| BindingKit  | SequencingKit |    Chemistry     |  System   |
+| :---------: | :-----------: | :--------------: | :-------: |
+| 101-500-400 |  101-427-500  | S/P3-C3/5.0      | Sequel    |
+| 101-500-400 |  101-427-800  | S/P3-C3/5.0      | Sequel    |
+| 101-500-400 |  101-646-800  | S/P3-C3/5.0      | Sequel    |
+| 101-490-800 |  101-490-900  | S/P3-C1/5.0-8M   | Sequel II |
+| 101-490-800 |  101-491-000  | S/P3-C1/5.0-8M   | Sequel II |
+| 101-490-800 |  101-644-500  | S/P3-C1/5.0-8M   | Sequel II |
+| 101-490-800 |  101-717-100  | S/P3-C1/5.0-8M   | Sequel II |
+| 101-717-300 |  101-644-500  | S/P3-C1/5.0-8M   | Sequel II |
+| 101-717-300 |  101-717-100  | S/P3-C1/5.0-8M   | Sequel II |
+| 101-717-400 |  101-644-500  | S/P3-C1/5.0-8M   | Sequel II |
+| 101-717-400 |  101-717-100  | S/P3-C1/5.0-8M   | Sequel II |
+| 101-789-500 |  101-789-300  | S/P4-C2/5.0-8M   | Sequel II |
+| 101-820-500 |  101-789-300  | S/P4.1-C2/5.0-8M | Sequel II |
+| 101-789-500 |  101-826-100  | S/P4-C2/5.0-8M   | Sequel II |
+| 101-789-500 |  101-820-300  | S/P4-C2/5.0-8M   | Sequel II |
+| 101-820-500 |  101-826-100  | S/P4.1-C2/5.0-8M | Sequel II |
+| 101-820-500 |  101-820-300  | S/P4.1-C2/5.0-8M | Sequel II |
+| 101-894-200 |  101-826-100  | S/P5-C2/5.0-8M   | Sequel II |
+| 101-894-200 |  101-789-300  | S/P5-C2/5.0-8M   | Sequel II |
+| 101-894-200 |  101-820-300  | S/P5-C2/5.0-8M   | Sequel II |
 
 ## Help! I am getting "Unsupported ..."!
 If you encounter the error `Unsupported chemistries found: (...)` or
diff --git a/docs/faq/low-complexity.md b/docs/faq/low-complexity.md
@@ -4,7 +4,7 @@ parent: FAQ
 title: Low complexity
 ---
 
-## Does CCS dislike low-complexity regions?
+## Does _ccs_ dislike low-complexity regions?
 Low-complexity comes in many shapes and forms.
 A particular challenge for _ccs_ are highly enriched tandem repeats, like
 hundreds of copies of `AGGGGT`.
@@ -13,7 +13,7 @@ a consensus sequence.
 Since _ccs_ v5.0, every ZMW is tested if it contains a tandem repeat
 of length `--min-tandem-repeat-length 1000`.
 For this, we use [symmetric DUST](https://doi.org/10.1089/cmb.2006.13.1028)
-and in particular this [sdust](https://github.com/lh3/sdust) implementation,
+and in particular the [sdust](https://github.com/lh3/sdust) implementation,
 but slightly modified.
 If a ZMW is flagged as a tandem repeat, internally `--disable-heuristics`
 is activated for only this ZMW, and various filters that are known to exclude
diff --git a/docs/faq/mode-all.md b/docs/faq/mode-all.md
@@ -11,7 +11,7 @@ Similar to the CLR instrument mode, in which subreads are accompanied by
 a scraps file, _ccs_ offers a new mode to never lose a single read due to
 filtering, without massive run time increase by polishing low-pass productive ZMWs.
 
-Starting with SMRT Link v10.0 and Sequel IIe, _ccs_ v5.0 is able to generate
+Starting with SMRT Link v10.0 and Sequel IIe, _ccs_ v5.0 or newer is able to generate
 one representative sequence per productive ZMW, irrespective of quality and passes.
 This ensures no yield loss due to filtering and enables users to have maximum
 control over their data. Never fear again that SMRT Link or the Sequel IIe
diff --git a/docs/faq/performance.md b/docs/faq/performance.md
@@ -4,26 +4,71 @@ parent: FAQ
 title: Performance
 ---
 
-## How fast is CCS?
-We tested CCS runtime using 500 ZMWs per length bin with exactly 7 passes.
+## How fast is _ccs_?
+### Latest version
+The latest _ccs_ v6 can process 200 GBases of HiFi yield in 24 hours for a 25 KBases
+library on 2x64 cores at 2.4 GHz.
+To put this into perspective for actual sequencing collections:
+
+|  Sample  | Insert size | HiFi Yield  | Run Time |
+| :------: | :---------: | :---------: | :------: |
+|  HG002   |  15 KBases  | 41.1 GBases |  5h 52m  |
+|  HG002   |  18 KBases  | 34.0 GBases |  4h 36m  |
+| Redwood  |  25 KBases  | 32.4 GBases |  3h 46m  |
+
+### Relative performance v3.0 to v6.0
+The current _ccs_ v6 achieves a >150x speed-up for 20 KBases inserts compared to
+v3.0 from the SMRT Link 6.0 release in 2018.
+
+### Algorithmic complexity
+To understand how this performance gain was possible, here is an overview of how we changed
+the algorithmic complexity and how _ccs_ scales with insert size and number of passes:
+
+| CCS version | O(insert size) |  O(#passes)   |
+| :---------: | :------------: | :-----------: |
+|   ≤3.0.0    |   quadratic    |    linear     |
+|    3.4.1    |   **linear**   |    linear     |
+|   ≥4.0.0    |     linear     | **sublinear** |
+
+To visualize this table, we benchmarked runtime using 500 ZMWs per length bin with
+exactly 7 passes.
 
 <img width="1000px" src="../img/runtime.png"/>
 
-### How does that translate into time to result per SMRT Cell?
-We will measure time to result for Sequel System and Sequel System II CCS sequencing collections
-on a PacBio recommended HPC, according to the
-[Sequel II System Compute Requirements](https://www.pacb.com/wp-content/uploads/SMRT_Link_Installation_v701.pdf)
-with 192 physical or 384 hyper-threaded cores.
+After v4.0.0, the slope of the curve does not change, as the complexity class
+hasn't changed; only improvements independent of input type were made.
+
+### Performance comparisons
+Performance comparisons on different libraries; the `faster` column is with
+respect to the run time of the previous version. All runs were performed on the
+same hardware with 256 threads. A major part of the speed increase in v5.0 is
+due to toolchain improvements for generating a more optimized binary.
+#### **HG002 15kb SQII, 41 GBases HiFi yield**
 
-1) Sequel System: 15 kb insert size, 24-hours movie, 37 GB raw yield, 2.3 GB HiFi UMY
-2) Sequel II System: 15 kb insert size, 30-hours movie, 340 GB raw yield, 24 GB HiFi UMY
+| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
+| :---------: | :--------: | :------: | :------: | :------: | :----: |
+|    4.0.0    | 2,765,431  | 13h 14m  | 89d 13h  |  71 GB   |        |
+|    4.2.0    | 2,806,886  | 10h 47m  |  61d 9h  |  72 GB   |  18%   |
+|    5.0.0    | 2,807,317  |  6h 44m  | 62d 22h  |  27 GB   |  37%   |
+|    6.0.0    | 2,831,192  |  5h 52m  | 44d 17h  |  20 GB   |  13%   |
 
-| CCS version | Sequel System | Sequel II System |
-| :-: | :-: | :-: |
-| ≤3.0.0 | 1 day | >1 week |
-| 3.4.1 | 3 hours | >1 day |
-| 4.0.0 | 40 minutes | 6 hours |
-| ≥4.2.0 | **30 minutes** | **4 hours** |
+#### **HG002 18kb SQII, 32 GBases HiFi yield**
+Omitting v4.0.0, due to lack of chemistry support.
+
+| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
+| :---------: | :--------: | :------: | :------: | :------: | :----: |
+|    4.2.0    | 1,823,016  |  8h 35m  | 47d 13h  |  80 GB   |        |
+|    5.0.0    | 1,824,206  |  5h 29m  | 50d 16h  |  46 GB   |  36%   |
+|    6.0.0    | 1,855,604  |  4h 36m  | 30d 13h  |  18 GB   |  15%   |
+
+#### **Redwood 25kb SQII, 32 GBases HiFi yield**
+
+| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
+| :---------: | :--------: | :------: | :------: | :------: | :----: |
+|    4.0.0    | 1,269,680  |  7h 58m  | 60d 19h  |  72 GB   |        |
+|    4.2.0    | 1,310,775  |  6h 37m  | 43d 18h  |  74 GB   |  17%   |
+|    5.0.0    | 1,311,693  |  4h 36m  | 41d 13h  |  41 GB   |  30%   |
+|    6.0.0    | 1,335,888  |  3h 56m  | 25d 11h  |  17 GB   |  14%   |
 
 ### How is CCS speed affected by raw base yield?
 Raw base yield is the sum of all polymerase read lengths.
@@ -39,14 +84,6 @@ ZMWs per SMRT Cell.
 Starting with version 3.3.0 _ccs_ scales linear in (2) the polymerase read length
 and with version 4.0.0 _ccs_ scales sublinear.
 
-### What did change in each version?
-
-| CCS version | O(insert size) | O(#passes) |
-| :-: | :-: | :-: |
-| ≤3.0.0 | quadratic | linear |
-| 3.4.1 | **linear** | linear |
-| ≥4.0.0 | linear | **sublinear** |
-
 ### How can version 4.0.0 be sublinear in the number of passes?
 With the introduction of improved heuristics, individual draft bases can skip
 polishing if they are of sufficient quality.
@@ -57,13 +94,13 @@ No, we optimized _ccs_ such that there is a good balance between speed and
 output quality.
 
 ## Does speed impact quality and yield?
-Yes it does. With ~35x speed improvements from version 3.1.0 to 4.0.0 and
-consequently reducing CPU time from >60,000 to <2,000 core hours,
-heuristics and changes in algorithms lead to slightly lower yield and
+Yes it does. With >150x speed improvements from version 3.0 to 6.0,
+heuristics and changes in algorithms lead to slightly different yield and
 accuracy if run head-to-head on the same data set. Internal tests show
-that _ccs_ 4.0.0 introduces no regressions in CCS-only Structural Variant
+that _ccs_ 6.0 introduces no regressions in _ccs_-only Structural Variant
 calling and has minimal impact on SNV and indel calling in DeepVariant.
-In contrast, lower DNA quality has a bigger impact on quality and yield.
+In contrast, lower DNA quality and sample preparation have a bigger impact
+on quality and yield.
 
 ## Can I tune performance without sacrificing output quality?
 The bioconda _ccs_ ≥v5.0 binaries statically link [mimalloc](https://github.com/microsoft/mimalloc).
diff --git a/docs/faq/reports-aux-files.md b/docs/faq/reports-aux-files.md
@@ -41,6 +41,28 @@ The following comments refer to the filters that are explained in the FAQ above.
 If run in `--by-strand` mode, rows may contain half ZMWs, as we account
 each strand as half a ZMW.
 
+### Coverage drops
+Example of a coverage drop in a single ZMW, subreads colored by strand orientation:
+
+<p align="center">
+  <img width="500px" src="../img/coveragedrop.png" />
+</p>
+
+During sequencing of the molecule, one strand exhibits 744 more bases than its
+reverse complemented strand. What happened?
+There is either a gain or a loss of information.
+An explanation for loss of information could be that a secondary structure,
+the 744 bp forming a hairpin, could affect the replication during PCR and lead
+to loss of bases.
+Gain of information could also happen during PCR, when the polymerase gets stuck
+and incorporates the current base too often.
+In this example, there is a homopolymer of 744 `A` bases.
+While it might be obvious to a human eye what happened,
+it's not the responsibility of _ccs_ to interpret and recover molecular damage.
+Even if there were a low-complexity filter for those regions, setting the
+appropriate threshold would be arbitrary;
+would a 10 bp homopolymer insertion be valid, while an 11 bp one gets discarded?
+
 ## How do I read the zmw_metrics.json file?
 Per default, each _ccs_ run generates a `<outputPrefix>.zmw_metrics.json.gz` file.
 Change file name with `--metrics-json`.
diff --git a/docs/how-does-ccs-work.md b/docs/how-does-ccs-work.md
@@ -35,22 +35,24 @@ To avoid improper mappings, short subreads are excluded.
 The polish stage iteratively improves upon a candidate template sequence.
 Because polishing is very compute intensive, it is desirable to start with a
 template that is as close as possible to the true sequence of the molecule to
-reduce the number of iterations until convergence.
-So, the _ccs_ software does not pick a full-length subread as the initial
-template to be polished, but instead generates an approximate draft consensus
-sequence using graph algorithms like [partial-order alignment](https://academic.oup.com/bioinformatics/article/18/3/452/236691) (POA)
-[consensus](https://academic.oup.com/bioinformatics/article/19/8/999/235258),
-employing an accelerated implementation called [SPOA](https://github.com/rvaser/spoa),
-or our own alignment graph consensus caller, called pbdagcon.
+reduce the number of iterations until convergence. The _ccs_ software does
+not pick a full-length subread as the initial template to be polished, but
+instead generates an approximate draft consensus sequence using our improved
+implementation of the [Sparc graph consensus algorithm](https://doi.org/10.7717/peerj.2016).
+This algorithm depends on a subread-to-backbone alignment that is generated
+by our own mapper [pancake](https://github.com/PacificBiosciences/pancake)
+using [edlib](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408825/) as the core
+aligner.
 Typically, subreads have accuracy of around 90% and the draft consensus has a
-higher accuracy, but depending on the algorithm employed is still below 99%.
+higher accuracy, but is still below 99%.
 
 <p align="center"><img width="1000px" src="img/draft.png"/></p>
 
 Stop if draft length is shorter than `--min-length` and longer than `--max-length`.
 
 ## 3. Alignment
-Align subreads to the draft consensus for downstream windowing and filtering.
+Align subreads to the draft consensus using pancake with
+[KSW2](https://github.com/lh3/ksw2) for downstream windowing and filtering.
 
 ## 4. Windowing
 Divide the the subread-to-draft alignment into overlapping windows with a target
diff --git a/docs/img/coveragedrop.png b/docs/img/coveragedrop.png
diff --git a/docs/img/run-design-kinetics.png b/docs/img/run-design-kinetics.png
diff --git a/docs/img/run-design-oiccs.png b/docs/img/run-design-oiccs.png
diff --git a/docs/index.md b/docs/index.md
@@ -31,7 +31,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
 for information on Installation, Support, License, Copyright, and Disclaimer.
 
 ## Latest Version
-Version **5.0.0**: [Full changelog here](/changelog)
+Version **6.0.0**: [Full changelog here](/changelog)
 
 ## What's new!
 _ccs_ is now running on the Sequel IIe instrument, transferring HiFi reads