
Commit 178f15a

CCS 6.0.0
1 parent 3da9e7c commit 178f15a

12 files changed

+138
-44
lines changed

docs/changelog.md

Lines changed: 8 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -6,7 +6,14 @@ nav_order: 99
66

77
# Version changelog
88

9-
**5.0.0**
9+
**6.0.0**
10+
* Increase number of HiFi reads
11+
* Increase percentage of barcode yield
12+
* Run time, CPU time, and peak RSS improvements
13+
* Change main draft algorithm from pbdagcon to sparc
14+
* Replace minimap2 with pancake and edlib/KSW2
15+
16+
**5.0.0**
1017
* SMRT Link v10.0 release
1118
* Add `--hifi-kinetics` to average kinetic information for polished reads
1219
* Add `--all-kinetics` to add kinetic information for all ZMWs, except for unpolished draft consensus

docs/faq/bioconda-binary.md

Lines changed: 2 additions & 2 deletions
@@ -16,10 +16,10 @@ A modern (post-2008) CPU with support for
1616
SMRT Link also has this requirement.
1717

1818
**`FATAL: kernel too old`** Your OS or rather your kernel version is not supported.
19-
Since CCS v4.2 we also ship a second binary via bioconda `ccs-alt`, which does
19+
Since _ccs_ v4.2 we also ship a second binary via bioconda `ccs-alt`, which does
2020
not bundle a newer `glibc`. Please use this alternative binary.
2121

22-
For CCS v5.0, we offer two binaries in bioconda:
22+
For _ccs_, we offer two binaries in bioconda:
2323

2424
* `ccs`, statically links `glibc` v2.32 and `mimalloc` v1.3.0.
2525
* `ccs-alt`, was built by dynamically linking `glibc` v2.12, but statically links `mimalloc` v1.3.0.

docs/faq/chemistry.md

Lines changed: 26 additions & 0 deletions
@@ -3,6 +3,32 @@ layout: default
33
parent: FAQ
44
title: Chemistry
55
---
6+
## Supported chemistries
7+
The latest _ccs_ v6 supports the following combinations of binding and
8+
sequencing kit part numbers:
9+
10+
| BindingKit | SequencingKit | Chemistry | System |
11+
| :---------: | :-----------: | :--------------: | :-------: |
12+
| 101-500-400 | 101-427-500 | S/P3-C3/5.0 | Sequel |
13+
| 101-500-400 | 101-427-800 | S/P3-C3/5.0 | Sequel |
14+
| 101-500-400 | 101-646-800 | S/P3-C3/5.0 | Sequel |
15+
| 101-490-800 | 101-490-900 | S/P3-C1/5.0-8M | Sequel II |
16+
| 101-490-800 | 101-491-000 | S/P3-C1/5.0-8M | Sequel II |
17+
| 101-490-800 | 101-644-500 | S/P3-C1/5.0-8M | Sequel II |
18+
| 101-490-800 | 101-717-100 | S/P3-C1/5.0-8M | Sequel II |
19+
| 101-717-300 | 101-644-500 | S/P3-C1/5.0-8M | Sequel II |
20+
| 101-717-300 | 101-717-100 | S/P3-C1/5.0-8M | Sequel II |
21+
| 101-717-400 | 101-644-500 | S/P3-C1/5.0-8M | Sequel II |
22+
| 101-717-400 | 101-717-100 | S/P3-C1/5.0-8M | Sequel II |
23+
| 101-789-500 | 101-789-300 | S/P4-C2/5.0-8M | Sequel II |
24+
| 101-820-500 | 101-789-300 | S/P4.1-C2/5.0-8M | Sequel II |
25+
| 101-789-500 | 101-826-100 | S/P4-C2/5.0-8M | Sequel II |
26+
| 101-789-500 | 101-820-300 | S/P4-C2/5.0-8M | Sequel II |
27+
| 101-820-500 | 101-826-100 | S/P4.1-C2/5.0-8M | Sequel II |
28+
| 101-820-500 | 101-820-300 | S/P4.1-C2/5.0-8M | Sequel II |
29+
| 101-894-200 | 101-826-100 | S/P5-C2/5.0-8M | Sequel II |
30+
| 101-894-200 | 101-789-300 | S/P5-C2/5.0-8M | Sequel II |
31+
| 101-894-200 | 101-820-300 | S/P5-C2/5.0-8M | Sequel II |
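The table above is effectively a lookup from a kit pair to a chemistry. A minimal Python sketch of such a lookup follows; the dictionary holds only a few rows copied from the table, and the names `SUPPORTED` and `lookup_chemistry` are hypothetical helpers for illustration, not part of _ccs_ or how it stores this information internally.

```python
from typing import Optional, Tuple

# Illustrative subset of the table above:
# (BindingKit, SequencingKit) -> (Chemistry, System).
# Not the full list, and not the internal representation used by ccs.
SUPPORTED = {
    ("101-500-400", "101-427-500"): ("S/P3-C3/5.0", "Sequel"),
    ("101-490-800", "101-490-900"): ("S/P3-C1/5.0-8M", "Sequel II"),
    ("101-789-500", "101-789-300"): ("S/P4-C2/5.0-8M", "Sequel II"),
    ("101-820-500", "101-789-300"): ("S/P4.1-C2/5.0-8M", "Sequel II"),
    ("101-894-200", "101-826-100"): ("S/P5-C2/5.0-8M", "Sequel II"),
}


def lookup_chemistry(binding_kit: str, sequencing_kit: str) -> Optional[Tuple[str, str]]:
    """Return (chemistry, system) for a kit pair, or None if the pair is not listed."""
    return SUPPORTED.get((binding_kit, sequencing_kit))


if __name__ == "__main__":
    # A listed pair and an unlisted (made-up) pair.
    print(lookup_chemistry("101-894-200", "101-826-100"))
    print(lookup_chemistry("000-000-000", "000-000-000"))
```

Note that the pair matters, not each part number on its own: the same sequencing kit appears with several binding kits and maps to different chemistries.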
632

733
## Help! I am getting "Unsupported ..."!
834
If you encounter the error `Unsupported chemistries found: (...)` or

docs/faq/low-complexity.md

Lines changed: 2 additions & 2 deletions
@@ -4,7 +4,7 @@ parent: FAQ
44
title: Low complexity
55
---
66

7-
## Does CCS dislike low-complexity regions?
7+
## Does _ccs_ dislike low-complexity regions?
88
Low-complexity comes in many shapes and forms.
99
A particular challenge for _ccs_ are highly enriched tandem repeats, like
1010
hundreds of copies of `AGGGGT`.
@@ -13,7 +13,7 @@ a consensus sequence.
1313
Since _ccs_ v5.0, every ZMW is tested if it contains a tandem repeat
1414
of length `--min-tandem-repeat-length 1000`.
1515
For this, we use [symmetric DUST](https://doi.org/10.1089/cmb.2006.13.1028)
16-
and in particular this [sdust](https://github.com/lh3/sdust) implementation,
16+
and in particular the [sdust](https://github.com/lh3/sdust) implementation,
1717
but slightly modified.
1818
If a ZMW is flagged as a tandem repeat, internally `--disable-heuristics`
1919
is activated for only this ZMW, and various filters that are known to exclude

docs/faq/mode-all.md

Lines changed: 1 addition & 1 deletion
@@ -11,7 +11,7 @@ Similar to the CLR instrument mode, in which subreads are accompanied by
1111
a scraps file, _ccs_ offers a new mode to never lose a single read due to
1212
filtering, without massive run time increase by polishing low-pass productive ZMWs.
1313

14-
Starting with SMRT Link v10.0 and Sequel IIe, _ccs_ v5.0 is able to generate
14+
Starting with SMRT Link v10.0 and Sequel IIe, _ccs_ v5.0 or newer is able to generate
1515
one representative sequence per productive ZMW, irrespective of quality and passes.
1616
This ensures no yield loss due to filtering and enables users to have maximum
1717
control over their data. Never fear again that SMRT Link or the Sequel IIe

docs/faq/performance.md

Lines changed: 65 additions & 28 deletions
@@ -4,26 +4,71 @@ parent: FAQ
44
title: Performance
55
---
66

7-
## How fast is CCS?
8-
We tested CCS runtime using 500 ZMWs per length bin with exactly 7 passes.
7+
## How fast is _ccs_?
8+
### Latest version
9+
The latest _ccs_ v6 can process 200 GBases HiFi yield in 24 hours for a 25 KBases
10+
library on 2x64 cores at 2.4 GHz.
11+
To put this into perspective for actual sequencing collections:
12+
13+
| Sample | Insert size | HiFi Yield | Run Time |
14+
| :------: | :---------: | :---------: | :------: |
15+
| HG002 | 15 KBases | 41.1 GBases | 5h 52m |
16+
| HG002 | 18 KBases | 34.0 GBases | 4h 36m |
17+
| Redwood  | 25 KBases | 32.4 GBases | 3h 46m |
18+
19+
### Relative performance v3.0 to v6.0
20+
Current _ccs_ v6 achieves a >150x speed-up for 20 KBases inserts compared to
21+
v3.0 from SMRT Link 6.0 release in 2018.
22+
23+
### Algorithmic complexity
24+
To understand how this performance gain was possible, here is an overview of how we changed
25+
the algorithmic complexity and how _ccs_ scales with insert size and number of passes:
26+
27+
| CCS version | O(insert size) | O(#passes) |
28+
| :---------: | :------------: | :-----------: |
29+
| ≤3.0.0 | quadratic | linear |
30+
| 3.4.1 | **linear** | linear |
31+
| ≥4.0.0 | linear | **sublinear** |
32+
33+
To visualize this table, we benchmarked runtime using 500 ZMWs per length bin with
34+
exactly 7 passes.
935

1036
<img width="1000px" src="../img/runtime.png"/>
1137

12-
### How does that translate into time to result per SMRT Cell?
13-
We will measure time to result for Sequel System and Sequel System II CCS sequencing collections
14-
on a PacBio recommended HPC, according to the
15-
[Sequel II System Compute Requirements](https://www.pacb.com/wp-content/uploads/SMRT_Link_Installation_v701.pdf)
16-
with 192 physical or 384 hyper-threaded cores.
38+
After v4.0.0, the slope of the curve does not change, as the complexity class
39+
hasn't changed; only improvements independent of input type were made.
40+
41+
### Performance comparisons
42+
Performance comparisons on different libraries; the `faster` column is with
43+
respect to the run time of the previous version. All runs were performed on the
44+
same hardware with 256 threads. A major part of the speed increase in v5.0 is
45+
due to toolchain improvements for generating a more optimized binary.
46+
#### **HG002 15kb SQII, 41 GBases HiFi yield**
1747

18-
1) Sequel System: 15 kb insert size, 24-hours movie, 37 GB raw yield, 2.3 GB HiFi UMY
19-
2) Sequel II System: 15 kb insert size, 30-hours movie, 340 GB raw yield, 24 GB HiFi UMY
48+
| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
49+
| :---------: | :--------: | :------: | :------: | :------: | :----: |
50+
| 4.0.0 | 2,765,431 | 13h 14m | 89d 13h | 71 GB | |
51+
| 4.2.0 | 2,806,886 | 10h 47m | 61d 9h | 72 GB | 18% |
52+
| 5.0.0 | 2,807,317 | 6h 44m | 62d 22h | 27 GB | 37% |
53+
| 6.0.0 | 2,831,192 | 5h 52m | 44d 17h | 20 GB | 13% |
2054

21-
| CCS version | Sequel System | Sequel II System |
22-
| :-: | :-: | :-: |
23-
| ≤3.0.0 | 1 day | >1 week |
24-
| 3.4.1 | 3 hours | >1 day |
25-
| 4.0.0 | 40 minutes | 6 hours |
26-
| ≥4.2.0 | **30 minutes** | **4 hours** |
55+
#### **HG002 18kb SQII, 32 GBases HiFi yield**
56+
Omitting v4.0.0, due to lack of chemistry support.
57+
58+
| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
59+
| :---------: | :--------: | :------: | :------: | :------: | :----: |
60+
| 4.2.0 | 1,823,016 | 8h 35m | 47d 13h | 80 GB | |
61+
| 5.0.0 | 1,824,206 | 5h 29m | 50d 16h | 46 GB | 36% |
62+
| 6.0.0 | 1,855,604 | 4h 36m | 30d 13h | 18 GB | 15% |
63+
64+
#### **Redwood 25kb SQII, 32 GBases HiFi yield**
65+
66+
| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
67+
| :---------: | :--------: | :------: | :------: | :------: | :----: |
68+
| 4.0.0 | 1,269,680 | 7h 58m | 60d 19h | 72 GB | |
69+
| 4.2.0 | 1,310,775 | 6h 37m | 43d 18h | 74 GB | 17% |
70+
| 5.0.0 | 1,311,693 | 4h 36m | 41d 13h | 41 GB | 30% |
71+
| 6.0.0 | 1,335,888 | 3h 56m | 25d 11h | 17 GB | 14% |
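One plausible reading of the `Faster` column is the relative reduction in run time versus the previous version. The Python sketch below computes that from the `Run Time` strings of the HG002 15kb table; the helper names are hypothetical, and the published percentages may have been rounded or derived slightly differently, so small deviations from the tables are expected.

```python
def parse_duration_minutes(text: str) -> int:
    """Convert a duration such as '13h 14m' or '44d 17h' into minutes."""
    units = {"d": 24 * 60, "h": 60, "m": 1}
    return sum(int(token[:-1]) * units[token[-1]] for token in text.split())


def percent_faster(previous: str, current: str) -> float:
    """Relative run time reduction of `current` versus `previous`, in percent."""
    prev = parse_duration_minutes(previous)
    cur = parse_duration_minutes(current)
    return 100.0 * (prev - cur) / prev


if __name__ == "__main__":
    # Run times copied from the HG002 15kb table above.
    run_times = [
        ("4.0.0", "13h 14m"),
        ("4.2.0", "10h 47m"),
        ("5.0.0", "6h 44m"),
        ("6.0.0", "5h 52m"),
    ]
    for (_, prev), (version, cur) in zip(run_times, run_times[1:]):
        print(f"{version}: {percent_faster(prev, cur):.1f}% shorter run time")
```

The same parser handles the `CPU Time` column, since it accepts day, hour, and minute tokens.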
2772

2873
### How is CCS speed affected by raw base yield?
2974
Raw base yield is the sum of all polymerase read lengths.
@@ -39,14 +84,6 @@ ZMWs per SMRT Cell.
3984
Starting with version 3.3.0 _ccs_ scales linearly in (2) the polymerase read length
4085
and with version 4.0.0 _ccs_ scales sublinearly.
4186

42-
### What did change in each version?
43-
44-
| CCS version | O(insert size) | O(#passes) |
45-
| :-: | :-: | :-: |
46-
| ≤3.0.0 | quadratic | linear |
47-
| 3.4.1 | **linear** | linear |
48-
| ≥4.0.0 | linear | **sublinear** |
49-
5087
### How can version 4.0.0 be sublinear in the number of passes?
5188
With the introduction of improved heuristics, individual draft bases can skip
5289
polishing if they are of sufficient quality.
@@ -57,13 +94,13 @@ No, we optimized _ccs_ such that there is a good balance between speed and
5794
output quality.
5895

5996
## Does speed impact quality and yield?
60-
Yes it does. With ~35x speed improvements from version 3.1.0 to 4.0.0 and
61-
consequently reducing CPU time from >60,000 to <2,000 core hours,
62-
heuristics and changes in algorithms lead to slightly lower yield and
97+
Yes it does. With >150x speed improvements from version 3.0 to 6.0,
98+
heuristics and changes in algorithms lead to slightly different yield and
6399
accuracy if run head-to-head on the same data set. Internal tests show
64-
that _ccs_ 4.0.0 introduces no regressions in CCS-only Structural Variant
100+
that _ccs_ 6.0 introduces no regressions in _ccs_-only Structural Variant
65101
calling and has minimal impact on SNV and indel calling in DeepVariant.
66-
In contrast, lower DNA quality has a bigger impact on quality and yield.
102+
In contrast, lower DNA quality and sample preparation have a bigger impact
103+
on quality and yield.
67104

68105
## Can I tune performance without sacrificing output quality?
69106
The bioconda _ccs_ ≥v5.0 binaries statically link [mimalloc](https://github.com/microsoft/mimalloc).

docs/faq/reports-aux-files.md

Lines changed: 22 additions & 0 deletions
@@ -41,6 +41,28 @@ The following comments refer to the filters that are explained in the FAQ above.
4141
If run in `--by-strand` mode, rows may contain half ZMWs, as we account
4242
each strand as half a ZMW.
4343

44+
### Coverage drops
45+
Example of a coverage drop in a single ZMW, subreads colored by strand orientation:
46+
47+
<p align="center">
48+
<img width="500px" src="../img/coveragedrop.png" />
49+
</p>
50+
51+
During sequencing of the molecule, one strand exhibits 744 more bases than its
52+
reverse complemented strand. What happened?
53+
Either there is a gain or loss of information.
54+
An explanation for loss of information could be that a secondary structure,
55+
the 744 bp forming a hairpin, could affect the replication during PCR and lead
56+
to loss of bases.
57+
Gain of information could also happen during PCR, when the polymerase gets stuck
58+
and incorporates the current base too often.
59+
In this example, there is a homopolymer of 744 `A` bases.
60+
While it might be obvious to a human eye what happened,
61+
it's not the responsibility of _ccs_ to interpret and recover molecular damage.
62+
Even if there were a low-complexity filter for those regions, setting the
63+
appropriate threshold would be arbitrary;
64+
would a 10 bp homopolymer insertion be valid, while an 11 bp one gets discarded?
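To make the threshold argument concrete, here is a small Python sketch of a naive homopolymer filter. It is purely hypothetical: _ccs_ has no such filter, and the function names and the `max_run` parameter are invented for illustration.

```python
from itertools import groupby
from typing import Tuple


def longest_homopolymer(seq: str) -> Tuple[str, int]:
    """Return (base, length) of the longest run of identical bases; ('', 0) if empty."""
    best_base, best_len = "", 0
    for base, run in groupby(seq):
        run_len = sum(1 for _ in run)
        if run_len > best_len:
            best_base, best_len = base, run_len
    return best_base, best_len


def passes_filter(seq: str, max_run: int) -> bool:
    """Hypothetical filter: keep a sequence only if no homopolymer exceeds max_run."""
    return longest_homopolymer(seq)[1] <= max_run


if __name__ == "__main__":
    ten = "ACGT" + "A" * 10 + "CGT"
    eleven = "ACGT" + "A" * 11 + "CGT"
    # With max_run=10 the two nearly identical sequences get opposite verdicts.
    print(passes_filter(ten, max_run=10), passes_filter(eleven, max_run=10))
```

Two sequences that differ by a single base end up on opposite sides of the cutoff, which is the arbitrariness described above.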
65+
4466
## How do I read the zmw_metrics.json file?
4567
Per default, each _ccs_ run generates a `<outputPrefix>.zmw_metrics.json.gz` file.
4668
Change file name with `--metrics-json`.
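Because the file is gzip-compressed JSON, it can be opened with standard tooling. The Python sketch below only decompresses and parses it; it makes no assumption about the keys inside, and `load_metrics` is a hypothetical helper name.

```python
import gzip
import json


def load_metrics(path: str):
    """Decompress and parse a gzip-compressed JSON file such as <outputPrefix>.zmw_metrics.json.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Example usage (the path is a placeholder for your own output prefix):
#   metrics = load_metrics("myprefix.zmw_metrics.json.gz")
#   print(type(metrics))
```

Inspecting the type and top-level structure of the returned object first is a reasonable way to get oriented before relying on specific fields.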

docs/how-does-ccs-work.md

Lines changed: 11 additions & 9 deletions
@@ -35,22 +35,24 @@ To avoid improper mappings, short subreads are excluded.
3535
The polish stage iteratively improves upon a candidate template sequence.
3636
Because polishing is very compute intensive, it is desirable to start with a
3737
template that is as close as possible to the true sequence of the molecule to
38-
reduce the number of iterations until convergence.
39-
So, the _ccs_ software does not pick a full-length subread as the initial
40-
template to be polished, but instead generates an approximate draft consensus
41-
sequence using graph algorithms like [partial-order alignment](https://academic.oup.com/bioinformatics/article/18/3/452/236691) (POA)
42-
[consensus](https://academic.oup.com/bioinformatics/article/19/8/999/235258),
43-
employing an accelerated implementation called [SPOA](https://github.com/rvaser/spoa),
44-
or our own alignment graph consensus caller, called pbdagcon.
38+
reduce the number of iterations until convergence. The _ccs_ software does
39+
not pick a full-length subread as the initial template to be polished, but
40+
instead generates an approximate draft consensus sequence using our improved
41+
implementation of the [Sparc graph consensus algorithm](https://doi.org/10.7717/peerj.2016).
42+
This algorithm depends on a subread to backbone alignment that is generated
43+
by our own mapper [pancake](https://github.com/PacificBiosciences/pancake)
44+
using [edlib](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408825/) as the core
45+
aligner.
4546
Typically, subreads have accuracy of around 90% and the draft consensus has a
46-
higher accuracy, but depending on the algorithm employed is still below 99%.
47+
higher accuracy, but is still below 99%.
4748

4849
<p align="center"><img width="1000px" src="img/draft.png"/></p>
4950

5051
Stop if draft length is shorter than `--min-length` or longer than `--max-length`.
5152

5253
## 3. Alignment
53-
Align subreads to the draft consensus for downstream windowing and filtering.
54+
Align subreads to the draft consensus using pancake with
55+
[KSW2](https://github.com/lh3/ksw2) for downstream windowing and filtering.
5456

5557
## 4. Windowing
5658
Divide the subread-to-draft alignment into overlapping windows with a target

docs/img/coveragedrop.png

27.6 KB

docs/img/run-design-kinetics.png

29.8 KB
