-
Notifications
You must be signed in to change notification settings - Fork 458
Description
TODO: Fix the 28-bit limit on CIGAR length size in the internal memory structures.
While we have no immediate plans to support reads several Gb long, we do support references this long now.
However exome or tRNA style alignments (eg tophat) can produce long reference alignments by the addition of huge N ref-skip ops. Such alignments would be rare to be over 1Gb in genuine cases, but there is always room for badly mapped data and errors, plus there is potential for this to happen with circular genomes.
An example:
@SQ SN:LONG LN:8000000000
SRR065390.14978392 16 LONG 2 1 27M7000000000N73M * 0 0 CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
Samtools currently states:
$ samtools view ~/tmp/long.sam
[E::sam_parse1] numeric value out of allowed range
[W::sam_read1] Parse error at line 2
[main_samview] truncated file.
We can try chopping it up, but even 1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N1000000000N doesn't work due to the 28-bit size limit in BAM's cigar handling. The internals of htslib shouldn't be so wedded to the BAM on disk format and ought to be using more than 28-bits. The SAM parser at least can swallow this if we go below 28-bits. eg:
@SQ SN:LONG LN:8000000000
SRR065390.14978392 16 LONG 2 1 27M200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N200000000N73M * 0 0 CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
It can be written to BAM this way. Region queries pull this out OK. Also samtools mpileup -r LONG:7000000000-8000000000 ~/tmp/long3.bam works (after a suitably long pregnant pause).
There are issues with the CRAM decoder though:
$ samtools view ~/tmp/long3.cram
SRR065390.14978392 16 LONG 2 1 27M20678144N73M * 0 0 CCTAGCCCTAACCCTAACCCTAACCCTAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAAGCCTAA #############################@B?8B?BA@@DDBCDDCBC@CDCDCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
I think this is actually the same bug in a different guise. CRAM is combining multiple neighbouring op codes together, but in doing so is overflowing the internal 28-bit limit.