Dan Katzel edited this page Jul 12, 2017 · 2 revisions

Jillion

Jillion, the Java Informatics Large Library for Genomics, is an open source genomics software library written in Java to support bioinformatics. It was created by a single software engineer at the J. Craig Venter Institute (JCVI) and has been used by several projects, including the Influenza Genome Project, the Leptospira Genome Project and the Human Microbiome Project, as well as in over 20,000 viral whole and draft genome submissions to GenBank.

In September 2015, the Viral Pathogen Database and Analysis Resource (ViPR) added a new Rotavirus Genotype Detection Tool that is written using Jillion and is over an order of magnitude faster than other similar web tools.

How Is Jillion Different from BioJava and Picard?

BioJava and Picard are other Java libraries for bioinformatics that are similar to Jillion. Each of these libraries supports some common bioinformatics read formats such as FASTA and FASTQ, but there the similarities end. BioJava focuses mainly on input reads and genome annotation, whereas Jillion focuses on genome assembly. Picard focuses mainly on SAM alignment data. Jillion supports not only input reads and alignments but also has object representations of contigs, as well as parsers and writers for many common assembly file formats such as SAM/BAM and Consed's ACE format.

Sequence Support

Like BioJava, Jillion can handle various read input formats such as fasta, fastq, and scf encoded files, but Jillion can also natively handle other formats such as sff, ztr and abi chromatograms. Sequence objects also have different implementations depending on the use case and type of data. For example, a NucleotideSequence object that contains only the nucleotides A, C, G and T could store each nucleotide in 2 bits. A different implementation that stores each nucleotide in 4 bits would be used if the sequence contained ambiguous bases. Since quality sequences often have consecutive quality scores of the same value, a run-length encoded implementation can compactly store the qualities of reads, or even of contig consensus sequences, in only a few bytes.
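The two compact encodings described above can be sketched in plain Java. This is a minimal illustration, not Jillion's actual implementation: the class name `CompactSequenceSketch` and all of its methods are hypothetical, and the bit layout (four bases per byte, lowest bits first) is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative sketch only: hypothetical names, not part of Jillion's API. */
final class CompactSequenceSketch {

    // The index of a base in this string is its 2-bit code (assumed layout).
    private static final String BASES = "ACGT";

    /** Packs a sequence containing only A, C, G and T into 2 bits per base. */
    static byte[] packTwoBit(String seq) {
        byte[] packed = new byte[(seq.length() + 3) / 4]; // 4 bases per byte
        for (int i = 0; i < seq.length(); i++) {
            int code = BASES.indexOf(seq.charAt(i));
            if (code < 0) {
                // Ambiguous bases would need a wider (e.g. 4-bit) encoding.
                throw new IllegalArgumentException("not A, C, G or T: " + seq.charAt(i));
            }
            packed[i / 4] |= code << ((i % 4) * 2);
        }
        return packed;
    }

    /** Reverses packTwoBit; the base count is stored separately from the bytes. */
    static String unpackTwoBit(byte[] packed, int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            int code = (packed[i / 4] >> ((i % 4) * 2)) & 0x3;
            sb.append(BASES.charAt(code));
        }
        return sb.toString();
    }

    /** Run-length encodes quality scores as {value, runLength} pairs. */
    static List<int[]> runLengthEncode(int[] quals) {
        List<int[]> runs = new ArrayList<>();
        for (int q : quals) {
            int[] last = runs.isEmpty() ? null : runs.get(runs.size() - 1);
            if (last != null && last[0] == q) {
                last[1]++; // extend the current run
            } else {
                runs.add(new int[] {q, 1}); // start a new run
            }
        }
        return runs;
    }

    /** Expands {value, runLength} pairs back into the full quality array. */
    static int[] runLengthDecode(List<int[]> runs) {
        int total = 0;
        for (int[] run : runs) {
            total += run[1];
        }
        int[] out = new int[total];
        int pos = 0;
        for (int[] run : runs) {
            Arrays.fill(out, pos, pos + run[1], run[0]);
            pos += run[1];
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] packed = packTwoBit("ACGTACGTAC");
        System.out.println(packed.length + " bytes for 10 bases");
        System.out.println(unpackTwoBit(packed, 10));
        System.out.println(runLengthEncode(new int[] {40, 40, 40, 40, 30, 30, 12}).size() + " runs");
    }
}
```

In this sketch a consensus whose qualities are almost all the same value collapses to a handful of runs, which is the effect the paragraph above describes.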

Sanger chromatogram format read and write support, compared across BioJava, Picard and Jillion:

Format Version BioJava Read BioJava Write Picard Read Picard Write Jillion Read Jillion Write
Abi X
Ztr 1.2 X X X X
Scf 2 X X X
Scf 3 X X X

All the popular bioinformatics libraries can read and write fasta and fastq files, but only Jillion supports sff files. Jillion has been tested on sff files produced by 454 and Ion Torrent:

Format Encoding BioJava Read BioJava Write Picard Read Picard Write Jillion Read Jillion Write
Fasta nucleotide X
Fasta protein X X
Fasta qualities X X X X
Fasta positions X X X X
Fasta index (fai) nucleotide X X
Fasta index (fai) protein X X
Fastq sanger/solexa/illumina Sanger only
sff X X X X
bfa (MAQ binary fasta) X X
bfq (MAQ binary fastq) X X

Assembly Support

Jillion has objects that represent contigs in the formats produced by several assembly and alignment programs used internally at JCVI, including SAM and BAM alignment files, Phrap/Consed .ace files, Celera Assembler .asm files and CLC Bio Assembly Cell .cas files, among others. Each contig object not only has the contig consensus sequence but also includes all the underlying read information. Coupled with support for all the various read formats, this makes it possible to analyze, edit and write out new assembly files. Even though all the underlying read data is stored for each contig, memory usage is kept low. Nucleotide sequence objects for reads that have been assembled into a contig can be encoded to store only a pointer to the contig consensus sequence, the read's start offset into the consensus, and any positions where the aligned read differs from the consensus. This greatly reduces the memory needed to store the underlying contig data, since most reads in an assembly have a high identity to the consensus sequence and therefore few differences.
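The reference-based encoding described above can be illustrated with a short sketch. This is not Jillion's actual class: the name `ReferenceEncodedReadSketch` and its methods are hypothetical, and the example assumes the read and the consensus are already gapped so that they align position by position.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative sketch only: hypothetical names, not part of Jillion's API. */
final class ReferenceEncodedReadSketch {

    private final String consensus;  // shared reference; every read points at the same object
    private final int startOffset;   // 0-based offset of the read's first base in the consensus
    private final int length;        // number of aligned bases in the read
    private final Map<Integer, Character> differences; // read position -> differing base

    ReferenceEncodedReadSketch(String consensus, int startOffset, String alignedRead) {
        if (startOffset < 0 || startOffset + alignedRead.length() > consensus.length()) {
            throw new IllegalArgumentException("read does not fit inside the consensus");
        }
        this.consensus = consensus;
        this.startOffset = startOffset;
        this.length = alignedRead.length();
        Map<Integer, Character> diffs = new TreeMap<>();
        for (int i = 0; i < length; i++) {
            char readBase = alignedRead.charAt(i);
            // Only bases that disagree with the consensus are stored.
            if (readBase != consensus.charAt(startOffset + i)) {
                diffs.put(i, readBase);
            }
        }
        this.differences = diffs;
    }

    /** Returns the read's base at position i, falling back to the consensus. */
    char baseAt(int i) {
        if (i < 0 || i >= length) {
            throw new IndexOutOfBoundsException("position " + i);
        }
        Character diff = differences.get(i);
        return diff != null ? diff : consensus.charAt(startOffset + i);
    }

    /** Rebuilds the full aligned read from the consensus plus the stored differences. */
    String decode() {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append(baseAt(i));
        }
        return sb.toString();
    }

    int numberOfDifferences() {
        return differences.size();
    }

    public static void main(String[] args) {
        String consensus = "ACGTACGTACGT";
        ReferenceEncodedReadSketch read =
                new ReferenceEncodedReadSketch(consensus, 2, "GTACCTAC");
        System.out.println(read.decode() + " stored with "
                + read.numberOfDifferences() + " difference(s)");
    }
}
```

In this sketch the per-read cost is one reference, two integers and one map entry per mismatch, so a read that matches the consensus exactly stores no bases at all.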

Unlike BioJava and Picard, Jillion can read and write several different assembly output formats. The Jillion contig objects include the consensus sequence as well as all the underlying sequence read data.

Format BioJava Read BioJava Write Picard Read Picard Write Jillion Read Jillion Write
sam X X
bam X X
Phrap/Consed .ace X X X X
Celera .asm X X
CLC Bio .cas X X
TIGR .contig X X X X
TIGR .tasm X X X X

Funding

This work has been funded in whole or in part with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services under contract numbers HHSN272200900007C and U19AI110819.
