Dan Katzel edited this page Jul 15, 2017 · 2 revisions

Working With Fastq Data

Jillion can read, write and modify fastq data and files in all encoding formats.

Quality Codecs

Jillion currently supports the 3 main ways to encode quality data. The enum FastqQualityCodec encapsulates all the algorithms to encode and decode each format.

public enum FastqQualityCodec {
    SANGER,
    //supports Illumina 1.3+
    ILLUMINA,
    //uses non-phred quality scale
    SOLEXA
}

FastqQualityCodec is only used when reading or writing fastq data. Once the Jillion object representation of the fastq reads has been created, the quality data is fetched as Phred quality scores.
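To make the difference between the three encodings concrete, here is a minimal sketch in plain Java of how one encoded quality character maps to a Phred score. It is not Jillion's FastqQualityCodec implementation; the class QualityDecodeSketch and its Encoding enum are hypothetical names used only for illustration. It assumes the commonly documented conventions: Sanger stores Phred + 33, Illumina 1.3+ stores Phred + 64, and Solexa stores an odds-based score + 64 that must be converted to the Phred scale.

```java
// Illustrative sketch only; not Jillion's FastqQualityCodec implementation.
final class QualityDecodeSketch {

    /** Hypothetical stand-in for the three encodings described above. */
    enum Encoding { SANGER, ILLUMINA, SOLEXA }

    /** Decodes one encoded quality character to a Phred score. */
    static int toPhred(char encoded, Encoding encoding) {
        switch (encoding) {
            case SANGER:
                return encoded - 33;       // Phred score stored with ASCII offset 33
            case ILLUMINA:
                return encoded - 64;       // Phred score stored with ASCII offset 64
            case SOLEXA: {
                int solexa = encoded - 64; // Solexa score stored with ASCII offset 64
                // Convert the odds-based Solexa scale to the Phred scale.
                double phred = 10 * Math.log10(Math.pow(10, solexa / 10.0) + 1);
                return (int) Math.round(phred);
            }
            default:
                throw new IllegalArgumentException("unknown encoding: " + encoding);
        }
    }

    public static void main(String[] args) {
        System.out.println(toPhred('I', Encoding.SANGER));   // prints 40
        System.out.println(toPhred('h', Encoding.ILLUMINA)); // prints 40
        System.out.println(toPhred(';', Encoding.SOLEXA));   // prints 1
    }
}
```

The last call shows why Solexa needs its own codec: the character ';' encodes a Solexa score of -5, which has no direct Phred equivalent and only becomes a Phred score after conversion.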

FastqRecord

The FastqRecord interface represents an individual "read" in a fastq file. A FastqRecord is independent of the quality format, so users are free to mix and match FastqRecord objects that came from differently encoded fastq files.

public interface FastqRecord extends Trace {

    String getId();

    String getComment();

    NucleotideSequence getNucleotideSequence();

    QualitySequence getQualitySequence();

    long getLength();

}

Jillion 5 added a new method getLength() which returns the number of bases in the FastqRecord. It will return the same value as fastqRecord.getNucleotideSequence().getLength() or fastqRecord.getQualitySequence().getLength().

Reading Fastq Files

FastqParser

The FastqParser handles the actual parsing of fastq input data from either files or InputStreams. It can be configured to support multi-line sequence records and comments on fastq deflines. By default, both options are turned off, since the majority of fastq records use only 1 line per sequence and have no comments. Parsing is also faster if the parser doesn't have to look for comments or read a variable number of lines per record.

To create a parser with the default options of a normal 4 line per record fastq with no comments:

FastqParser parser = FastqFileParser.create(file);

FastqParser parser = FastqFileParser.create(inputstream);

If the file extension ends in .zip or .gz then Jillion can automatically unzip it accordingly.

To make a configurable parser object, use the FastqFileParserBuilder. The builder has several configuration methods to turn various parsing options on or off.

FastqParser parser = new FastqFileParserBuilder(file)
                         .hasComments(true)
                         .hasMultilineSequences(true)
                         .build();
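To show why these two options cost extra parsing work, here is a minimal sketch in plain Java of reading one multi-line record with an optional comment. It is not Jillion's parser; MultilineFastqSketch and readRecord are hypothetical names. The sketch assumes that the comment is whatever follows the first space on the defline and that sequences are non-empty.

```java
import java.util.Arrays;
import java.util.Iterator;

// Illustrative sketch only; not how Jillion's FastqParser is implemented.
final class MultilineFastqSketch {

    /** Reads one record and returns {id, comment or null, bases, encoded qualities}. */
    static String[] readRecord(Iterator<String> lines) {
        String defline = lines.next();
        if (!defline.startsWith("@")) {
            throw new IllegalArgumentException("expected '@' defline but was: " + defline);
        }
        // Assumption: the id ends at the first space; the rest is the comment.
        String header = defline.substring(1);
        int space = header.indexOf(' ');
        String id = space < 0 ? header : header.substring(0, space);
        String comment = space < 0 ? null : header.substring(space + 1);

        // Sequence lines continue until the '+' separator line.
        StringBuilder bases = new StringBuilder();
        String line = lines.next();
        while (!line.startsWith("+")) {
            bases.append(line);
            line = lines.next();
        }

        // A quality line may itself start with '@' or '+', so the only safe
        // stop condition is having read as many qualities as bases.
        StringBuilder qualities = new StringBuilder();
        while (qualities.length() < bases.length()) {
            qualities.append(lines.next());
        }
        if (qualities.length() != bases.length()) {
            throw new IllegalArgumentException("quality length differs from sequence length: " + id);
        }
        return new String[] { id, comment, bases.toString(), qualities.toString() };
    }

    public static void main(String[] args) {
        Iterator<String> lines = Arrays.asList(
                "@read1 first read", "ACGT", "AC", "+", "IIII", "@I").iterator();
        System.out.println(String.join(" | ", readRecord(lines)));
    }
}
```

With the default 4-line layout, none of the looping or length counting is needed, which is why the options are off unless requested.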

Low Level Visitor Support

The FastqParser uses the Visitor Pattern to traverse the fastq data and visit each record. Using a visitor gives the programmer full control over which parts of the input data are parsed, but it is very low level. Jillion provides high level DataStore, Java 8 Stream and forEach implementations that are built on top of Visitor objects to hide the low level details regarding fastq file encodings. These will be discussed in the next section.
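For readers unfamiliar with the pattern, the following is a minimal sketch in plain Java of what visiting fastq records can look like. It does not use Jillion's actual visitor interfaces; RecordVisitor, visitDefline and visitRecord are hypothetical names chosen for this illustration, and the parser only handles the plain 4-lines-per-record layout without comments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Illustrative sketch only; these are not Jillion's visitor interfaces.
final class FastqVisitorSketch {

    /** Hypothetical callback interface. */
    interface RecordVisitor {
        /** Return true to receive the body of this record, false to skip it. */
        boolean visitDefline(String id);

        void visitRecord(String id, String bases, String encodedQualities);
    }

    /** Walks 4-lines-per-record fastq data and calls back into the visitor. */
    static void parse(Iterable<String> lines, RecordVisitor visitor) {
        Iterator<String> it = lines.iterator();
        while (it.hasNext()) {
            String defline = it.next();
            if (defline.isEmpty()) {
                continue; // tolerate blank lines between records
            }
            if (defline.charAt(0) != '@') {
                throw new IllegalArgumentException("expected '@' defline but was: " + defline);
            }
            String bases = nextLine(it);
            nextLine(it); // the '+' separator line carries no data here
            String qualities = nextLine(it);
            String id = defline.substring(1);
            // The visitor decides, per record, whether it wants the body.
            if (visitor.visitDefline(id)) {
                visitor.visitRecord(id, bases, qualities);
            }
        }
    }

    private static String nextLine(Iterator<String> it) {
        if (!it.hasNext()) {
            throw new IllegalArgumentException("truncated fastq record");
        }
        return it.next();
    }

    public static void main(String[] args) {
        List<String> fastq = Arrays.asList(
                "@read1", "ACGT", "+", "IIII",
                "@read2", "GG", "+", "II");
        List<String> kept = new ArrayList<>();
        parse(fastq, new RecordVisitor() {
            @Override
            public boolean visitDefline(String id) {
                return id.equals("read2"); // only interested in one record
            }

            @Override
            public void visitRecord(String id, String bases, String encodedQualities) {
                kept.add(id + " " + bases);
            }
        });
        System.out.println(kept);
    }
}
```

The two-step callback is the point of the design: a visitor that rejects a record at the defline never pays for handling its bases and qualities.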

Quality Encoding Can Be Autodetected

If the quality encoding of a fastq file is not known, it can be autodetected at a performance cost. Most FastqReaders and DataStore Builder objects are able to read the whole file twice: the first pass analyzes the encoded quality values of all the records to determine the most likely quality encoding, and the second pass parses the records using that encoding. This technique can also detect when a file contains records that were encoded differently, which is often a sign of a bug in a bioinformatics pipeline that generated fastq data from multiple inputs.

It is recommended that the quality encoding is always given if known, since the extra parsing and analysis to determine the quality encoding comes at a performance penalty.
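One common heuristic for this kind of detection looks at the smallest encoded quality character in the file. The sketch below shows that heuristic in plain Java; it is an assumption about how such detection can work, not a description of Jillion's actual algorithm, and EncodingGuessSketch and guessEncoding are hypothetical names.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative heuristic only; Jillion's detection logic may differ.
final class EncodingGuessSketch {

    /** Guesses the encoding name from the lowest quality character seen. */
    static String guessEncoding(Iterable<String> encodedQualityLines) {
        int min = Integer.MAX_VALUE;
        for (String line : encodedQualityLines) {
            for (int i = 0; i < line.length(); i++) {
                min = Math.min(min, line.charAt(i));
            }
        }
        if (min == Integer.MAX_VALUE) {
            throw new IllegalArgumentException("no quality values to analyze");
        }
        if (min < 59) {
            return "SANGER";   // characters below ';' cannot occur with an offset of 64
        }
        if (min < 64) {
            return "SOLEXA";   // ';' through '?' are negative Solexa scores
        }
        return "ILLUMINA";     // nothing below '@' was seen
    }

    public static void main(String[] args) {
        List<String> qualities = Arrays.asList("!!II", "5555");
        System.out.println(guessEncoding(qualities)); // prints SANGER
    }
}
```

Note the limitation: a Sanger file that happens to contain only very high quality values would be misclassified by this sketch, and every value has to be inspected before any record can be decoded. Both are reasons to supply the encoding explicitly when it is known.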

Looping over a fastq file using ForEach

If each record in a fastq file must be processed only once, the easiest way is to use the convenience FastqFileReader.forEach( ...) static methods. There are several overloaded flavors of forEach that take different combinations of input parameters. All forEach methods take as the last parameter a Jillion version of a Java 8 BiConsumer<String, FastqRecord>. This BiConsumer will be called for each record that meets the parsing criteria. The first parameter is the record's ID as a String; the second parameter is the actual FastqRecord object.

FastqFileReader.forEach( file, FastqQualityCodec.SANGER, 
                             (id, record) -> {
                                  ... //do stuff
                             });


//or using FastqParser
FastqParser parser = ...
FastqFileReader.forEach( parser, FastqQualityCodec.ILLUMINA, 
                             (id, record) -> {
                                  ... //do stuff
                             });

The BiConsumer used is actually a Jillion ThrowingBiConsumer, which allows the lambda to throw checked exceptions. The normal Java 8 BiConsumer cannot throw checked exceptions, so using it would require extra boilerplate to catch exceptions inside the lambda.
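The idea behind a throwing consumer can be sketched in a few lines of plain Java. The interface below is a hypothetical CheckedBiConsumer written for this illustration; it is not Jillion's ThrowingBiConsumer and its signature is an assumption. It differs from java.util.function.BiConsumer only in that accept declares a checked exception type.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only; not Jillion's ThrowingBiConsumer.
final class CheckedConsumerSketch {

    /** Hypothetical interface: like BiConsumer, but accept may throw E. */
    @FunctionalInterface
    interface CheckedBiConsumer<T, U, E extends Throwable> {
        void accept(T t, U u) throws E;
    }

    /** Calls the consumer for every entry and lets its exception type propagate. */
    static <E extends Throwable> void forEachEntry(
            Map<String, String> records,
            CheckedBiConsumer<String, String, E> consumer) throws E {
        for (Map.Entry<String, String> entry : records.entrySet()) {
            consumer.accept(entry.getKey(), entry.getValue());
        }
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("read1", "ACGT");
        records.put("read2", "GG");

        // Appendable.append declares IOException, a checked exception.
        StringBuilder buffer = new StringBuilder();
        Appendable sink = buffer;

        // No try/catch is needed inside the lambda; the compiler infers
        // E = IOException and the caller handles or declares it.
        forEachEntry(records, (id, bases) -> sink.append(id).append('\t').append(bases).append('\n'));
        System.out.print(buffer);
    }
}
```

With a plain BiConsumer the same lambda would not compile unless the append calls were wrapped in a try/catch block that rethrows an unchecked exception.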

Using a FastqFileDataStore

A FastqFileDataStore is a Jillion DataStore that wraps an input fastq file. There is an additional method to get the quality codec that was used to parse the datastore. This is helpful if the program needs to write out new fastq files using the same codec.

The FastqFileDataStoreBuilder uses the same implementation hints to determine how to store the fastq data for the best trade-off between random access, memory and time, as described in The DataStore page.

File gzippedFastq = new File("my.fastq.gz");

try (FastqDataStore datastore = new FastqFileDataStoreBuilder(gzippedFastq)
                                    .qualityCodec(FastqQualityCodec.ILLUMINA)
                                    .hint(DataStoreProviderHint.ITERATION_ONLY)
                                    .build()) {
    FastqRecord record = datastore.get("myId");

} // autoclose datastore

FastqWriter

Example Write Multiline Fastq

File fastqFile = new File("path/to/fastq");
File outFile = new File("path/to/output.fastq");

Set<String> idsToInclude = new HashSet<>(); //put names here

//the reader can auto-detect the quality encoding
//for us for a minor performance penalty.
try (Results results = FastqFileReader.read(fastqFile, idsToInclude::contains);
     //writer uses the same quality codec as the input fastq file, as detected by the reader
     FastqWriter writer = new FastqWriterBuilder(outFile)
                              .qualityCodec(results.getCodec())
                              .basesPerLine(50)
                              .build()) {
    results.forEach((id, record) -> writer.write(record));
}
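To picture what a multi-line record looks like on disk, here is a minimal sketch in plain Java that wraps a record at a fixed number of bases per line. It is not Jillion's FastqWriter; MultilineWriteSketch, wrap and format are hypothetical names, and wrapping the quality line at the same width as the sequence is an assumption of this sketch rather than documented FastqWriter behavior.

```java
// Illustrative sketch only; not Jillion's FastqWriter.
final class MultilineWriteSketch {

    /** Splits text into newline-terminated lines of at most perLine characters. */
    static String wrap(String text, int perLine) {
        if (perLine < 1) {
            throw new IllegalArgumentException("perLine must be >= 1");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i += perLine) {
            out.append(text, i, Math.min(text.length(), i + perLine)).append('\n');
        }
        return out.toString();
    }

    /** Formats one record, wrapping bases and (by assumption) qualities alike. */
    static String format(String id, String bases, String encodedQualities, int basesPerLine) {
        return "@" + id + "\n"
                + wrap(bases, basesPerLine)
                + "+\n"
                + wrap(encodedQualities, basesPerLine);
    }

    public static void main(String[] args) {
        System.out.print(format("read1", "ACGTACGTAC", "IIIIIIIIII", 4));
    }
}
```

A file written this way must be read back with multi-line parsing enabled, since each record now spans more than 4 lines.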
