-
Notifications
You must be signed in to change notification settings - Fork 24
de Bruijn Graph
The core data structure of McCortex is a de Bruijn Graph, which is a graph where nodes are kmers (words of length k) and edges exist between all kmers which overlap by k-1 characters. Kmer size k is fixed for any given graph (more on kmer size). The kmers in genome assembly use bases (i.e. the DNA letters A,C,G,T).
An example with k=5:
ACCAT -> CCATC
which has two nodes (ACCAT,CCATC) and one edge between them. Traversing this graph from left to right we assemble the sequence ACCATC. From right to left we assemble the reverse complement GATGGT.
Each node in the McCortex de Bruijn graph represents both the kmer and is reverse complement. We refer to this representation as the kmer-key (lexically lowest of the two). Nodes can then be traversed with an orientation FORWARD (the kmer-key) or REVERSE (the reverse complement of the kmer-key).
By only allowing odd values of parameter k, a kmer can never be its own reverse complement. This is why --kmer-size parameter must always be odd.
There are various ways to represent a de Bruijn graph in memory, the most obvious being a hash table where the key is the kmer-key. More compact representations have been developed and present different pay offs in speed and required memory.
- wikipedia
- Nature review paper on using de Bruijn graphs in genome assembly