Skip to content

algbio/ChainX

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

ChainX

Colinear chain between two sequences showing overlap and gap costs

Fork of at-cg/ChainX, a tool that computes co-linear chaining costs between an input target and query sequences. It supports global and semi-global comparison modes. The new features in this fork are:

  • optimal chaining (with option --optimal), since ChainX can output a suboptimal chain in a small number of edge cases;
  • non-constant magic numbers to minimize the iterations of the chaining algorithm (the original behaviour can be restored with --originalmagicnumbers).

Dependencies / External libraries

ChainX repository uses many third-party libraries. These are separately provided in ext folder.

Additionally, the experimental tests requires awk, wget, seqtk, and GNU octave.

Installation

git clone https://github.com/nrizzo/ChainX.git
cd ChainX
make

Usage

SYNOPSIS
        ./chainX [-l <length>] [-a (MEM|MUM)] [--all2all] [--naive] [--optimal]
                 [--originalmagicnumbers] -m (g|sg) -q <qpath> -t <tpath>

OPTIONS
        <length>    minimum anchor match length (default = 20)
        MEM|MUM     anchor type (default = MUM)
        --all2all   output all to all global distances among query sequences in phylip format
        --naive     use slow 2d dynamic programming algorithm for correctness check
        --optimal   compute optimal chain by using diagonal distance
        --originalmagicnumbers
                    use original magic numbers instead of inverse query coverage times 1.5 and
                    ramp-up factor of 4

        g|sg        distance function (e.g., global or semi-global)
        <qpath>     query sequences in fasta or fastq format
        <tpath>     target sequence in fasta format

Example

Test data can be accessed from data folder. Here is an example run.

$ ./chainX -m g -q data/time_global/mutated_80_perc.fasta -t data/time_global/Chromosome_2890043_3890042_0.fasta --optimal
INFO, chainx::parseandSave, target sequence file = data/time_global/Chromosome_2890043_3890042_0.fasta
INFO, chainx::parseandSave, query sequences file = data/time_global/mutated_80_perc.fasta
INFO, chainx::parseandSave, mode = g
INFO, chainx::parseandSave, computing optimal chain with diagonal distance
INFO, chainx::parseandSave, anchor : minimim length = 20, type = MUM
INFO, chainx::main, read 1 queries, 999760 residues
INFO, chainx::main, read target, 1000000 residues

INFO, chainx::main, timer set
INFO, chainx::main, suffix array computed in 0.137535 seconds

INFO, chainx::main, timer reset
INFO, chainx::main, count of anchors (including dummy) = 2964, average length = 24.0651
INFO, chainx::main, query #0 (999760 residues), (71302 coverage, 1392687 initial guess), distance = (1 iterations) 939635
INFO, chainx::main, distance computation finished (0.255348 seconds elapsed)

Experiments

Command make test:

  • replicates the ChainX experiments and compares the original algorithm (--originalmagicnumbers) to the new optimal version (--originalmagicnumbers --optimal) and the final optimal version (--optimal);
  • runs ChainX and the final optimal version (--optimal) in semi-global mode on 100K PacBio HiFi reads on the on the T2T-CHM13 reference, requiring ~25GB of disk space and ~50GB of RAM.

In the experiment on human long reads, the optimal version has comparable speed to original ChainX and finds a better chain for 2.30% reads (more details in the paper).

Box plot of absolute and relative improvement of optimal ChainX

Publications

About

Anchor chaining method for nucleotide sequences

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • C++ 45.4%
  • C 32.0%
  • TeX 9.2%
  • JavaScript 7.7%
  • BibTeX Style 1.8%
  • Roff 1.3%
  • Other 2.6%