krepp is a k-mer-based maximum likelihood tool for estimating distances of reads to genomes and phylogenetic placement.
For the description of the method, refer to the main paper here.
See the Wiki for detailed documentation, a list of available databases, and various tutorials.
The easiest way to install krepp is with conda:
conda install bioconda::krepp
This will install the latest available version. Run krepp --help to test the installation.
To compile from source, clone the repository with its submodules (this might take a while) and run make:
git clone --recurse-submodules -j8 https://github.com/bo1929/krepp.git
cd krepp && make
Then run ./krepp --help to test the build. Optionally, copy the binary to a directory on your $PATH (e.g., cp ./krepp ~/.local/bin).
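If you copy the binary to ~/.local/bin, it is worth confirming that the directory is actually on your $PATH. The helper below is a minimal sketch; the function name dir_on_path is hypothetical and not part of krepp.

```shell
# Hypothetical helper (not part of krepp): succeed if directory $1 is listed
# in a colon-separated path string ($2, defaulting to the current $PATH).
dir_on_path() {
  case ":${2:-$PATH}:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   dir_on_path "$HOME/.local/bin" || echo "add ~/.local/bin to your PATH"
```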
You can build an index from scratch, using only the 25 genomes provided in test/, to familiarize yourself with krepp.
cd test/
tar -xvf references_toy.tar.gz && xz -d references_toy/*
krepp index -h 11 -k 27 -w 35 -o index_toy -i input_map.tsv -t tree_toy.nwk --num-threads 8
This command took only a couple of seconds and used <1.5GB of memory to index 6,863,411 k-mers.
The resulting index will be stored in index_toy.
Alternatively, for a more realistic setting, you can download one of the larger public libraries and use it with your own query sequences.
Once you have your index (e.g., the one we built above: index_toy), you can estimate distances by running:
krepp dist -i index_toy -q query_toy.fq --num-threads 4 | tee distances_toy.tsv
The first few lines of distances_toy.tsv will look like:
#software: krepp #version: v0.6.0 #invocation :krepp dist -i index_toy -q query_toy.fq --num-threads 4
SEQ_ID REFERENCE_NAME DIST
||61435-4122 G000341695 0.0898062
||61435-4949 G000830905 0.147048
||61435-4949 G000341695 0.0740587
||61435-4949 G000025025 0.131182
||61435-4949 G000741845 0.0395985
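Each read can appear on several rows, one per reference. To keep only the closest reference for each read, a small awk filter is enough. This is a sketch that assumes the three whitespace-separated columns shown above (SEQ_ID, REFERENCE_NAME, DIST) and treats lines starting with # as metadata; the function name closest_reference is hypothetical.

```shell
# Hypothetical helper (not part of krepp): for each read in a `krepp dist`
# table ($1), print the reference with the smallest DIST.
closest_reference() {
  awk '
    /^#/ || $1 == "SEQ_ID" { next }   # skip metadata and the column header
    !($1 in best) || $3 + 0 < best[$1] + 0 { best[$1] = $3; ref[$1] = $2 }
    END { for (id in best) print id "\t" ref[id] "\t" best[id] }
  ' "$1" | LC_ALL=C sort
}

# Usage:
#   closest_reference distances_toy.tsv
```

The output is sorted by read ID so that it is stable across runs; awk itself does not guarantee any order when iterating over an array.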
Quite similarly, you can place reads by running:
krepp --num-threads 8 place -i index_toy -q query_toy.fq | tee placements_toy.jplace
The resulting placement file is a JSON file in a special format called jplace:
head -n20 placements_toy.jplace
{
"version" : 3,
"fields" : ["edge_num", "pendant_length", "distal_length", "likelihood", "like_weight_ratio", "distance"],
"placements" : [
{"n" : ["||61435-4122"], "p" : [[39, 0.0986, 0.0011, -18.9251, 1.0000, 0.0933]]},
{"n" : ["||61435-4949"], "p" : [
[38, 0.0433, 0.0007, -31.8820, 0.2240, 0.0427],
[37, 0.0523, 0.0000, -38.5286, 0.1582, 0.0505],
[40, 0.0400, 0.0084, -35.2375, 0.1900, 0.0468],
[36, 0.0314, 0.0020, -24.9208, 0.2696, 0.0326],
[39, 0.0512, 0.0011, -38.5314, 0.1582, 0.0505]]
},
{"n" : ["||61435-317"], "p" : [
[38, 0.0058, 0.0007, -18.0543, 0.1837, 0.0065],
[37, 0.0060, 0.0000, -18.0878, 0.1844, 0.0060],
[40, -0.0021, 0.0084, -18.0759, 0.1842, 0.0062],
[36, 0.0052, 0.0020, -18.0116, 0.1812, 0.0071],
[39, 0.0049, 0.0011, -18.0924, 0.1844, 0.0060],
[41, -0.1333, 0.1497, -22.2532, 0.0821, 0.0162]]
},
Here, the n field holds the read ID as it appears in query_toy.fq, and the p field holds the placement information, with the following fields:
["edge_num", "pendant_length", "distal_length", "likelihood", "like_weight_ratio", "distance"]
At the end of the jplace file, you can find the phylogeny decorated with edge numbers, which correspond to the first field of p.
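To look up which node an edge number refers to, you can pull the label and edge number pairs out of the decorated tree string. The sketch below assumes every edge is written as label:length{edge_num}, as in the tree printed in the header of the tabular output further down; the function name edge_labels is hypothetical.

```shell
# Hypothetical helper (not part of krepp): print "edge_num<TAB>label" for
# every label:length{edge_num} token found in the file given as $1.
edge_labels() {
  grep -oE '[^(),:;{}]+:[0-9.eE+-]+\{[0-9]+\}' "$1" |
    awk -F'[:{}]' '{ print $3 "\t" $1 }'   # fields: label, length, edge_num
}

# Usage:
#   edge_labels placements_toy.jplace
```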
You can proceed with your downstream analysis using other tools, such as gappa, e.g., by generating a heat tree colored by placement density across the backbone tree:
gappa examine heat-tree --jplace-path placements_toy.jplace --write-svg-tree
Alternatively, for a simpler format, pass the --tabular flag. The first 20 lines would then look like:
# software: krepp version: v0.6.0 invocation :krepp --num-threads 8 place -i index_toy -q query_toy.fq --tabular
# (G001917855:0.4290{0},(G001918235:0.5280{1},(((((G000526415:0.1764{2},G001306135:0.2276{3})N1779:0.0365{4},(G000016665:0.1683{5},(G000735195:0.0276{6},G000018865:0.0229{7})N2640:0.1610{8})N1780:0.0609{9})N1532:0.1058{10},G000021685:0.3879{11})N1303:0.0337{12},(G002010545:0.3919{13},((G001050235:0.1711{14},G001306055:0.1635{15})N5461:0.1783{16},G001567105:0.2932{17})N2355:0.1933{18})N1304:0.0560{19})N1099:0.0288{20},(G000702505:0.1445{21},G001914715:0.1983{22})N1305:0.2353{23},(G001796575:0.3977{24},G001795015:0.4213{25},((G001796415:0.2756{26},(G002010445:0.3607{27},G001795205:0.3152{28})N3984:0.0458{29})N3292:0.0233{30},(((G000025025:0.0115{31},G001889305:0.0109{32})N4334:0.0068{33},G000830905:0.0305{34})N3987:0.0117{35},((G000741845:0.0039{36},G001610775:0.0001{37})N5905:0.0014{38},G000341695:0.0022{39})N4337:0.0167{40})N3634:0.2994{41})N2954:0.1317{42})N1788:0.1622{43})N916:0.0295{44})N736:0.0348{45})N432:0.0257{46};
SEQ_ID DISTAL_NODE EDGE_NUM LWR DIST
||61435-4122 G000341695 39 1.0000 0.0933
||61435-4949 G000741845 36 0.2696 0.0326
||61435-4949 N5905 38 0.2240 0.0427
||61435-4949 G000341695 39 0.1582 0.0505
||61435-4949 G001610775 37 0.1582 0.0505
||61435-4949 N4337 40 0.1900 0.0468
||61435-317 N3634 41 0.0821 0.0162
||61435-317 G000741845 36 0.1812 0.0071
||61435-317 N5905 38 0.1837 0.0065
||61435-317 G000341695 39 0.1844 0.0060
||61435-317 G001610775 37 0.1844 0.0060
||61435-317 N4337 40 0.1842 0.0062
||61435-2985 G000741845 36 0.2048 0.0158
||61435-2985 N5905 38 0.2023 0.0175
||61435-2985 G000341695 39 0.1966 0.0189
||61435-2985 G001610775 37 0.1966 0.0189
||61435-2985 N4337 40 0.1997 0.0183
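The tabular format is easy to summarize with standard tools. As a rough sketch in the spirit of the placement densities mentioned above (not gappa's implementation), the following sums the LWR values that fall on each edge. It assumes the five whitespace-separated columns shown above and skips the # header lines; the function name edge_mass is hypothetical.

```shell
# Hypothetical helper (not part of krepp): sum LWR per edge in a
# `krepp place --tabular` table ($1); prints edge_num, distal node, total LWR.
edge_mass() {
  awk '
    /^#/ || $1 == "SEQ_ID" { next }   # skip metadata, tree, and column header
    { mass[$3] += $4; node[$3] = $2 }
    END { for (e in mass) printf "%s\t%s\t%.4f\n", e, node[e], mass[e] }
  ' "$1" | LC_ALL=C sort -n
}

# Usage (assuming the tabular output was saved to placements_toy.tsv):
#   edge_mass placements_toy.tsv
```

Edges with a large total attract many reads or a few reads placed with high confidence, so the summary gives a quick view of where on the backbone tree the sample concentrates.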
@article{sapci_krepp_2026,
title = {krepp: a k-mer-based maximum pseudo-likelihood method for estimating read distances and genome-wide phylogenetic placement},
volume = {27},
issn = {1474-760X},
shorttitle = {krepp},
url = {https://doi.org/10.1186/s13059-026-03999-y},
doi = {10.1186/s13059-026-03999-y},
number = {1},
urldate = {2026-03-28},
journal = {Genome Biology},
author = {Şapcı, Ali Osman Berk and Mirarab, Siavash},
month = feb,
year = {2026},
keywords = {Average nucleotide identity, k-mer-based sequence comparison, Metagenomics, Phylogenetic placement},
pages = {108},
}