Commit 2318975
Further VCF reading speed optimisations.
This is the main meat of the VCF read speedup, following on from the
previous code refactoring. Combined timings for a very INFO-heavy
single-sample GNOMAD file, a many-sample (approx. 4000) FORMAT-rich
file, and the GIAB HG002 VCF truth set, with different compilers, are:
INFO heavy    (15-29% speedup)    (34-39% speedup)
              dev(s)   PR(s)      dev(s)   PR(s)
clang13        6.29     5.34       2.84     1.85
gcc13          6.74     5.22       2.93     1.93
gcc7           7.96     5.65       3.25     1.98

FORMAT heavy  (6-19% speedup)     (18-22% speedup)
              dev(s)   PR(s)      dev(s)   PR(s)
clang13        9.17     8.58       5.45     4.48
gcc13          9.88     8.04       5.08     3.95
gcc7           9.12     8.33       4.87     3.98

GIAB HG002    (28-29% speedup)    (33-37% speedup)
              dev(s)   PR(s)      dev(s)   PR(s)
clang13       12.88     9.30       5.12     3.29
gcc13         12.04     8.60       4.74     3.19
gcc7          12.87     9.37       5.32     3.34

(Tested on an Intel Xeon Gold 6142 and an AMD Zen4 respectively.)
Bigger speedups (see first message in PR) were seen on some older
hardware.
Specific optimisations along with estimates of their benefit include,
in approximate order of writing / testing:
- Adding consts and caching of bcf_hdr_nsamples(h).
No difference on system gcc (gcc7) and clang13, but a couple percent
gain on gcc13.
- Remove the need for most calls to hts_str2uint by recognising that
most GT numbers are single digits. This was a 4-5% saving for gcc and
9-10% for clang.
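
As an illustration of the single-digit idea, here is a minimal sketch. The helper name parse_allele is hypothetical and this is not the htslib code: it returns immediately when one digit is followed by a non-digit, and only falls back to a general loop otherwise.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an htslib function): parse a non-negative
 * decimal integer starting at s.  On return *end points to the first
 * unparsed character. */
static uint32_t parse_allele(const char *s, const char **end)
{
    assert(s != NULL && end != NULL);

    /* Fast path: one digit followed by a non-digit, the common GT case. */
    if (s[0] >= '0' && s[0] <= '9' && !(s[1] >= '0' && s[1] <= '9')) {
        *end = s + 1;
        return (uint32_t)(s[0] - '0');
    }

    /* Slow path: general multi-digit loop (no overflow check in this sketch). */
    uint32_t v = 0;
    while (*s >= '0' && *s <= '9')
        v = v * 10 + (uint32_t)(*s++ - '0');
    *end = s;
    return v;
}
```

For a genotype such as "0/1" both alleles take the fast path; a missing allele "." consumes nothing and leaves *end at the dot for the caller to handle.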
- Remove kputc calls in bcf_enc_vint / bcf_enc_size, avoiding repeated
ks_resize checking. This is a further ~10% speedup.
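
The general pattern can be sketched as follows, using a hypothetical buf_t type and helpers that stand in for kstring_t and its functions (an assumption for illustration, not the actual bcf_enc_vint code): do one capacity check for the whole array, then store bytes directly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable buffer, standing in for kstring_t. */
typedef struct { size_t l, m; uint8_t *s; } buf_t;

/* Ensure room for `extra` more bytes; returns 0 on success, -1 on failure. */
static int buf_reserve(buf_t *b, size_t extra)
{
    if (b->l + extra <= b->m) return 0;
    size_t m = b->m ? b->m : 16;
    while (m < b->l + extra) m *= 2;
    uint8_t *t = realloc(b->s, m);
    if (!t) return -1;
    b->s = t;
    b->m = m;
    return 0;
}

/* Append n small integers as single bytes: one capacity check up front,
 * then a plain store loop with no per-byte resize test. */
static int put_int8_array(buf_t *b, const int32_t *a, size_t n)
{
    assert(b != NULL && (a != NULL || n == 0));
    if (n == 0) return 0;
    if (buf_reserve(b, n) < 0) return -1;

    uint8_t *p = b->s + b->l;
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)a[i];
    b->l += n;
    return 0;
}
```

With the resize test hoisted out, the inner loop has no branches other than its own bound, which also gives the compiler a better chance of vectorising it.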
- Unrolling in bcf_enc_vint to encourage SIMD.
- Improve speed of bgzf_getline and kstrtok via memchr/strchr.
In tabix timings for indexing VCF, the bgzf_getline change is 9-22%
quicker with clang 13 and 19-25% quicker with gcc 7. I did investigate a
manually unrolled 64-bit search, before I remembered the existence of
memchr (doh!). This is often faster on clang (17-19%), but marginally
slower on gcc. The speedup of the function itself, however, is
considerably greater (3-4x quicker).
For interest, I include the equivalent code here, as it may be useful in
other contexts:
    #if HTS_ALLOW_UNALIGNED != 0 && ULONG_MAX == 0xffffffffffffffff
        // 64-bit unrolled delim detection
        #define haszero(x) (((x)-0x0101010101010101UL)&~(x)&0x8080808080808080UL)

        // Quicker broadcast on clang than bit shuffling in delim
        union {
            uint64_t d8;
            uint8_t d[8];
        } u;
        memset(u.d, delim, 8);
        const uint64_t d8 = u.d8;

        uint64_t *b8 = (uint64_t *)(&buf[fp->block_offset]);
        const int l8 = (fp->block_length-fp->block_offset)/8;

        for (l = 0; l < (l8 & ~3); l+=4) {
            if (haszero(b8[l+0] ^ d8))
                break;
            if (haszero(b8[l+1] ^ d8)) {
                l++;
                break;
            }
            if (haszero(b8[l+2] ^ d8)) {
                l+=2;
                break;
            }
            if (haszero(b8[l+3] ^ d8)) {
                l+=3;
                break;
            }
        }
        l *= 8;
        for (l += fp->block_offset;
             l < fp->block_length && buf[l] != delim;
             l++);
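
The fragment above depends on the surrounding bgzf_getline variables. As a rough standalone sketch of the same word-at-a-time idea, the hypothetical find_delim below (not part of htslib) uses memcpy loads so it does not rely on unaligned access being permitted, and omits the 4x unrolling for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Non-zero if and only if at least one byte of x is zero. */
#define HASZERO(x) \
    (((x) - UINT64_C(0x0101010101010101)) & ~(x) & UINT64_C(0x8080808080808080))

/* Hypothetical helper: index of the first `delim` in buf[0..len), or len
 * if it does not occur. */
static size_t find_delim(const unsigned char *buf, size_t len,
                         unsigned char delim)
{
    assert(buf != NULL || len == 0);

    uint64_t d8;
    memset(&d8, delim, sizeof(d8));        /* broadcast delim to all 8 bytes */

    size_t i = 0;
    for (; i + 8 <= len; i += 8) {
        uint64_t w;
        memcpy(&w, buf + i, sizeof(w));    /* alignment-safe 64-bit load */
        if (HASZERO(w ^ d8))
            break;                         /* delim is somewhere in this word */
    }

    /* Byte-wise scan of the matching word, or of the short tail. */
    while (i < len && buf[i] != delim)
        i++;
    return i;
}
```

XORing each word with the broadcast delimiter turns matching bytes into zero bytes, so the zero-byte test locates the 8-byte word and the final loop pins down the exact position. The result should agree with memchr on the same buffer.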
The analogous kstrtok change uses strchr+strlen instead of memchr
as we don't know the string end. This makes kstrtok around 150%
quicker when parsing a single sample VCF.
When not finding aux->sep in the string, strchr returns NULL rather
than end of string, so we need an additional strlen to set aux->p.
However there is also glibc's strchrnul which solves this in a single
call. This makes kstrtok another 40% quicker on this test, but
overall it's not a big bottleneck any more.
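
A minimal sketch of the strchr+strlen fallback described above. The names find_sep and count_fields are hypothetical and this is not the kstrtok implementation; on glibc, find_sep could be replaced by a single strchrnul call (a GNU extension, so not used here to keep the sketch portable).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return a pointer to the first `sep` in s, or to the
 * terminating NUL if sep is absent (the behaviour of glibc's strchrnul). */
static const char *find_sep(const char *s, char sep)
{
    assert(s != NULL);
    const char *p = strchr(s, sep);
    return p ? p : s + strlen(s);   /* second pass only when sep is missing */
}

/* Count sep-delimited fields by hopping from separator to separator. */
static size_t count_fields(const char *s, char sep)
{
    size_t n = 1;
    for (const char *p = find_sep(s, sep); *p; p = find_sep(p + 1, sep))
        n++;
    return n;
}
```

The extra strlen is only paid once per string, on the final token, which is why it matters most for short strings with few separators.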
- Use strchr in vcf_parse_info.
This is a major speed increase over manual searching on Linux.
TODO: is this just glibc? Eg libmusl speeds, etc? Other OSes?
It saves about 33% of time in vcf_parse (vcf_parse_info inlined to it)
with gcc. Even more with clang. The total speed gain on a single
sample VCF view (GIAB truth set) is 12-19% fewer cycles.
- Minor "GT" check improvement. This has no real effect on gcc13 and
clang13, but the system gcc (gcc7) speeds up single-sample VCF decoding by 7%.
- Speed up the unknown value check (strcmp(p, ".")). Helps gcc7 the
most (9%), with gcc13/clang13 gaining 3-4%.
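
One plausible way to do such a check without a library call is shown below. This is an assumption for illustration; the commit message does not say exactly what replaced strcmp, and is_missing is a hypothetical name.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: true if the NUL-terminated field is exactly ".". */
static int is_missing(const char *p)
{
    assert(p != NULL);
    /* Two byte compares.  p[1] is only read when p[0] is '.', so the
     * empty string is handled safely. */
    return p[0] == '.' && p[1] == '\0';
}
```

This gives the same answer as strcmp(p, ".") == 0, but is trivially inlinable on compilers that do not already expand short strcmp calls themselves.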
- Speed up vcf_parse_format_max3.
This is the first parse through the FORMAT fields. Ideally we'd merge
this and fill5 (the other parse through), but that is harder due to
the data pivot / rotate.
For now we just optimise the existing code path. Instead of a
laborious switch character by character, we have an initial tight loop
to find the first meta-character and then a switch to do
character-dependent code.
This is a 5% to 13% speedup depending on data set.
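
The shape of that loop can be sketched as below. The function max_values is a hypothetical, much simplified stand-in for vcf_parse_format_max3: it only tracks the largest number of comma-separated values in any colon-separated field across tab-separated samples.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: over tab-separated samples of colon-separated
 * fields, return the largest number of comma-separated values seen in
 * any one field. */
static int max_values(const char *s)
{
    assert(s != NULL);
    int n = 1, max = 1;
    for (;;) {
        /* Tight loop: skip ordinary characters. */
        while (*s != ',' && *s != ':' && *s != '\t' && *s != '\0')
            s++;

        /* Only meta-characters reach the switch. */
        switch (*s) {
        case ',':               /* another value in the current field */
            n++;
            break;
        case ':':               /* end of field */
        case '\t':              /* end of sample */
            if (max < n) max = n;
            n = 1;
            break;
        case '\0':              /* end of line */
            return max < n ? n : max;
        }
        s++;
    }
}
```

Most characters in a FORMAT column are ordinary data, so they are consumed by the short inner loop and the switch runs once per delimiter instead of once per byte.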
- Remove kputc and minimise resize for bcf_enc_int1.
3-8% speedup depending on data / compiler.
- Use memcmp instead of strcmp for "END" and ensure we have room.
Also memset over explicit nulling of arrays.
- Force BCF header dicts to be larger than needed.
This is a tactic to reduce hash collisions due to the use of overly
simple hash functions. It seems to typically be around 3-8% speed gain.
- Restructure of main vcf_parse function.
This can speed things up by 6-7% on basic single-sample files.
The previous loop caused lots of branch prediction misses due to the
counter 'i' being used to do 8 different parts of code depending on
token number. Additionally it's got better error checking now as
previously running out of tokens early just did a return 0 rather than
complaining about missing columns.

1 parent 7c1d3cc, commit 2318975
4 files changed, +476 -189 lines changed