Commit 2318975

Further VCF reading speed optimisations.
This is the main meat of the VCF read speedup, following on from the
previous code refactoring.  Combined timings, for different compilers,
on a very INFO-heavy GNOMAD single-sample file, a many-sample
(approx 4000) FORMAT-rich file, and the GIAB HG002 VCF truth set are:

    INFO heavy     Xeon (15-29% speedup)   Zen4 (34-39% speedup)
                   dev(s)   PR(s)          dev(s)   PR(s)
    clang13        6.29     5.34           2.84     1.85
    gcc13          6.74     5.22           2.93     1.93
    gcc7           7.96     5.65           3.25     1.98

    FORMAT heavy   Xeon (6-19% speedup)    Zen4 (18-22% speedup)
                   dev      PR             dev      PR
    clang13        9.17     8.58           5.45     4.48
    gcc13          9.88     8.04           5.08     3.95
    gcc7           9.12     8.33           4.87     3.98

    GIAB HG002     Xeon (28-29% speedup)   Zen4 (33-37% speedup)
                   dev      PR             dev      PR
    clang13        12.88    9.30           5.12     3.29
    gcc13          12.04    8.60           4.74     3.19
    gcc7           12.87    9.37           5.32     3.34

(Tested on an Intel Xeon Gold 6142 and an AMD Zen4 respectively.)
Bigger speedups (see first message in PR) were seen on some older
hardware.

Specific optimisations, along with estimates of their benefit, in
approximate order of writing / testing:

- Adding consts and caching of bcf_hdr_nsamples(h).  No difference on
  the system gcc (gcc7) and clang13, but a couple of percent gain on
  gcc13.

- Remove the need for most calls to hts_str2uint by recognising that
  most GT numbers are single digits.  This was a 4-5% saving for gcc
  and 9-10% on clang.

- Remove kputc calls in bcf_enc_vint / bcf_enc_size, avoiding repeated
  ks_resize checking.  This is a further ~10% speedup.

- Unrolling in bcf_enc_vint to encourage SIMD.

- Improve speed of bgzf_getline and kstrtok via memchr/strchr.  In
  tabix timings indexing VCF, the bgzf_getline change is 9-22% quicker
  with clang13 and 19-25% quicker with gcc7.

  I did investigate a manually unrolled 64-bit search, before I
  remembered the existence of memchr (doh!).  This is often faster on
  clang (17-19%), but marginally slower on gcc.  The actual speedup to
  this function however is considerably more (3-4x quicker).
  For interest, I include the equivalent code here, as it may be
  useful in other contexts:

    #if HTS_ALLOW_UNALIGNED != 0 && ULONG_MAX == 0xffffffffffffffff
    // 64-bit unrolled delim detection
    #define haszero(x) (((x)-0x0101010101010101UL)&~(x)&0x8080808080808080UL)

    // Quicker broadcast on clang than bit shuffling in delim
    union { uint64_t d8; uint8_t d[8]; } u;
    memset(u.d, delim, 8);
    const uint64_t d8 = u.d8;

    uint64_t *b8 = (uint64_t *)(&buf[fp->block_offset]);
    const int l8 = (fp->block_length-fp->block_offset)/8;
    for (l = 0; l < (l8 & ~3); l+=4) {
        if (haszero(b8[l+0] ^ d8)) break;
        if (haszero(b8[l+1] ^ d8)) { l++;  break; }
        if (haszero(b8[l+2] ^ d8)) { l+=2; break; }
        if (haszero(b8[l+3] ^ d8)) { l+=3; break; }
    }
    l *= 8;
    for (l += fp->block_offset; l < fp->block_length && buf[l] != delim; l++);

  The analogous kstrtok change is using strchr+strlen instead of
  memchr, as we don't know the string end.  This makes kstrtok around
  150% quicker when parsing a single-sample VCF.  When not finding
  aux->sep in the string, strchr returns NULL rather than the end of
  the string, so we need an additional strlen to set aux->p.  However
  there is also glibc's strchrnul which solves this in a single call.
  That makes kstrtok another 40% quicker on this test, but overall
  it's not a big bottleneck any more.

- Use strchr in vcf_parse_info.  This is a major speed increase over
  manual searching on Linux.  TODO: is this just glibc?  Eg libmusl
  speeds, etc?  Other OSes?

  It saves about 33% of the time in vcf_parse (vcf_parse_info is
  inlined into it) with gcc.  Even more with clang.  The total speed
  gain on a single-sample VCF view (GIAB truth set) is 12-19% fewer
  cycles.

- Minor "GT" check improvement.  This has no real effect on gcc13 and
  clang13, but the system gcc (gcc7) speeds up single-sample VCF
  decoding by 7%.

- Speed up the unknown value check (strcmp(p, ".")).  Helps gcc7 the
  most (9%), with gcc13/clang13 seeing 3-4% gains.

- Speed up vcf_parse_format_max3.  This is the first parse through the
  FORMAT fields.
  Ideally we'd merge this and fill5 (the other parse through), but
  that is harder due to the data pivot / rotate.  For now we just
  optimise the existing code path.  Instead of a laborious
  character-by-character switch, we have an initial tight loop to find
  the first meta-character and then a switch to do the
  character-dependent code.  This is a 5% to 13% speedup depending on
  the data set.

- Remove kputc and minimise resizes for bcf_enc_int1.  3-8% speedup
  depending on data / compiler.

- Use memcmp instead of strcmp for "END" and ensure we have room.
  Also memset over explicit nulling of arrays.

- Force BCF header dicts to be larger than needed.  This is a tactic
  to reduce hash collisions due to the use of overly simple hash
  functions.  It seems to typically be around a 3-8% speed gain.

- Restructure of the main vcf_parse function.  This can speed things
  up by 6-7% on basic single-sample files.  The previous loop caused
  lots of branch-prediction misses due to the counter 'i' being used
  to select 8 different parts of code depending on the token number.
  Additionally it now has better error checking, as previously running
  out of tokens early just did a "return 0" rather than complaining
  about missing columns.
1 parent 7c1d3cc commit 2318975

File tree

4 files changed: +476 -189 lines changed

bgzf.c

Lines changed: 7 additions & 1 deletion
@@ -2280,7 +2280,13 @@ int bgzf_getline(BGZF *fp, int delim, kstring_t *str)
             if (fp->block_length == 0) { state = -1; break; }
         }
         unsigned char *buf = fp->uncompressed_block;
-        for (l = fp->block_offset; l < fp->block_length && buf[l] != delim; ++l);
+
+        // Equivalent to a naive byte by byte search from
+        // buf + block_offset to buf + block_length.
+        void *e = memchr(&buf[fp->block_offset], delim,
+                         fp->block_length - fp->block_offset);
+        l = e ? (unsigned char *)e - buf : fp->block_length;
+
         if (l < fp->block_length) state = 1;
         l -= fp->block_offset;
         if (ks_expand(str, l + 2) < 0) { state = -3; break; }

htslib/vcf.h

Lines changed: 52 additions & 33 deletions
@@ -1522,26 +1522,37 @@ static inline int bcf_format_gt(bcf_fmt_t *fmt, int isample, kstring_t *str)
 
 static inline int bcf_enc_size(kstring_t *s, int size, int type)
 {
-    uint32_t e = 0;
-    uint8_t x[4];
-    if (size >= 15) {
-        e |= kputc(15<<4|type, s) < 0;
-        if (size >= 128) {
-            if (size >= 32768) {
-                i32_to_le(size, x);
-                e |= kputc(1<<4|BCF_BT_INT32, s) < 0;
-                e |= kputsn((char*)&x, 4, s) < 0;
-            } else {
-                i16_to_le(size, x);
-                e |= kputc(1<<4|BCF_BT_INT16, s) < 0;
-                e |= kputsn((char*)&x, 2, s) < 0;
-            }
+    // Most common case is first
+    if (size < 15) {
+        if (ks_resize(s, s->l + 1) < 0)
+            return -1;
+        uint8_t *p = (uint8_t *)s->s + s->l;
+        *p++ = (size<<4) | type;
+        s->l++;
+        return 0;
+    }
+
+    if (ks_resize(s, s->l + 6) < 0)
+        return -1;
+    uint8_t *p = (uint8_t *)s->s + s->l;
+    *p++ = 15<<4|type;
+
+    if (size < 128) {
+        *p++ = 1<<4|BCF_BT_INT8;
+        *p++ = size;
+        s->l += 3;
+    } else {
+        if (size < 32768) {
+            *p++ = 1<<4|BCF_BT_INT16;
+            i16_to_le(size, p);
+            s->l += 4;
         } else {
-            e |= kputc(1<<4|BCF_BT_INT8, s) < 0;
-            e |= kputc(size, s) < 0;
+            *p++ = 1<<4|BCF_BT_INT32;
+            i32_to_le(size, p);
+            s->l += 6;
         }
-    } else e |= kputc(size<<4|type, s) < 0;
-    return e == 0 ? 0 : -1;
+    }
+    return 0;
 }
 
 static inline int bcf_enc_inttype(long x)
@@ -1553,27 +1564,35 @@ static inline int bcf_enc_inttype(long x)
 
 static inline int bcf_enc_int1(kstring_t *s, int32_t x)
 {
-    uint32_t e = 0;
-    uint8_t z[4];
+    if (ks_resize(s, s->l + 5) < 0)
+        return -1;
+    uint8_t *p = (uint8_t *)s->s + s->l;
+
     if (x == bcf_int32_vector_end) {
-        e |= bcf_enc_size(s, 1, BCF_BT_INT8);
-        e |= kputc(bcf_int8_vector_end, s) < 0;
+        // An inline implementation of bcf_enc_size with size==1 and
+        // memory allocation already accounted for.
+        *p = (1<<4) | BCF_BT_INT8;
+        p[1] = bcf_int8_vector_end;
+        s->l+=2;
     } else if (x == bcf_int32_missing) {
-        e |= bcf_enc_size(s, 1, BCF_BT_INT8);
-        e |= kputc(bcf_int8_missing, s) < 0;
+        *p = (1<<4) | BCF_BT_INT8;
+        p[1] = bcf_int8_missing;
+        s->l+=2;
     } else if (x <= BCF_MAX_BT_INT8 && x >= BCF_MIN_BT_INT8) {
-        e |= bcf_enc_size(s, 1, BCF_BT_INT8);
-        e |= kputc(x, s) < 0;
+        *p = (1<<4) | BCF_BT_INT8;
+        p[1] = x;
+        s->l+=2;
     } else if (x <= BCF_MAX_BT_INT16 && x >= BCF_MIN_BT_INT16) {
-        i16_to_le(x, z);
-        e |= bcf_enc_size(s, 1, BCF_BT_INT16);
-        e |= kputsn((char*)&z, 2, s) < 0;
+        *p = (1<<4) | BCF_BT_INT16;
+        i16_to_le(x, p+1);
+        s->l+=3;
     } else {
-        i32_to_le(x, z);
-        e |= bcf_enc_size(s, 1, BCF_BT_INT32);
-        e |= kputsn((char*)&z, 4, s) < 0;
+        *p = (1<<4) | BCF_BT_INT32;
+        i32_to_le(x, p+1);
+        s->l+=5;
     }
-    return e == 0 ? 0 : -1;
+
+    return 0;
 }
 
 /// Return the value of a single typed integer.

kstring.c

Lines changed: 11 additions & 2 deletions
@@ -204,8 +204,17 @@ char *kstrtok(const char *str, const char *sep_in, ks_tokaux_t *aux)
         for (p = start; *p; ++p)
             if (aux->tab[*p>>6]>>(*p&0x3f)&1) break;
     } else {
-        for (p = start; *p; ++p)
-            if (*p == aux->sep) break;
+        // Using strchr is fast for next token, but slower for
+        // last token due to extra pass from strlen. Overall
+        // on a VCF parse this func was 146% faster with strchr.
+        // Equiv to:
+        // for (p = start; *p; ++p) if (*p == aux->sep) break;
+
+        // NB: We could use strchrnul() here from glibc if detected,
+        // which is ~40% faster again, but it's not so portable.
+        // i.e. p = (uint8_t *)strchrnul((char *)start, aux->sep);
+        uint8_t *p2 = (uint8_t *)strchr((char *)start, aux->sep);
+        p = p2 ? p2 : start + strlen((char *)start);
     }
     aux->p = (const char *) p; // end of token
     if (*p == 0) aux->finished = 1; // no more tokens
