Commit 2318975

Further VCF reading speed optimisations.
This is the main meat of the VCF read speedup, following on from the
previous code refactoring.  Combined timings, for different compilers,
on a very INFO-heavy GNOMAD single-sample file, a many-sample
(approx 4000) FORMAT-rich file, and the GIAB HG002 VCF truth set are:

    INFO heavy     Xeon (15-29% speedup)   Zen4 (34-39% speedup)
                   dev(s)   PR(s)          dev(s)   PR(s)
    clang13        6.29     5.34           2.84     1.85
    gcc13          6.74     5.22           2.93     1.93
    gcc7           7.96     5.65           3.25     1.98

    FORMAT heavy   Xeon (6-19% speedup)    Zen4 (18-22% speedup)
                   dev      PR             dev      PR
    clang13        9.17     8.58           5.45     4.48
    gcc13          9.88     8.04           5.08     3.95
    gcc7           9.12     8.33           4.87     3.98

    GIAB HG002     Xeon (28-29% speedup)   Zen4 (33-37% speedup)
                   dev      PR             dev      PR
    clang13        12.88    9.30           5.12     3.29
    gcc13          12.04    8.60           4.74     3.19
    gcc7           12.87    9.37           5.32     3.34

(Tested on an Intel Xeon Gold 6142 and an AMD Zen4 respectively.)
Bigger speedups (see first message in PR) were seen on some older
hardware.

Specific optimisations, along with estimates of their benefit, in
approximate order of writing / testing:

- Adding consts and caching of bcf_hdr_nsamples(h).  No difference on
  the system gcc (gcc7) and clang13, but a couple of percent gain on
  gcc13.

- Remove the need for most calls to hts_str2uint by recognising that
  most GT numbers are single digits.  This was a 4-5% saving for gcc
  and 9-10% on clang.

- Remove kputc calls in bcf_enc_vint / bcf_enc_size, avoiding repeated
  ks_resize checking.  This is a further ~10% speedup.

- Unrolling in bcf_enc_vint to encourage SIMD.

- Improve speed of bgzf_getline and kstrtok via memchr/strchr.  In
  tabix timings indexing VCF, the bgzf_getline change is 9-22% quicker
  with clang13 and 19-25% quicker with gcc7.

  I did investigate a manually unrolled 64-bit search, before I
  remembered the existence of memchr (doh!).  This is often faster on
  clang (17-19%), but marginally slower on gcc.  The actual speedup to
  this function however is considerably more (3-4x quicker).
  For interest, I include the equivalent code here, as it may be
  useful in other contexts:

    #if HTS_ALLOW_UNALIGNED != 0 && ULONG_MAX == 0xffffffffffffffff
    // 64-bit unrolled delim detection
    #define haszero(x) (((x)-0x0101010101010101UL)&~(x)&0x8080808080808080UL)

    // Quicker broadcast on clang than bit shuffling in delim
    union { uint64_t d8; uint8_t d[8]; } u;
    memset(u.d, delim, 8);
    const uint64_t d8 = u.d8;

    uint64_t *b8 = (uint64_t *)(&buf[fp->block_offset]);
    const int l8 = (fp->block_length-fp->block_offset)/8;
    for (l = 0; l < (l8 & ~3); l+=4) {
        if (haszero(b8[l+0] ^ d8)) break;
        if (haszero(b8[l+1] ^ d8)) { l++;  break; }
        if (haszero(b8[l+2] ^ d8)) { l+=2; break; }
        if (haszero(b8[l+3] ^ d8)) { l+=3; break; }
    }
    l *= 8;
    for (l += fp->block_offset; l < fp->block_length && buf[l] != delim; l++);

  The analogous kstrtok change is using strchr+strlen instead of
  memchr, as we don't know the string end.  This makes kstrtok around
  150% quicker when parsing a single-sample VCF.  When not finding
  aux->sep in the string, strchr returns NULL rather than the end of
  the string, so we need an additional strlen to set aux->p.  However
  there is also glibc's strchrnul which solves this in a single call.
  That makes kstrtok another 40% quicker on this test, but overall
  it's not a big bottleneck any more.

- Use strchr in vcf_parse_info.  This is a major speed increase over
  manual searching on Linux.  TODO: is this just glibc?  Eg libmusl
  speeds, etc?  Other OSes?

  It saves about 33% of the time in vcf_parse (vcf_parse_info is
  inlined into it) with gcc.  Even more with clang.  The total speed
  gain on a single-sample VCF view (GIAB truth set) is 12-19% fewer
  cycles.

- Minor "GT" check improvement.  This has no real effect on gcc13 and
  clang13, but the system gcc (gcc7) speeds up single-sample VCF
  decoding by 7%.

- Speed up the unknown value check (strcmp(p, ".")).  Helps gcc7 the
  most (9%), with gcc13/clang13 seeing 3-4% gains.

- Speed up vcf_parse_format_max3.  This is the first parse through the
  FORMAT fields.
  Ideally we'd merge this and fill5 (the other parse through), but
  that is harder due to the data pivot / rotate.  For now we just
  optimise the existing code path.  Instead of a laborious
  character-by-character switch, we have an initial tight loop to find
  the first meta-character and then a switch to do the
  character-dependent code.  This is a 5% to 13% speedup depending on
  the data set.

- Remove kputc and minimise resizes for bcf_enc_int1.  3-8% speedup
  depending on data / compiler.

- Use memcmp instead of strcmp for "END" and ensure we have room.
  Also memset over explicit nulling of arrays.

- Force BCF header dicts to be larger than needed.  This is a tactic
  to reduce hash collisions due to the use of overly simple hash
  functions.  It seems to typically be around a 3-8% speed gain.

- Restructure of the main vcf_parse function.  This can speed things
  up by 6-7% on basic single-sample files.  The previous loop caused
  lots of branch-prediction misses due to the counter 'i' being used
  to select 8 different parts of code depending on the token number.
  Additionally it now has better error checking, as previously running
  out of tokens early just did a "return 0" rather than complaining
  about missing columns.
1 parent 7c1d3cc commit 2318975

File tree

4 files changed: +476 -189 lines changed

bgzf.c

Lines changed: 7 additions & 1 deletion
@@ -2280,7 +2280,13 @@ int bgzf_getline(BGZF *fp, int delim, kstring_t *str)
             if (fp->block_length == 0) { state = -1; break; }
         }
         unsigned char *buf = fp->uncompressed_block;
-        for (l = fp->block_offset; l < fp->block_length && buf[l] != delim; ++l);
+
+        // Equivalent to a naive byte by byte search from
+        // buf + block_offset to buf + block_length.
+        void *e = memchr(&buf[fp->block_offset], delim,
+                         fp->block_length - fp->block_offset);
+        l = e ? (unsigned char *)e - buf : fp->block_length;
+
         if (l < fp->block_length) state = 1;
         l -= fp->block_offset;
         if (ks_expand(str, l + 2) < 0) { state = -3; break; }

htslib/vcf.h

Lines changed: 52 additions & 33 deletions
@@ -1522,26 +1522,37 @@ static inline int bcf_format_gt(bcf_fmt_t *fmt, int isample, kstring_t *str)
 
 static inline int bcf_enc_size(kstring_t *s, int size, int type)
 {
-    uint32_t e = 0;
-    uint8_t x[4];
-    if (size >= 15) {
-        e |= kputc(15<<4|type, s) < 0;
-        if (size >= 128) {
-            if (size >= 32768) {
-                i32_to_le(size, x);
-                e |= kputc(1<<4|BCF_BT_INT32, s) < 0;
-                e |= kputsn((char*)&x, 4, s) < 0;
-            } else {
-                i16_to_le(size, x);
-                e |= kputc(1<<4|BCF_BT_INT16, s) < 0;
-                e |= kputsn((char*)&x, 2, s) < 0;
-            }
+    // Most common case is first
+    if (size < 15) {
+        if (ks_resize(s, s->l + 1) < 0)
+            return -1;
+        uint8_t *p = (uint8_t *)s->s + s->l;
+        *p++ = (size<<4) | type;
+        s->l++;
+        return 0;
+    }
+
+    if (ks_resize(s, s->l + 6) < 0)
+        return -1;
+    uint8_t *p = (uint8_t *)s->s + s->l;
+    *p++ = 15<<4|type;
+
+    if (size < 128) {
+        *p++ = 1<<4|BCF_BT_INT8;
+        *p++ = size;
+        s->l += 3;
+    } else {
+        if (size < 32768) {
+            *p++ = 1<<4|BCF_BT_INT16;
+            i16_to_le(size, p);
+            s->l += 4;
         } else {
-            e |= kputc(1<<4|BCF_BT_INT8, s) < 0;
-            e |= kputc(size, s) < 0;
+            *p++ = 1<<4|BCF_BT_INT32;
+            i32_to_le(size, p);
+            s->l += 6;
         }
-    } else e |= kputc(size<<4|type, s) < 0;
-    return e == 0 ? 0 : -1;
+    }
+    return 0;
 }
 
 static inline int bcf_enc_inttype(long x)
@@ -1553,27 +1564,35 @@ static inline int bcf_enc_inttype(long x)
 
 static inline int bcf_enc_int1(kstring_t *s, int32_t x)
 {
-    uint32_t e = 0;
-    uint8_t z[4];
+    if (ks_resize(s, s->l + 5) < 0)
+        return -1;
+    uint8_t *p = (uint8_t *)s->s + s->l;
+
     if (x == bcf_int32_vector_end) {
-        e |= bcf_enc_size(s, 1, BCF_BT_INT8);
-        e |= kputc(bcf_int8_vector_end, s) < 0;
+        // An inline implementation of bcf_enc_size with size==1 and
+        // memory allocation already accounted for.
+        *p = (1<<4) | BCF_BT_INT8;
+        p[1] = bcf_int8_vector_end;
+        s->l+=2;
     } else if (x == bcf_int32_missing) {
-        e |= bcf_enc_size(s, 1, BCF_BT_INT8);
-        e |= kputc(bcf_int8_missing, s) < 0;
+        *p = (1<<4) | BCF_BT_INT8;
+        p[1] = bcf_int8_missing;
+        s->l+=2;
     } else if (x <= BCF_MAX_BT_INT8 && x >= BCF_MIN_BT_INT8) {
-        e |= bcf_enc_size(s, 1, BCF_BT_INT8);
-        e |= kputc(x, s) < 0;
+        *p = (1<<4) | BCF_BT_INT8;
+        p[1] = x;
+        s->l+=2;
     } else if (x <= BCF_MAX_BT_INT16 && x >= BCF_MIN_BT_INT16) {
-        i16_to_le(x, z);
-        e |= bcf_enc_size(s, 1, BCF_BT_INT16);
-        e |= kputsn((char*)&z, 2, s) < 0;
+        *p = (1<<4) | BCF_BT_INT16;
+        i16_to_le(x, p+1);
+        s->l+=3;
     } else {
-        i32_to_le(x, z);
-        e |= bcf_enc_size(s, 1, BCF_BT_INT32);
-        e |= kputsn((char*)&z, 4, s) < 0;
+        *p = (1<<4) | BCF_BT_INT32;
+        i32_to_le(x, p+1);
+        s->l+=5;
     }
-    return e == 0 ? 0 : -1;
+
+    return 0;
 }
 
 /// Return the value of a single typed integer.

kstring.c

Lines changed: 11 additions & 2 deletions
@@ -204,8 +204,17 @@ char *kstrtok(const char *str, const char *sep_in, ks_tokaux_t *aux)
         for (p = start; *p; ++p)
             if (aux->tab[*p>>6]>>(*p&0x3f)&1) break;
     } else {
-        for (p = start; *p; ++p)
-            if (*p == aux->sep) break;
+        // Using strchr is fast for next token, but slower for
+        // last token due to extra pass from strlen. Overall
+        // on a VCF parse this func was 146% faster with strchr.
+        // Equiv to:
+        // for (p = start; *p; ++p) if (*p == aux->sep) break;
+
+        // NB: We could use strchrnul() here from glibc if detected,
+        // which is ~40% faster again, but it's not so portable.
+        // i.e. p = (uint8_t *)strchrnul((char *)start, aux->sep);
+        uint8_t *p2 = (uint8_t *)strchr((char *)start, aux->sep);
+        p = p2 ? p2 : start + strlen((char *)start);
     }
     aux->p = (const char *) p; // end of token
     if (*p == 0) aux->finished = 1; // no more tokens
