lightway-server: Use i/o uring for all i/o, not just tun.
This does not consistently improve throughput, but it reduces CPU overhead by around 50%-100% (i.e. half to one core) under heavy traffic, adding perhaps a few hundred Mbps to a speedtest.net download test while making a negligible difference to the upload test. It also removes about 1 ms from the latency in the same tests. Finally, the standard deviation across multiple test runs appears to be lower.
This appears to be due to a combination of avoiding async runtime overheads and removing various channels/queues in favour of a more direct model of interaction between the ring and the connections.
On top of those benefits, we can now reach the same level of performance with far fewer slots on the TUN rx path: 64 slots (the new default) match the performance previously achieved with 1024. The way io_uring handles blocking vs async mode for tun devices seems to be suboptimal. In blocking mode things are very slow. In async mode, as the number of slots increases, more and more time is spent on bookkeeping and polling, and a high proportion of requests complete with EAGAIN (due to a request timing out after multiple failed polls[^0]), which wastes time requeueing. This is related to
axboe/liburing#886 and
axboe/liburing#239.
For UDP/TCP sockets, io_uring behaves well with the socket in blocking mode, which avoids processing lots of EAGAIN results.
Tuning the slots for each I/O path is a bit of an art (more is definitely not always better) and the sweet spot varies depending on the I/O device, so provide various tunables instead of just splitting the ring evenly. With this there's no real reason to have a very large ring; it's the number of in-flight requests which matters.
This is specific to the server since it relies on kernel features and correctness (i.e. a lack of bugs) which may not hold on an arbitrary client system, while server operators are assumed to have more control over what they run. It is also not portable to non-Linux systems. It is known to work with Linux 6.1 (as found in Debian 12, AKA bookworm).
Note that this kernel version contains a bug which causes the `iou-sqp-*` kernel thread to get stuck (unkillable) if the tun is in blocking mode, so an option is provided to control this. Enabling that option on a kernel which contains [the fix][] allows equivalent performance with fewer slots on the ring.
[^0]: When data becomes available _all_ requests are woken but only one will
    find data; the rest see EAGAIN, and after a certain number of such
    events io_uring will propagate this back to userspace.
[the fix]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=438b406055cd21105aad77db7938ee4720b09bee