
Netstacklat: Evaluation of porting Simon's netstacklat to ebpf_exporter #1

Open
wants to merge 15 commits into base: master

Conversation

netoptimizer
Owner

Created this PR so @simosund can see the changes and evaluation numbers from the prod experiment.

Based on code from PR xdp-project/bpf-examples#129

This is a direct copy from bpf-examples
but from Simon's devel branch 'netstacklat-groupby'

https://github.com/simosund/bpf-examples/tree/netstacklat-groupby/netstacklat

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Keeping this as a separate commit to track what needed to change

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Gotcha#1: My devel laptop has a TAI offset of zero
 - Other systems (incl. prod) all have 37 sec

Gotcha#2: RX-timestamping needs to be enabled manually
 - something else has to enable RX-timestamping

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
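As a side note on Gotcha#1, the kernel's current TAI-UTC offset can be read from userspace via adjtimex(2). A minimal sketch (not part of the PR):

/* Minimal sketch, not part of the PR: print the kernel's TAI-UTC offset.
 * Correctly configured systems report 37 s; a machine that never learned
 * the leap-second count (e.g. a devel laptop without full NTP sync)
 * reports 0.
 */
#include <stdio.h>
#include <sys/timex.h>

int main(void)
{
	struct timex tx = { 0 };

	if (adjtimex(&tx) < 0) {
		perror("adjtimex");
		return 1;
	}
	printf("TAI offset: %d s\n", tx.tai);
	return 0;
}

Regarding Gotcha#2: software RX timestamping is normally switched on system-wide as soon as some process requests it, e.g. via the SO_TIMESTAMPING socket option with SOF_TIMESTAMPING_RX_SOFTWARE; presumably the stand-alone netstacklat binary does this itself, so when only the BPF object is loaded by ebpf_exporter, something else has to enable it.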
For ebpf_exporter we cannot control which BPF sections
get loaded. Instead we disable some of the hooks at
compile time via define/ifdefs.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
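A rough sketch of that pattern (the macro name and which hooks get this treatment are assumptions, not the PR's exact identifiers):

/* Illustrative only: ebpf_exporter attaches every SEC() program it finds
 * in the object, so unwanted hooks are compiled out of the object instead
 * of being skipped at load time.
 */
#ifdef NETSTACKLAT_DISABLE_UDP_HOOK
/* hook left out of the BPF object entirely */
#else
SEC("fentry/skb_consume_udp")
int BPF_PROG(netstacklat_skb_consume_udp, struct sock *sk, struct sk_buff *skb)
{
	/* ... record latency for this hook ... */
	return 0;
}
#endif /* NETSTACKLAT_DISABLE_UDP_HOOK */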
Instead, hardcode ifindex limits based on the prod setup,
as we don't have a way to configure this via YAML.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
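Roughly like this (the constant values and helper name are assumptions based on the commit text, not the actual code):

/* Sketch only: with no YAML knob available in ebpf_exporter, the accepted
 * ifindex range is baked in at compile time and checked before recording.
 */
#define IFINDEX_MIN 2	/* illustrative values for the prod NICs */
#define IFINDEX_MAX 10

static __always_inline bool filter_ifindex(__u32 ifindex)
{
	return ifindex >= IFINDEX_MIN && ifindex <= IFINDEX_MAX;
}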
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 18885872401 run_cnt 71,443,820 = 264.34 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 6124797061 run_cnt 16,324,309 = 375.19 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
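(For reference: run_time_ns and run_cnt are the kernel's per-program BPF statistics, visible e.g. in bpftool prog show once kernel.bpf_stats_enabled=1; the per-invocation cost above is simply run_time_ns / run_cnt.)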
Something is buggy with this filter
 - All latency records are in the max bucket

The READ_ONCE change doesn't fix the issue

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
This was the real reason for seeing wrong numbers in prod.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
…sync

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 32164560127 run_cnt 116,590,498 = 275.88 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 10490230543 run_cnt 23,993,428 = 437.21 ns

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Moved filter_nonempty_sockqueue to callers
 - because record_socket_latency() becomes a BPF function-call
 - e.g. perf shows bpf_prog_fb69587c6ea462b7_record_socket_latency

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 181391511590 run_cnt 788663546 = 229.99 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 16212598612 run_cnt 137812779 = 117.64 ns

This is clearly a huge improvement for UDP packets.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
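A simplified sketch of the resulting caller-side structure (hook and helper names follow the commit; bodies and exact signatures are assumptions):

/* Sketch, not the PR's exact code. The point: filtered packets return
 * before paying for the bpf-to-bpf call into record_socket_latency(),
 * which perf shows as bpf_prog_fb69587c6ea462b7_record_socket_latency.
 */
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>
#include <bpf/bpf_core_read.h>

static __always_inline bool filter_nonempty_sockqueue(struct sock *sk)
{
	/* assumption: skip sockets whose receive queue is empty */
	return BPF_CORE_READ(sk, sk_receive_queue.qlen) > 0;
}

static __noinline int record_socket_latency(struct sock *sk)
{
	/* histogram update lives here; compiled as a separate BPF function */
	return 0;
}

SEC("fentry/tcp_recv_timestamp")
int BPF_PROG(netstacklat_tcp_recv_timestamp, void *msg, struct sock *sk)
{
	if (!filter_nonempty_sockqueue(sk))
		return 0;	/* cheap inline check, no function call */

	return record_socket_latency(sk);
}

char LICENSE[] SEC("license") = "GPL";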
Enable filter_nonempty_sockqueue at compile time to make sure
that this config setting doesn't influence performance.

I'm only seeing a small effect.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
The filter_nonempty_sockqueue() is efficient for UDP packets,
but doesn't work well for TCP packets.

Add filtering on socket queue length. For TCP, try filtering
out samples where qlen is not above 3 packets.

Execution runtime: netstacklat_tcp_recv_timestamp
 - run_time_ns 10690540076 run_cnt 117852699 = 90.71 ns

Execution runtime: netstacklat_skb_consume_udp
 - run_time_ns 2206621632 run_cnt 20004338 = 110.30 ns

This is a HUGE improvement for the TCP case.

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
@@ -462,6 +466,9 @@ int BPF_PROG(netstacklat_tcp_recv_timestamp, void *msg, struct sock *sk,
	if (!filter_nonempty_sockqueue(sk))
		return 0;

	if (!filter_queue_len(sk, 3))
		return 0;
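For reference, a guess at what filter_queue_len() looks like (the actual helper in Simon's branch may read the queue length differently):

/* Assumption, not the PR's code: only record latency when the socket's
 * receive queue holds more than 'limit' packets.
 */
static __always_inline bool filter_queue_len(struct sock *sk, __u32 limit)
{
	return BPF_CORE_READ(sk, sk_receive_queue.qlen) > limit;
}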

In a previous commit you mentioned that filter_nonempty_sockqueue() seemed to be bugged, resulting in all entries ending up in the max latency bucket. Does this solution not suffer from that issue?

If this doesn't have that issue, and you have time, I would appreciate it if you could also try without CONFIG_MAP_MACROS, to see if the issue is perhaps related to my increment_exp2_histogram_nosync() implementation (i.e. whether switching over to the macro is what made the bug disappear).

The filter should of course only affect which values are recorded (but not change the values themselves), and if what should be the equivalent of qlen > 0 results in only max latency values, then logically qlen > 3 should also produce only max latency values (as that's a strict subset of qlen > 0).

Also, good to see that it reduced the overhead a lot. But it would also be good to check how this filter affects the latency distribution (compared to running without the filter), and whether it still captures most of the high-latency instances. I guess the assumption behind this filter is that it's mostly interesting to capture the instances where we have high latency on sockets with a bit of a backlog of data to deliver. But if it turns out that most of the high-latency values come from sockets with a single packet (i.e. sockets that become empty after the read), then we may need to reconsider whether this is a sensible filter.
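For context on the CONFIG_MAP_MACROS question above, a rough sketch of what a non-synchronized exp2 histogram update can look like (assumed shape only; Simon's increment_exp2_histogram_nosync() and ebpf_exporter's macro may differ):

/* Assumed shape, not the actual implementation: bucket = floor(log2(val)),
 * clamped into the last (max) bucket, incremented without atomics. With a
 * per-CPU map the plain increment is safe; on a shared map it can lose
 * updates, which is the trade-off "nosync" refers to.
 */
#define HIST_MAX_BUCKET 34

struct {
	__uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
	__uint(max_entries, HIST_MAX_BUCKET + 1);
	__type(key, __u32);
	__type(value, __u64);
} latency_hist SEC(".maps");

static __always_inline void hist_increment_nosync(__u64 val)
{
	__u32 bucket = 0;
	__u64 *cnt;

	while (val > 1 && bucket < HIST_MAX_BUCKET) {	/* floor(log2(val)) */
		val >>= 1;
		bucket++;
	}

	cnt = bpf_map_lookup_elem(&latency_hist, &bucket);
	if (cnt)
		(*cnt)++;	/* no __sync_fetch_and_add() */
}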
