remove port check, document configuration with tc, and support layer 3 interfaces #1

Open: wants to merge 5 commits into main

Conversation

@arinc9 arinc9 commented Jul 20, 2025

Hey Matt!

This pull request removes the port check logic from the BPF programme, documents configuration with tc, brings support for layer 3 interfaces, and improves the documentation.

Cheers.
Chester A.

@arinc9 arinc9 changed the title {readme,tc}: remove port check and document configuration with tc remove port check, document configuration with tc, and support layer 3 interfaces Jul 21, 2025
@matttbe matttbe left a comment


Hi @arinc9,

Thank you for this PR. The modification in terms of code looks OK to me, just a small comment in the README and test.sh.

I quickly tested it and noticed a performance drop. I think I wrote something about that somewhere in the csum branch: I suspected this modification would cost performance, because TC has to parse each packet up to layer 4 to find the port, and the BPF program then does the same to read other parts of the L4 header. Without your modifications, I can typically reach ~3.25 Gbps with iperf3 -c 10.0.2.2 -ZR when using test.sh. In the same conditions, with your modifications, performance drops to ~2.6 Gbps, so around 20%. TBH, I wouldn't have expected such a big impact; ~20% seems quite high.

I like that it simplifies the eBPF C code, but the perf impact may not be worth it. It also depends on which other checks (packet mark?) need to be done. WDYT?

By chance, are there any other alternatives? In some BPF programs, I know you can set per-program variables, typically to configure the port(s) or the side. But I don't think you can do that here with the TC hooks, right? An alternative would be to use maps, but I guess there would be an impact as well (and the program could probably no longer be loaded by the tc filter command).
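For reference, a rough sketch of what the map-based alternative could look like, assuming the BPF object exposed a map (here called allowed_ports; the map name, pin path, and key/value layout are all hypothetical, not from this repository), configured from userspace with bpftool:

```shell
# Hypothetical: attach the program with no tc-level port filter, then feed
# the allowed port(s) through a pinned map. This requires the BPF object to
# declare an "allowed_ports" map, which the current code does not.
tc filter add dev "${IFACE}" egress bpf object-file tcp_in_udp_tc.o \
	section tc direct-action

# Pin the map and mark port 5201 (0x1451, little-endian bytes 51 14) as
# allowed. Key/value sizes depend on the map definition.
bpftool map pin name allowed_ports /sys/fs/bpf/tc/allowed_ports
bpftool map update pinned /sys/fs/bpf/tc/allowed_ports \
	key hex 51 14 value hex 01
```

Whether tc could still load such an object, and what the per-packet lookup would cost, would need to be measured.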

The other notes you have are still interesting. We could also document the idea and point to this PR to say that this version is more flexible, but there is a perf impact. (or do the opposite)

arinc9 commented Jul 28, 2025

> Without your modifications, I can typically reach ~3.25 Gbps with iperf3 -c 10.0.2.2 -ZR when using test.sh. In the same conditions, with your modifications, performance drops to ~2.6 Gbps, so around 20%. TBH, I wouldn't have expected such a big impact; ~20% seems quite high.

Can you try the u32 filter instead of flower? Let's see if that performs better. Example:

tc filter add dev "${IFACE}" egress  u32 match ip dport 5201 0xffff action goto chain 1

matttbe commented Jul 28, 2025

> Without your modifications, I can typically reach ~3.25 Gbps with iperf3 -c 10.0.2.2 -ZR when using test.sh. In the same conditions, with your modifications, performance drops to ~2.6 Gbps, so around 20%. TBH, I wouldn't have expected such a big impact; ~20% seems quite high.
>
> Can you try the u32 filter instead of flower? Let's see if that performs better. Example:
>
> tc filter add dev "${IFACE}" egress  u32 match ip dport 5201 0xffff action goto chain 1

Switching to one port instead of a range helps: from ~2.6 Gbps (~20% drop) to ~2.8 Gbps (~15% drop).

diff --git a/test.sh b/test.sh
index 0dc9466..26bb3e6 100755
--- a/test.sh
+++ b/test.sh
@@ -40,15 +40,15 @@ server()
 
 tc_client()
 {
-	local ns="${NS}_cpe" iface="int" port_start="5201" port_end="5203"
+	local ns="${NS}_cpe" iface="int" port="5201"
 
 	# ip netns will umount everything on exit
 	ip netns exec "${ns}" sh -c "mount -t debugfs none /sys/kernel/debug && cat /sys/kernel/debug/tracing/trace_pipe" &
 
 	tc -n "${ns}" qdisc add dev "${iface}" clsact
-	tc -n "${ns}" filter add dev "${iface}" egress  protocol ip flower ip_proto tcp dst_port "${port_start}"-"${port_end}" action goto chain 1
+	tc -n "${ns}" filter add dev "${iface}" egress  protocol ip flower ip_proto tcp dst_port "${port}" action goto chain 1
 	tc -n "${ns}" filter add dev "${iface}" egress  chain 1 bpf object-file tcp_in_udp_tc.o section tc action csum udp
-	tc -n "${ns}" filter add dev "${iface}" ingress protocol ip flower ip_proto udp src_port "${port_start}"-"${port_end}" action goto chain 1
+	tc -n "${ns}" filter add dev "${iface}" ingress protocol ip flower ip_proto udp src_port "${port}" action goto chain 1
 	tc -n "${ns}" filter add dev "${iface}" ingress chain 1 bpf object-file tcp_in_udp_tc.o section tc direct-action
 
 	tc -n "${ns}" filter show dev "${iface}" egress
@@ -61,15 +61,15 @@ tc_client()
 
 tc_server()
 {
-	local ns="${NS}_net" iface="int" port_start="5201" port_end="5203"
+	local ns="${NS}_net" iface="int" port="5201"
 
 	# ip netns will umount everything on exit
 	ip netns exec "${ns}" sh -c "mount -t debugfs none /sys/kernel/debug && cat /sys/kernel/debug/tracing/trace_pipe" &
 
 	tc -n "${ns}" qdisc add dev "${iface}" clsact
-	tc -n "${ns}" filter add dev "${iface}" egress  protocol ip flower ip_proto tcp src_port "${port_start}"-"${port_end}" action goto chain 1
+	tc -n "${ns}" filter add dev "${iface}" egress  protocol ip flower ip_proto tcp src_port "${port}" action goto chain 1
 	tc -n "${ns}" filter add dev "${iface}" egress  chain 1 bpf object-file tcp_in_udp_tc.o section tc action csum udp
-	tc -n "${ns}" filter add dev "${iface}" ingress protocol ip flower ip_proto udp dst_port "${port_start}"-"${port_end}" action goto chain 1
+	tc -n "${ns}" filter add dev "${iface}" ingress protocol ip flower ip_proto udp dst_port "${port}" action goto chain 1
 	tc -n "${ns}" filter add dev "${iface}" ingress chain 1 bpf object-file tcp_in_udp_tc.o section tc direct-action
 
 	tc -n "${ns}" filter show dev "${iface}" egress

And switching to u32 helps: from 2.8 to 3.05 Gbps (5% drop).

diff --git a/test.sh b/test.sh
index 26bb3e6..fbd4b50 100755
--- a/test.sh
+++ b/test.sh
@@ -46,9 +46,9 @@ tc_client()
 	ip netns exec "${ns}" sh -c "mount -t debugfs none /sys/kernel/debug && cat /sys/kernel/debug/tracing/trace_pipe" &
 
 	tc -n "${ns}" qdisc add dev "${iface}" clsact
-	tc -n "${ns}" filter add dev "${iface}" egress  protocol ip flower ip_proto tcp dst_port "${port}" action goto chain 1
+	tc -n "${ns}" filter add dev "${iface}" egress  u32 match ip dport ${port} 0xffff action goto chain 1
 	tc -n "${ns}" filter add dev "${iface}" egress  chain 1 bpf object-file tcp_in_udp_tc.o section tc action csum udp
-	tc -n "${ns}" filter add dev "${iface}" ingress protocol ip flower ip_proto udp src_port "${port}" action goto chain 1
+	tc -n "${ns}" filter add dev "${iface}" ingress u32 match ip sport ${port} 0xffff action goto chain 1
 	tc -n "${ns}" filter add dev "${iface}" ingress chain 1 bpf object-file tcp_in_udp_tc.o section tc direct-action
 
 	tc -n "${ns}" filter show dev "${iface}" egress
@@ -67,9 +67,9 @@ tc_server()
 	ip netns exec "${ns}" sh -c "mount -t debugfs none /sys/kernel/debug && cat /sys/kernel/debug/tracing/trace_pipe" &
 
 	tc -n "${ns}" qdisc add dev "${iface}" clsact
-	tc -n "${ns}" filter add dev "${iface}" egress  protocol ip flower ip_proto tcp src_port "${port}" action goto chain 1
+	tc -n "${ns}" filter add dev "${iface}" egress  u32 match ip sport ${port} 0xffff action goto chain 1
 	tc -n "${ns}" filter add dev "${iface}" egress  chain 1 bpf object-file tcp_in_udp_tc.o section tc action csum udp
-	tc -n "${ns}" filter add dev "${iface}" ingress protocol ip flower ip_proto udp dst_port "${port}" action goto chain 1
+	tc -n "${ns}" filter add dev "${iface}" ingress u32 match ip dport ${port} 0xffff action goto chain 1
 	tc -n "${ns}" filter add dev "${iface}" ingress chain 1 bpf object-file tcp_in_udp_tc.o section tc direct-action
 
 	tc -n "${ns}" filter show dev "${iface}" egress

A 5% drop for more flexibility seems OK.

Do you have similar results on your side?

arinc9 commented Jul 28, 2025

I can also see u32 performing better on a single thread (iperf3 has used a thread per stream since 2023).

protocol ip flower ip_proto tcp src_port 5201 and equivalent on the other side:

[  5] local 10.0.0.2 port 50618 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.08 GBytes  9.29 Gbits/sec                  
[  5]   1.00-2.00   sec  1.08 GBytes  9.25 Gbits/sec                  
[  5]   2.00-3.00   sec  1.09 GBytes  9.40 Gbits/sec                  
[  5]   3.00-4.00   sec  1.12 GBytes  9.61 Gbits/sec                  
[  5]   4.00-5.00   sec  1.11 GBytes  9.54 Gbits/sec                  
[  5]   5.00-6.00   sec  1.11 GBytes  9.54 Gbits/sec                  
[  5]   6.00-7.00   sec  1.12 GBytes  9.60 Gbits/sec                  
[  5]   7.00-8.00   sec  1.10 GBytes  9.45 Gbits/sec                  
[  5]   8.00-9.00   sec  1.09 GBytes  9.33 Gbits/sec                  
[  5]   9.00-10.00  sec  1.13 GBytes  9.67 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.0 GBytes  9.47 Gbits/sec    0            sender
[  5]   0.00-10.00  sec  11.0 GBytes  9.47 Gbits/sec                  receiver

u32 match tcp src 5201 0xffff and equivalent on the other side:

[  5] local 10.0.0.2 port 53260 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.21 GBytes  10.4 Gbits/sec                  
[  5]   1.00-2.00   sec  1.19 GBytes  10.2 Gbits/sec                  
[  5]   2.00-3.00   sec  1.21 GBytes  10.4 Gbits/sec                  
[  5]   3.00-4.00   sec  1.20 GBytes  10.3 Gbits/sec                  
[  5]   4.00-5.00   sec  1.24 GBytes  10.6 Gbits/sec                  
[  5]   5.00-6.00   sec  1.22 GBytes  10.5 Gbits/sec                  
[  5]   6.00-7.00   sec  1.23 GBytes  10.6 Gbits/sec                  
[  5]   7.00-8.00   sec  1.23 GBytes  10.6 Gbits/sec                  
[  5]   8.00-9.00   sec  1.23 GBytes  10.6 Gbits/sec                  
[  5]   9.00-10.00  sec  1.21 GBytes  10.4 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.2 GBytes  10.5 Gbits/sec    0            sender
[  5]   0.00-10.00  sec  12.2 GBytes  10.5 Gbits/sec                  receiver

No tc filter, with discrimination done in the BPF programme:

[  5] local 10.0.0.2 port 36288 connected to 10.0.0.1 port 5201
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  1.26 GBytes  10.9 Gbits/sec                  
[  5]   1.00-2.00   sec  1.31 GBytes  11.2 Gbits/sec                  
[  5]   2.00-3.00   sec  1.31 GBytes  11.2 Gbits/sec                  
[  5]   3.00-4.00   sec  1.29 GBytes  11.1 Gbits/sec                  
[  5]   4.00-5.00   sec  1.26 GBytes  10.8 Gbits/sec                  
[  5]   5.00-6.00   sec  1.26 GBytes  10.8 Gbits/sec                  
[  5]   6.00-7.00   sec  1.25 GBytes  10.8 Gbits/sec                  
[  5]   7.00-8.00   sec  1.24 GBytes  10.6 Gbits/sec                  
[  5]   8.00-9.00   sec  1.25 GBytes  10.7 Gbits/sec                  
[  5]   9.00-10.00  sec  1.26 GBytes  10.9 Gbits/sec                  
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  12.7 GBytes  10.9 Gbits/sec    0            sender
[  5]   0.00-10.00  sec  12.7 GBytes  10.9 Gbits/sec                  receiver

Having multiple filters, to forward more than one port to the BPF programme, won't degrade performance. So I'm going to change my patch to use u32.
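The multi-port setup mentioned above could look like this (a sketch only; the interface variable and port list are illustrative):

```shell
# One u32 filter per port, all funnelling into chain 1, where the BPF
# program is attached once per direction.
for port in 5201 5202 5203; do
	tc filter add dev "${IFACE}" egress  u32 match ip dport "${port}" 0xffff \
		action goto chain 1
	tc filter add dev "${IFACE}" ingress u32 match ip sport "${port}" 0xffff \
		action goto chain 1
done
tc filter add dev "${IFACE}" egress  chain 1 bpf object-file tcp_in_udp_tc.o \
	section tc action csum udp
tc filter add dev "${IFACE}" ingress chain 1 bpf object-file tcp_in_udp_tc.o \
	section tc direct-action
```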

arinc9 commented Jul 28, 2025

u32 match tcp src 5201 0xffff won't match, whilst u32 match ip sport 5201 0xffff will. Looking into why.
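One likely explanation, going by the tc-u32 man page: match tcp src compares bytes relative to the nexthdr offset, which stays 0 unless a linked hash table derives it from the IPv4 header length, whereas match ip sport uses a fixed offset of 20 bytes into the IP header (and therefore only works when the IPv4 header carries no options). The canonical man-page setup for a working tcp match looks roughly like this (an untested sketch; interface and chain are examples):

```shell
# Create a hash table (handle 1:) to hold the TCP match, then link to it
# from the main table while deriving the nexthdr offset from the IHL field
# (low nibble of byte 0, times 4: hence mask 0x0f00 and shift 6 on the
# 16-bit read at offset 0).
tc filter add dev "${IFACE}" ingress handle 1: protocol ip u32 divisor 1
tc filter add dev "${IFACE}" ingress protocol ip u32 \
	match ip protocol 6 0xff \
	offset at 0 mask 0x0f00 shift 6 \
	link 1:
tc filter add dev "${IFACE}" ingress protocol ip u32 ht 1:: \
	match tcp src 5201 0xffff action goto chain 1
```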

arinc9 commented Jul 29, 2025

Before I get into the tc filter issue, here's my test result with the offloads. Setting gso_max_segs to 0 or 1 is necessary, and it's the only option that has an effect; there is no need to turn off any offloading option with ethtool. I'm testing on a veth interface with these offload options (untouched; LRO is not supported on veth):

$ sudo ip netns exec client ethtool -k eth0 | grep ": on"
rx-checksumming: on
tx-checksumming: on
        tx-checksum-ip-generic: on
        tx-checksum-sctp: on
scatter-gather: on
        tx-scatter-gather: on
        tx-scatter-gather-fraglist: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
generic-segmentation-offload: on
rx-vlan-offload: on
tx-vlan-offload: on
highdma: on
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: on
tx-ipxip6-segmentation: on
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-sctp-segmentation: on
tx-udp-segmentation: on
tx-gso-list: on
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: on
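The gso_max_segs setting described above can be applied with ip link (the interface name here is an example):

```shell
# Prevent GSO from handing multi-segment super-packets to the TC hook, so
# each encapsulated TCP segment fits in a single UDP datagram.
ip link set dev eth0 gso_max_segs 1
```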

matttbe commented Jul 29, 2025

> Setting gso_max_segs to 0 or 1 is necessary, and it's the only option that has an effect; there is no need to turn off any offloading option with ethtool. I'm testing on a veth interface with these offload options (untouched; LRO is not supported on veth):

Thank you for having checked that. Don't hesitate to update the README section. In egress, gso_max_segs should indeed be enough. In ingress, I think we don't need anything: we would have UDP packets, and I think UDP GRO is only done on demand, e.g. when userspace asks for it (setsockopt(IPPROTO_UDP, UDP_GRO)) or for some in-kernel tunnels. So it is possible that GRO and LRO don't need to be disabled, but that would have to be confirmed with HW supporting them. We don't want the packets to be merged or split.

arinc9 commented Jul 29, 2025

The only information I could find about LRO in the kernel source code is here:

https://github.com/torvalds/linux/blob/86aa721820952b793a12fc6e5a01734186c0c238/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst#generic-receive-offload-aka-gro

If LRO only supports TCP, as documented there, then we don't need to disable it, as we receive UDP packets.

matttbe commented Jul 29, 2025

Indeed, apparently LRO is TCP (and IPv4?) only: https://lwn.net/Articles/358910/

> One other nice thing about GRO is that, unlike LRO, it is not limited to TCP/IPv4.

So if LRO is not for UDP and GRO with UDP is on demand only, I guess it means we don't need to change any HW offload.

Don't hesitate to reflect that in the README file. (Probably no need to change anything in test.sh, because there, the tunnelling is not done on the client/server, where the TCP connection is handled.)

arinc9 commented Jul 29, 2025

What about the commented-out commands to set gso_max_segs? Don't I need to uncomment those?

matttbe commented Jul 29, 2025

> What about the commented-out commands to set gso_max_segs? Don't I need to uncomment those?

I would say no: the test env is a bit particular:

cli --------- cpe --------- int --------- net --------- srv
       TCP           UDP           UDP           TCP

To be able to see the packets before and after the TC hooks, the BPF hooks are loaded on the cpe and net hosts, not on cli and srv, which generate the TCP traffic (and where gso_max_segs would do something). In this test env, we need to disable GRO so that we don't receive aggregated TCP packets that cannot fit in one UDP packet on the wire.

See: dae7fd2

Or we could have 2 test envs:

  • the existing one.
  • a new one with just cli-int-srv: the TC hooks are loaded on the client and server, plus gso_max_segs is set on both sides.
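A rough sketch of that second environment, for illustration only (the namespace and interface names are made up, and the server side would mirror this with sport and dport swapped):

```shell
# cli-int-srv variant: the TC hooks sit on the TCP endpoints themselves,
# so gso_max_segs has an effect there. Client side shown; untested.
ip netns exec cli ip link set dev int gso_max_segs 1
ip netns exec cli tc qdisc add dev int clsact
ip netns exec cli tc filter add dev int egress \
	u32 match ip dport 5201 0xffff action goto chain 1
ip netns exec cli tc filter add dev int egress chain 1 \
	bpf object-file tcp_in_udp_tc.o section tc action csum udp
ip netns exec cli tc filter add dev int ingress \
	u32 match ip sport 5201 0xffff action goto chain 1
ip netns exec cli tc filter add dev int ingress chain 1 \
	bpf object-file tcp_in_udp_tc.o section tc direct-action
```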

@arinc9 arinc9 force-pushed the pr branch 2 times, most recently from c065ac6 to 69807db Compare August 8, 2025 18:22
Offloads other than GSO and GRO do not break this type of traffic. Document
disabling GSO and explain why disabling GRO is not needed.

Signed-off-by: Chester A. Unal <[email protected]>
arinc9 commented Aug 8, 2025

@matttbe let me know if the current version of the series is OK.

arinc9 added 4 commits August 8, 2025 19:34
The layer 4 protocol and UDP or TCP port can be distinguished by a tc
filter. Document that and remove the logic to discriminate packets by UDP
or TCP port from the BPF programme.

Add warnings to the README.

Signed-off-by: Chester A. Unal <[email protected]>
Cellular interfaces do not include a layer 2 header. When reading the Ethernet
header, if no IPv4 or IPv6 header is found, assume that the packet does not
have an Ethernet header and check whether the protocol is IPv4 or IPv6.

Signed-off-by: Chester A. Unal <[email protected]>
Remove the unused includes. Sort in alphabetical order where possible.

Signed-off-by: Chester A. Unal <[email protected]>
Only the make, clang, libelf-dev, libc6-dev-i386, and libbpf-dev packages
are needed. Document them.

Signed-off-by: Chester A. Unal <[email protected]>
arinc9 commented Aug 15, 2025

@matttbe reminder this is still up for review.
