3 changes: 2 additions & 1 deletion docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,9 +2,10 @@

## Overview
This is a beta version which
- Uses [DPDK Graph Framework](https://doc.dpdk.org/guides/prog_guide/graph_lib.html) for the data plane. DPDK version 23.11 LTS or compatible needed.
- Uses [DPDK Graph Framework](https://doc.dpdk.org/guides/prog_guide/graph_lib.html) for the data plane. DPDK version 24.11 LTS or compatible needed.
- [rte_flow](https://doc.dpdk.org/guides/prog_guide/rte_flow.html) offloading between the virtual network interfaces on a single hypervisor.
- Uses GRPC to add virtual interfaces, loadbalancers, NAT Gateways and routes. There is a golang based GRPC
test client (CLI) which can connect to the GRPC server
- Supports DHCPv4, DHCPv6, Neighbour Discovery, ARP protocols (subset implementations).
- Has IPv4 overlay and IPv6 underlay support. IPv6 overlay support in progress.
- Supports [high-availability](ha/)
1 change: 1 addition & 0 deletions docs/deployment/help_dpservice-bin.md
Original file line number Diff line number Diff line change
Expand Up @@ -26,6 +26,7 @@
| --flow-timeout | SECONDS | inactive flow timeout (except TCP established flows) | |
| --multiport-eswitch | None | run on NIC configured in multiport e-switch mode | |
| --active-lockfile | PATH | file to be locked before starting packet processing | |
| --sync-tap | IFNAME | TAP device to use for dpservice-ha synchronization | |

> This file has been generated by dp_conf_generate.py. As such it should fully reflect the output of `--help`.

30 changes: 30 additions & 0 deletions docs/ha/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
# Dpservice high-availability
Dpservice can be deployed in such a way that killing or crashing the dpservice process does not interrupt the flow of packets (or rather, the interruption is minimal).

![packet_flow_shema](packet_flow.drawio.png "general packet flow schema")


## Active-standby
Dpservice operates in an active/standby mode, similar to `ceph-mgr`, for example. Once the active process is gone, the standby process is promoted to active and any newly created process automatically starts in standby mode.

This is achieved by an exclusive file lock on a shared file, specified by `--active-lockfile`. The kernel guarantees that acquiring the lock is atomic, no polling is needed, and the reaction time is almost instantaneous.
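
The locking behaviour described above can be sketched with `flock(2)`. This is a minimal illustration, not dpservice's actual code: the use of `flock()` (rather than, for instance, `fcntl()` locks) and the helper names `lock_active_file_flags()`, `lock_active_file()` and `try_lock_active_file()` are assumptions.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/* Hypothetical helper: open the lockfile and take an exclusive lock.
 * With extra_flags == 0 the call blocks until the current holder is gone,
 * so a standby process needs no polling.
 * Returns the locked file descriptor, or -1 on error (errno is set). */
int lock_active_file_flags(const char *path, int extra_flags)
{
	int fd;

	assert(path);
	fd = open(path, O_RDWR | O_CREAT, 0600);
	if (fd < 0)
		return -1;

	if (flock(fd, LOCK_EX | extra_flags) < 0) {
		int saved_errno = errno;

		close(fd);
		errno = saved_errno;
		return -1;
	}
	/* Keep fd open for the lifetime of the process: the kernel drops the
	 * lock when the descriptor is closed, which includes a crash. */
	return fd;
}

/* Blocks until this process holds the lock, i.e. becomes active. */
int lock_active_file(const char *path)
{
	return lock_active_file_flags(path, 0);
}

/* Non-blocking variant: fails with EWOULDBLOCK while another holder exists. */
int try_lock_active_file(const char *path)
{
	return lock_active_file_flags(path, LOCK_NB);
}
```

In this sketch a standby process would call `lock_active_file()` and switch to the active role once the call returns. The descriptor is never closed voluntarily.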


## Orchestration
For packet flow not to be interrupted, it is essential for both dpservice processes to use the same underlay addresses. Without this, the new active dpservice would drop current packet flows, as they would not be addressed to the right VNF.

This needs a change in [metalnet](https://github.com/ironcore-dev/metalnet) orchestration. Either it needs to connect to both dpservice instances and handle the situation when a process goes down, or it must exist in two instances, each orchestrating a separate dpservice process.

For this to work, dpservice accepts externally generated underlay addresses as a part of gRPC protocol. This way the address can be generated by metalnet and then simply sent to both instances. See the [example use page](example.md) for details.


## Internal state synchronization
While the above is enough for a basic high-availability scenario, there are still situations where a packet flow would get interrupted. This is caused by the standby process not having MAC address information (thus forcing it to wait for ARP/ND/DHCP) and not having the NAT entries the active instance has.

To implement proper high-availability without (almost) any flow interruption, some data needs to be synchronized between active and standby instances.

This is achieved via a dedicated bridge with a TAP interface assigned to each instance. This way the [graph loop](../sys_design/) can handle synchronization like any other traffic and no special thread or handler is needed.

Dpservice synchronizes NAT entries, virtual service entries, and MAC addresses of VFs. For details, see the [implementation specifics](implementation_specifics.md).

The bridge and two TAP devices are handled by `prepare.sh` and thus by the `initContainer` of the `dp-service` pod.
58 changes: 58 additions & 0 deletions docs/ha/example.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,58 @@
# Example use of dpservice in HA mode


## Preparation
`prepare.sh` only needs to be run once; both dpservice instances will use the resulting config, as they should both be set up the same way.

A special argument, `--sync-bridge`, has been added to facilitate creation of the shared bridge and TAP devices. Currently, only the `--multiport-eswitch` setup has been properly tested, so using it is recommended as well.

> When two dpservice pods are deployed, two init containers with `--sync-bridge` can be spawned. The code is idempotent, so no issue should arise from creating the bridge twice.


## Running dpservice processes
The process itself can be run as usual, with the following required changes:
- EAL argument `--file-prefix` is needed to differentiate DPDK internal state for each process
- EAL argument `--vdev` needs to be used to use previously created (see above) TAP device
- dpservice argument `--sync-tap` needs to be used to give the TAP device name to dpservice itself
- dpservice argument `--active-lockfile` is needed to atomically synchronize process states
- dpservice argument `--grpc-port` is required to differentiate the gRPC endpoint for each process

Example `dpservice-a` process:
```
dpservice-bin -l3,5 --file-prefix=dpservice-a --vdev=net_tap_sync,iface=dps_sync_a,persist -- --sync-tap=dps_sync_a --active-lockfile=/run/dpservice/common/active.lock --grpc-port=1338 --no-offload
```

Example `dpservice-b` process:
```
dpservice-bin -l3,5 --file-prefix=dpservice-b --vdev=net_tap_sync,iface=dps_sync_b,persist -- --sync-tap=dps_sync_b --active-lockfile=/run/dpservice/common/active.lock --grpc-port=1339 --no-offload
```

These processes should automatically take up the role of active and standby based on which one locked the `--active-lockfile` first.

The active process will send any data needed by the standby process over the bridge created earlier.


## Monitoring dpservice
For monitoring, `dpservice-exporter` needs to be run in two instances with `--grpc-port` and `--file-prefix` set accordingly. Alternatively, the `DP_GRPC_PORT` and `DP_FILE_PREFIX` environment variables can be used (helpful in a container shell environment).


## Orchestrating dpservice
To orchestrate these processes, simply use `dpservice-cli` with the proper `--address` argument. Alternatively, the `DP_GRPC_PORT` environment variable can be used (helpful in a container shell environment).

To make sure both dpservices are orchestrated the same way, underlay addresses need to be set externally!

Example with real Mellanox card:
```
# 2 VMs on dpservice-a
dpservice-cli --address localhost:1338 add interface --id test10 --device 0000:03:00.0_representor_c0pf0vf0 --vni 123 --ipv4 192.168.123.10 --ipv6 fe80::10 --underlay fc00::8000:0:10
dpservice-cli --address localhost:1338 add interface --id test11 --device 0000:03:00.0_representor_c0pf0vf1 --vni 123 --ipv4 192.168.123.11 --ipv6 fe80::11 --underlay fc00::8000:0:11
# 2 VMs on dpservice-b
dpservice-cli --address localhost:1339 add interface --id test10 --device 0000:03:00.0_representor_c0pf0vf0 --vni 123 --ipv4 192.168.123.10 --ipv6 fe80::10 --underlay fc00::8000:0:10
dpservice-cli --address localhost:1339 add interface --id test11 --device 0000:03:00.0_representor_c0pf0vf1 --vni 123 --ipv4 192.168.123.11 --ipv6 fe80::11 --underlay fc00::8000:0:11
```

Now connected VMs can communicate, for example one running `iperf -s` and the other `iperf -c 192.168.123.10 -i1 -t300`.

Then even after the active process is killed, communication should still work.

This must also be true for NAT flows, but setting up such an example manually is beyond the scope of this document. Please refer to the [pytest suite](../../test/local/xtratest_ha.py) for more details.
54 changes: 54 additions & 0 deletions docs/ha/implementation_specifics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# Dpservice synchronization specifics
The two instances are connected via a bridge and TAP devices created by `prepare.sh` (init container).

This has been chosen to leverage the graph loop and avoid any threading/locking that would otherwise be required.

For ease of implementation, the communication happens using Ethernet packets with custom payload. Because both instances by definition run on the same machine (and thus the same architecture) and are built the same way, endianness is not enforced; it is simply whatever the process binaries use. Packing is used only for efficiency.

There is a simple message protocol defined in `../../include/dp_sync.h`.
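
To illustrate what "Ethernet packets with custom payload" could look like, here is a hypothetical sketch. The real definitions live in `include/dp_sync.h` and may differ completely: every name below (`SYNC_ETHERTYPE`, `sync_msg_type`, `sync_nat_msg`, `sync_msg_write()`, `sync_msg_read()`) and the field layout are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assumption: an IEEE "local experimental" EtherType marks sync frames. */
#define SYNC_ETHERTYPE 0x88B5

/* Hypothetical message types, following the examples given in the text. */
enum sync_msg_type {
	SYNC_MSG_REQUEST_DUMP = 1, /* standby -> active: "send tables" */
	SYNC_MSG_NAT_CREATE,       /* active -> standby */
	SYNC_MSG_NAT_DELETE,       /* active -> standby */
};

/* Payload that would follow the Ethernet header.
 * Fields stay in host byte order: both peers are the same binary on the
 * same machine, so no byte-order conversion is performed. */
struct sync_nat_msg {
	uint8_t  msg_type; /* one of enum sync_msg_type */
	uint32_t nat_ip;
	uint16_t nat_port;
	uint8_t  l4_type;
} __attribute__((packed));

/* Packing removes padding: 1 + 4 + 2 + 1 bytes. */
static_assert(sizeof(struct sync_nat_msg) == 8, "unexpected padding");

/* Copy a message into a frame's payload area; returns bytes written. */
size_t sync_msg_write(uint8_t *payload, const struct sync_nat_msg *msg)
{
	memcpy(payload, msg, sizeof(*msg));
	return sizeof(*msg);
}

/* Read a message back; memcpy avoids unaligned access into the frame. */
void sync_msg_read(const uint8_t *payload, struct sync_nat_msg *msg)
{
	memcpy(msg, payload, sizeof(*msg));
}
```

Without the packed attribute, this particular layout would typically grow to 12 bytes on common ABIs, which is the kind of saving the text refers to.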


## Roles
Message handling differs based on the state of dpservice. The active one only accepts requests (e.g. "send tables") and the standby one only accepts data synchronization (e.g. "create/delete NAT entry").
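
The role-based filtering can be sketched as a single predicate. The enum and function names here (`dp_role`, `sync_msg_kind`, `sync_msg_accepted()`) are hypothetical and only mirror the rule stated above; they are not taken from the dpservice sources.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical role of the local dpservice process. */
enum dp_role {
	ROLE_STANDBY,
	ROLE_ACTIVE,
};

/* Hypothetical coarse classification of synchronization messages. */
enum sync_msg_kind {
	SYNC_KIND_REQUEST, /* e.g. "send tables" */
	SYNC_KIND_DATA,    /* e.g. "create/delete NAT entry" */
};

/* Returns true if a process in the given role should handle the message:
 * the active process only serves requests, the standby process only
 * consumes synchronized data. */
bool sync_msg_accepted(enum dp_role role, enum sync_msg_kind kind)
{
	assert(role == ROLE_ACTIVE || role == ROLE_STANDBY);

	switch (role) {
	case ROLE_ACTIVE:
		return kind == SYNC_KIND_REQUEST;
	case ROLE_STANDBY:
		return kind == SYNC_KIND_DATA;
	}
	return false;
}
```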

While the standby dpservice does not process packets from PFs/VFs, it does process packets from the synchronization TAP device.

The standby dpservice graph loop is intentionally slowed down by sleeping between iterations. To keep it responsive when synchronization happens, the graph node responsible for handling synchronization messages can read packets multiple times before returning control to the graph loop.
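
A rough sketch of such a "read several times" node follows. It does not use DPDK: `rx_burst_fn` merely stands in for a receive call such as `rte_eth_rx_burst()`, and the constants, the stop condition (stop on a short burst or after a fixed number of reads) and the `fake_*` helpers are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define SYNC_BURST_SIZE 32 /* hypothetical packets per read */
#define SYNC_MAX_READS   8 /* hypothetical reads per loop iteration */

/* Stand-in for a receive call: fills pkts with up to max packets
 * and returns how many were read. */
typedef uint16_t (*rx_burst_fn)(void *ctx, void **pkts, uint16_t max);
typedef void (*handle_fn)(void *ctx, void *pkt);

/* Keep reading while full bursts arrive, up to SYNC_MAX_READS times, so
 * one slow standby iteration can absorb a large table dump.
 * Returns the total number of packets handled. */
unsigned int sync_drain(rx_burst_fn rx, handle_fn handle, void *ctx)
{
	void *pkts[SYNC_BURST_SIZE];
	unsigned int total = 0;

	assert(rx && handle);
	for (int i = 0; i < SYNC_MAX_READS; ++i) {
		uint16_t n = rx(ctx, pkts, SYNC_BURST_SIZE);

		for (uint16_t j = 0; j < n; ++j)
			handle(ctx, pkts[j]);
		total += n;
		if (n < SYNC_BURST_SIZE)
			break; /* queue drained, return to the graph loop */
	}
	return total;
}

/* Fake packet source used only to exercise the sketch. */
struct fake_queue {
	unsigned int pending;
	unsigned int handled;
};

uint16_t fake_rx(void *ctx, void **pkts, uint16_t max)
{
	struct fake_queue *q = ctx;
	uint16_t n = q->pending < max ? (uint16_t)q->pending : max;

	for (uint16_t i = 0; i < n; ++i)
		pkts[i] = q; /* dummy packet pointer */
	q->pending -= n;
	return n;
}

void fake_handle(void *ctx, void *pkt)
{
	struct fake_queue *q = ctx;

	(void)pkt;
	q->handled++;
}
```

The upper bound on reads keeps a single iteration from running indefinitely if the peer keeps sending.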


## Protocol
Both processes start in standby mode. The first one to acquire an exclusive file lock on `--active-lockfile` file becomes the active process.

1. On activation the process takes any pending NAT entries from synchronization and creates flow table entries for them.
2. When the standby process starts, it sends a request to the active one to dump all entries; after that, only updates are sent. This is helpful when only the standby process is restarted (e.g. during updates).
3. When the active process encounters a change in internal state that the standby process requires, the change is sent over to the standby process.
4. When the active process dies (crash or update), Linux automatically releases the exclusive file lock and the standby process automatically acquires it, thus becoming active.
5. Repeat from step 1, the process roles are now swapped.
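
The standby side of steps 1 to 3 can be pictured as a table of pending entries that is filled by synchronization messages and flushed on activation. The sketch below is a simplified assumption, not the dpservice implementation: the `pending_*` names, the fixed-size array and the two-field entry are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define PENDING_MAX 256 /* hypothetical capacity */

/* Hypothetical, heavily simplified synchronized NAT entry. */
struct pending_nat {
	uint32_t nat_ip;
	uint16_t nat_port;
	bool used;
};

struct pending_table {
	struct pending_nat entries[PENDING_MAX];
};

/* Standby: remember an entry announced by the active process. */
bool pending_add(struct pending_table *t, uint32_t nat_ip, uint16_t nat_port)
{
	assert(t);
	for (int i = 0; i < PENDING_MAX; ++i) {
		struct pending_nat *e = &t->entries[i];

		if (!e->used) {
			e->nat_ip = nat_ip;
			e->nat_port = nat_port;
			e->used = true;
			return true;
		}
	}
	return false; /* table full */
}

/* Standby: forget an entry the active process has deleted. */
bool pending_del(struct pending_table *t, uint32_t nat_ip, uint16_t nat_port)
{
	assert(t);
	for (int i = 0; i < PENDING_MAX; ++i) {
		struct pending_nat *e = &t->entries[i];

		if (e->used && e->nat_ip == nat_ip && e->nat_port == nat_port) {
			e->used = false;
			return true;
		}
	}
	return false; /* unknown entry, e.g. its creation message was lost */
}

/* Step 1: on activation, turn every pending entry into a flow.
 * Returns the number of entries installed. */
unsigned int pending_activate(struct pending_table *t)
{
	unsigned int installed = 0;

	assert(t);
	for (int i = 0; i < PENDING_MAX; ++i) {
		struct pending_nat *e = &t->entries[i];

		if (!e->used)
			continue;
		/* A real implementation would create a flow table entry here. */
		printf("install NAT %08x:%u\n", (unsigned int)e->nat_ip, (unsigned int)e->nat_port);
		e->used = false;
		installed++;
	}
	return installed;
}
```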


## Concurrency
Given the atomicity of exclusive file locking and the fact that the lock is **NEVER** released voluntarily, there is no way of an active process becoming a standby one. There should be no situation where there are two active processes and both processes should always know their roles.

When a message arrives over the synchronization TAP interface, messages not applicable to the current role (active/standby) are ignored.

Since the two processes are connected via a bridge and TAP devices, the messages are guaranteed to be in order.

The active process always sends over changes. When requested by a new standby process, it dumps all needed entries using the same messages, thus essentially "just" sending many changes in a burst. This means the protocol is basically stateless.

All the above taken into account, there should be no way apart from dropped packets to arrive at a split-brain situation.


## Losses
It is theoretically possible (but except for the queue overflowing it should be highly improbable) that a synchronization packet does not arrive. This will result in a missed entry creation or deletion. However, packet flows are highly dynamic, so this loss should have no effect after an hour or less.

Missing the "please send all changes" message is worse, but again, this will fix itself over time as the new packet flows will be sent over and the old ones simply time out anyway.


## TAP device configuration
These TAP devices would normally be created by DPDK, i.e. via the `--vdev` EAL parameter. However, due to the bridge requirement, it is preferable to connect them to the bridge only once. This is why `prepare.sh` pre-creates both TAP devices, and why the `--vdev` EAL parameter also needs to contain the `,persist` option.

It is essential that the TAP device is created with `mode tap multi_queue` option, otherwise DPDK refuses to use it.

It is highly recommended to set `txqueuelen` really high (e.g. 100000), because the queue acts as a buffer for situations where many synchronization messages are being sent (e.g. after a restart of the standby process).

It is also beneficial to disable IPv6 on the interfaces and multicast snooping on the bridge, thus eliminating non-dpservice traffic on the connection.
Binary file added docs/ha/packet_flow.drawio.png
8 changes: 8 additions & 0 deletions hack/dp_conf.json
Original file line number Diff line number Diff line change
Expand Up @@ -175,6 +175,14 @@
"var": "active_lockfile",
"type": "char",
"array_size": 256
},
{
"lgopt": "sync-tap",
"arg": "IFNAME",
"help": "TAP device to use for dpservice-ha synchronization",
"var": "sync_tap",
"type": "char",
"array_size": "IF_NAMESIZE"
}
]
}
38 changes: 37 additions & 1 deletion hack/prepare.sh
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,7 @@
#

OPT_MULTIPORT=false
OPT_SYNC_BRIDGE=false

BLUEFIELD_IDENTIFIERS=("MT_0000000543" "MT_0000000541")
MAX_NUMVFS_POSSIBLE=126
Expand Down Expand Up @@ -146,7 +147,7 @@
local pf0="${devs[0]}"

lsmod | grep -q '^vfio_pci' || {
log "vfio-pci module not loaded, loading it"
log "vfio-pci module not loaded, loading it"
modprobe vfio-pci
}

Expand Down Expand Up @@ -351,6 +352,31 @@
fi
}

function create_sync_bridge() {
local sync_bridge="dps_sync_br"
local sync_tap_a="dps_sync_a"
local sync_tap_b="dps_sync_b"

log "Creating shared bridge between dpservice-a and dpservice-b"

ip link show $sync_bridge || ip link add $sync_bridge type bridge
# Prevents unnecessary traffic on the bridge
echo 0 > /sys/class/net/$sync_bridge/bridge/multicast_snooping

for sync_tap in $sync_tap_a $sync_tap_b; do
ip link show $sync_tap || ip tuntap add dev $sync_tap mode tap multi_queue
# Large queue in case the new dpservice is slow in processing
ip link set $sync_tap txqueuelen 100000
ip link set $sync_tap master $sync_bridge

done

for iface in $sync_tap_a $sync_tap_b $sync_bridge; do
# Prevents unnecessary NA/ND traffic
sysctl net.ipv6.conf.$iface.disable_ipv6=1
ip link set $iface up
done
}

# main
CONFIG_EXISTS=false
if [[ -e $CONFIG ]]; then
Expand All @@ -374,6 +400,9 @@
--vfio-bind-only)
VFIO_BIND_ONLY=true
;;
--sync-bridge)
OPT_SYNC_BRIDGE=true
;;
*)
err "Invalid argument $1"
esac
Expand All @@ -395,3 +424,10 @@
fi
create_vf
make_config

# Create shared connection between dpservice-a and dpservice-b
# This has the downside of being called twice (each dpservice will try to do it)
# But the operation is idempotent, so it should not be a problem
if [[ "$OPT_SYNC_BRIDGE" == "true" ]]; then
create_sync_bridge
fi
6 changes: 6 additions & 0 deletions include/dp_cntrack.h
Original file line number Diff line number Diff line change
Expand Up @@ -6,7 +6,9 @@

#include <rte_mbuf.h>

#include "dp_flow.h"
#include "dp_mbuf_dyn.h"
#include "dp_nat.h"

#ifdef __cplusplus
extern "C" {
Expand All @@ -18,6 +20,10 @@ int dp_cntrack_handle(struct rte_mbuf *m, struct dp_flow *df);

void dp_cntrack_flush_cache(void);

int dp_cntrack_from_sync_nat(const struct netnat_portoverload_tbl_key *portoverload_key,
const struct netnat_portoverload_sync_metadata *sync_metadata,
uint64_t timestamp);

#ifdef __cplusplus
}
#endif
4 changes: 3 additions & 1 deletion include/dp_conf.h
Original file line number Diff line number Diff line change
Expand Up @@ -41,12 +41,14 @@ void dp_conf_free(void);
#include "dp_conf_opts.h"

// Custom getters
int dp_conf_is_wcmp_enabled(void);
bool dp_conf_is_wcmp_enabled(void);
const char *dp_conf_get_eal_a_pf0(void);
const char *dp_conf_get_eal_a_pf1(void);
const union dp_ipv6 *dp_conf_get_underlay_ip(void);
const struct dp_conf_dhcp_dns *dp_conf_get_dhcp_dns(void);
const struct dp_conf_dhcp_dns *dp_conf_get_dhcpv6_dns(void);
bool dp_conf_is_tap_mode(void);
bool dp_conf_is_sync_enabled(void);

#ifdef ENABLE_VIRTSVC
const struct dp_conf_virtual_services *dp_conf_get_virtual_services(void);
1 change: 1 addition & 0 deletions include/dp_conf_opts.h
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,7 @@ int dp_conf_get_flow_timeout(void);
#endif
bool dp_conf_is_multiport_eswitch(void);
const char *dp_conf_get_active_lockfile(void);
const char *dp_conf_get_sync_tap(void);

enum dp_conf_runmode {
DP_CONF_RUNMODE_NORMAL, /**< Start normally */
1 change: 1 addition & 0 deletions include/dp_flow.h
Original file line number Diff line number Diff line change
Expand Up @@ -158,6 +158,7 @@ void dp_remove_nat_flows(uint16_t port_id, enum dp_flow_nat_type nat_type);
void dp_remove_neighnat_flows(uint32_t ipv4, uint32_t vni, uint16_t min_port, uint16_t max_port);
void dp_remove_iface_flows(uint16_t port_id, uint32_t ipv4, uint32_t vni);
void dp_remove_lbtarget_flows(const union dp_ipv6 *ul_addr);
void dp_synchronize_local_nat_flows(void);

hash_sig_t dp_get_conntrack_flow_hash_value(const struct flow_key *key);

12 changes: 12 additions & 0 deletions include/dp_ipaddr.h
Original file line number Diff line number Diff line change
Expand Up @@ -169,6 +169,9 @@ int dp_str_to_ipv4(const char *src, uint32_t *dest);
int dp_str_to_ipv6(const char *src, union dp_ipv6 *dest);

void dp_generate_ul_ipv6(union dp_ipv6 *dest, uint8_t addr_type);
#ifdef ENABLE_VIRTSVC
void dp_generate_virtsvc_ul_ipv6(union dp_ipv6 *dest, uint32_t index);
#endif


// structure for holding dual IP addresses
Expand Down Expand Up @@ -221,6 +224,15 @@ void dp_set_ipaddr4(struct dp_ip_address *addr, uint32_t ipv4)
addr->_ipv4 = ipv4;
}

static __rte_always_inline
void dp_set_ipaddr_nat64(struct dp_ip_address *dst, rte_be32_t ipv4)
{
dst->_is_v6 = true;
dst->_ipv6._prefix = dp_nat64_prefix._prefix;
dst->_ipv6._suffix = dp_nat64_prefix._suffix;
dst->_ipv6._nat64.ipv4 = ipv4;
}

int dp_ipaddr_to_str(const struct dp_ip_address *addr, char *dest, int dest_len);
#define DP_IPADDR_TO_STR(ADDR, DST) do { \
static_assert(sizeof(DST) >= INET6_ADDRSTRLEN, "Insufficient buffer size for DP_IPADDR_TO_STR()"); \
17 changes: 16 additions & 1 deletion include/dp_nat.h
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,13 @@ struct netnat_portoverload_tbl_key {
uint8_t l4_type;
} __rte_packed;

struct netnat_portoverload_sync_metadata {
struct netnat_portmap_key portmap_key;
uint16_t created_port_id;
uint16_t icmp_type_src;
rte_be16_t icmp_err_ip_cksum;
};

struct nat_check_result {
bool is_vip_natted;
bool is_network_natted;
Expand Down Expand Up @@ -104,9 +111,17 @@ int dp_add_neighnat_entry(uint32_t nat_ip, uint32_t vni, uint16_t min_port, uin

int dp_del_neighnat_entry(uint32_t nat_ip, uint32_t vni, uint16_t min_port, uint16_t max_port);

int dp_allocate_network_snat_port(struct snat_data *snat_data, struct dp_flow *df, struct dp_port *port);
int dp_allocate_network_snat_port(struct snat_data *snat_data, struct dp_flow *df, struct dp_port *port, rte_be16_t icmp_err_ip_cksum);
int dp_allocate_sync_snat_port(const struct netnat_portmap_key *portmap_key,
struct netnat_portoverload_tbl_key *portoverload_key,
uint16_t created_port_id,
uint16_t icmp_type_src, rte_be16_t icmp_err_ip_cksum);
const union dp_ipv6 *dp_lookup_neighnat_underlay_ip(struct dp_flow *df);
int dp_remove_network_snat_port(const struct flow_value *cntrack);
int dp_remove_sync_snat_port(const struct netnat_portmap_key *portmap_key,
const struct netnat_portoverload_tbl_key *portoverload_key);
int dp_sync_snat_flow(const struct flow_value *flow_val);
int dp_create_sync_snat_flows(void);

int dp_list_nat_local_entries(uint32_t nat_ip, struct dp_grpc_responder *responder);
int dp_list_nat_neigh_entries(uint32_t nat_ip, struct dp_grpc_responder *responder);
9 changes: 7 additions & 2 deletions include/dp_port.h
Original file line number Diff line number Diff line change
Expand Up @@ -113,6 +113,7 @@ struct dp_ports {
extern struct dp_port *_dp_port_table[DP_MAX_PORTS];
extern struct dp_port *_dp_pf_ports[DP_MAX_PF_PORTS];
extern struct dp_ports _dp_ports;
extern struct dp_port _dp_sync_port;


struct dp_port *dp_get_port_by_name(const char *pci_name);
Expand All @@ -123,6 +124,7 @@ void dp_ports_free(void);

int dp_start_port(struct dp_port *port);
int dp_start_pf_port(uint16_t index);
int dp_start_sync_port(void);
int dp_stop_port(struct dp_port *port);

void dp_start_acquiring_neigh_mac(struct dp_port *port);
Expand Down Expand Up @@ -205,11 +207,14 @@ struct dp_port *dp_get_port_by_pf_index(uint16_t index)
}

static __rte_always_inline
bool dp_conf_is_tap_mode(void)
const struct dp_port *dp_get_sync_port(void)
{
return dp_conf_get_nic_type() == DP_CONF_NIC_TYPE_TAP;
return &_dp_sync_port;
}

int dp_set_port_sync_neigh_mac(uint16_t port_id, const struct rte_ether_addr *mac);
void dp_synchronize_port_neigh_macs(void);

#ifdef __cplusplus
}
#endif