# NVIDIA Network Operator
NVIDIA Network Operator leverages Kubernetes CRDs and the Operator SDK to manage networking-related components in order to enable fast networking, RDMA and GPUDirect for workloads in a Kubernetes cluster.
The goal of the Network Operator is to manage all networking-related components that enable execution of RDMA and GPUDirect RDMA workloads in a Kubernetes cluster, including:
- Mellanox networking drivers to enable advanced features
- Kubernetes device plugins to provide hardware resources for fast networking
- A Kubernetes secondary network for network-intensive workloads
For more information, please visit the official documentation.
The NVIDIA Network Operator relies on node labeling to get the cluster to the desired state.
Node Feature Discovery v0.13.2 or newer is deployed by default via the Helm chart installation.
NFD is used to label nodes with the following labels:
- PCI vendor and device information
- RDMA capability
- GPU features*
NOTE: We use NodeFeatureRules to label PCI vendor and device. This is enabled via the `nfd.deployNodeFeatureRules` chart parameter.
Example NFD worker configurations:
```yaml
config:
  sources:
    pci:
      deviceClassWhitelist:
        - "0300"
        - "0302"
      deviceLabelFields:
        - vendor
```
* Required for GPUDirect driver container deployment
NOTE: If NFD is already deployed in the cluster, make sure to pass `--set nfd.enabled=false` to the `helm install` command to avoid conflicts. If NFD is deployed from this repo, the `enableNodeFeatureApi` flag is enabled by default to allow creation of NodeFeatureRules.
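For reference, a minimal sketch of the related Helm values, based on the chart parameters mentioned above (the exact values path of the `enableNodeFeatureApi` flag is an assumption):

```yaml
# values.yaml (sketch) -- keep NFD enabled and let it deploy NodeFeatureRules;
# set nfd.enabled to false instead if NFD is already present in the cluster.
nfd:
  enabled: true
  deployNodeFeatureRules: true
  # assumed values path for the NodeFeature API flag discussed above
  enableNodeFeatureApi: true
```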
The Operator acts on the following CRDs:
NicClusterPolicy is a CRD that defines a cluster state for Mellanox network devices.
NOTE: The operator will act on a NicClusterPolicy instance with a predefined name "nic-cluster-policy", instances with different names will be ignored.
NICClusterPolicy CRD Spec includes the following sub-states:
- `ofedDriver`: OFED driver container to be deployed on Mellanox-supporting nodes.
- `rdmaSharedDevicePlugin`: RDMA shared device plugin and related configurations.
- `sriovDevicePlugin`: SR-IOV Network Device Plugin and related configurations.
- `ibKubernetes`: InfiniBand Kubernetes and related configurations.
- `secondaryNetwork`: Specifies components to deploy in order to facilitate a secondary network in Kubernetes. It consists of the following optionally deployed components:
  - Multus-CNI: Delegate CNI plugin to support secondary networks in Kubernetes
  - CNI plugins: Currently only containernetworking-plugins is supported
  - IP over InfiniBand (IPoIB) CNI plugin: Allows users to create an IPoIB child link and move it to the pod
- `nvIpam`: NVIDIA Kubernetes IPAM and related configurations.
- `nicConfigurationOperator`: NVIDIA NIC Configuration Operator and related configuration.
NOTE: Any sub-state may be omitted if it is not required for the cluster.
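As an illustration, a minimal sketch of a NicClusterPolicy defining only two sub-states (all others will be reported as `ignore`); the image names, repositories and versions below are illustrative placeholders rather than recommended values:

```yaml
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  # the operator only acts on this predefined name
  name: nic-cluster-policy
spec:
  # ofedDriver sub-state: deploy the OFED driver container
  ofedDriver:
    image: doca-driver              # illustrative placeholder
    repository: nvcr.io/nvidia/mellanox
    version: 24.04-0.6.6.0-0        # illustrative placeholder
  # rdmaSharedDevicePlugin sub-state: deploy the RDMA shared device plugin
  rdmaSharedDevicePlugin:
    image: k8s-rdma-shared-dev-plugin
    repository: ghcr.io/mellanox
    version: latest                 # illustrative placeholder
    # plugin configuration; resource and interface names are illustrative
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_a",
            "rdmaHcaMax": 63,
            "selectors": { "ifNames": ["ens2f0"] }
          }
        ]
      }
  # all other sub-states omitted
```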
The NICClusterPolicy `status` field reflects the current state of the system.
It contains a per-sub-state and a global state `status`.
The sub-state `status` indicates whether the cluster has transitioned to the desired state for that sub-state, e.g. OFED driver container deployed and loaded on relevant nodes, RDMA device plugin deployed and running on relevant nodes.
The global state reflects the logical AND of each individual sub-state.
```yaml
status:
  appliedStates:
  - name: state-pod-security-policy
    state: ignore
  - name: state-multus-cni
    state: ready
  - name: state-container-networking-plugins
    state: ignore
  - name: state-ipoib-cni
    state: ignore
  - name: state-OFED
    state: ready
  - name: state-SRIOV-device-plugin
    state: ignore
  - name: state-RDMA-device-plugin
    state: ready
  - name: state-ib-kubernetes
    state: ignore
  - name: state-nv-ipam-cni
    state: ready
  state: ready
```
NOTE: An `ignore` state indicates that the sub-state was not defined in the custom resource and is therefore ignored.
This CRD defines a MacVlan secondary network. It is translated by the Operator to a `NetworkAttachmentDefinition` instance as defined in k8snetworkplumbingwg/multi-net-spec.
MacvlanNetwork CRD Spec includes the following fields:
- `networkNamespace`: Namespace for the NetworkAttachmentDefinition related to this MacvlanNetwork CRD.
- `master`: Name of the host interface to enslave. Defaults to the default route interface.
- `mode`: Mode of the interface, one of "bridge", "private", "vepa", "passthru"; default "bridge".
- `mtu`: MTU of the interface. 0 to use the master's MTU.
- `ipam`: IPAM configuration to be used for this network.
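As an illustration, a minimal MacvlanNetwork sketch using the fields above; the object name, host interface and IPAM block are illustrative placeholders:

```yaml
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: example-macvlannetwork   # illustrative name
spec:
  networkNamespace: default
  master: ens2f0                 # illustrative host interface
  mode: bridge
  mtu: 1500
  # IPAM configuration passed through to the generated NetworkAttachmentDefinition
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.2.0/24"
    }
```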
This CRD defines a HostDevice secondary network. It is translated by the Operator to a `NetworkAttachmentDefinition` instance as defined in k8snetworkplumbingwg/multi-net-spec.
HostDeviceNetwork CRD Spec includes the following fields:
- `networkNamespace`: Namespace for the NetworkAttachmentDefinition related to this HostDeviceNetwork CRD.
- `resourceName`: Host device resource pool.
- `ipam`: IPAM configuration to be used for this network.
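Similarly, a minimal HostDeviceNetwork sketch; the object name, resource pool name and IPAM block are illustrative placeholders:

```yaml
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: example-hostdevice-network   # illustrative name
spec:
  networkNamespace: default
  # illustrative; must match a resource pool exposed by the SR-IOV network device plugin
  resourceName: hostdev
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.3.0/24"
    }
```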
This CRD defines an IPoIBNetwork secondary network. It is translated by the Operator to a `NetworkAttachmentDefinition` instance as defined in k8snetworkplumbingwg/multi-net-spec.
IPoIBNetwork CRD Spec includes the following fields:
- `networkNamespace`: Namespace for the NetworkAttachmentDefinition related to this IPoIBNetwork CRD.
- `master`: Name of the host interface to enslave.
- `ipam`: IPAM configuration to be used for this network.
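And a minimal IPoIBNetwork sketch; the object name, InfiniBand host interface and IPAM block are illustrative placeholders:

```yaml
apiVersion: mellanox.com/v1alpha1
kind: IPoIBNetwork
metadata:
  name: example-ipoibnetwork   # illustrative name
spec:
  networkNamespace: default
  master: ibs1f0               # illustrative InfiniBand host interface
  ipam: |
    {
      "type": "whereabouts",
      "range": "192.168.5.0/24"
    }
```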
Sample CRs can be found in the `example/` directory.
- RDMA-capable hardware: Mellanox ConnectX-5 NIC or newer.
- NVIDIA GPU and driver supporting GPUDirect, e.g. Quadro RTX 6000/8000, Tesla T4 or Tesla V100 (GPUDirect only).
- Operating Systems: Ubuntu 20.04 LTS
NOTE: As more driver containers are built, the operator will be able to support additional platforms.
NOTE: ConnectX-6 Lx is not supported.
The following Network Adapters have been tested with NVIDIA Network Operator:
- ConnectX-5
- ConnectX-6 Dx
- NVIDIA Network Operator is compatible with NVIDIA GPU Operator v1.5.2 and above.
- Starting from v465, the NVIDIA GPU driver includes a built-in `nvidia_peermem` module, which replaces the `nv_peer_mem` module. The NVIDIA GPU Operator manages `nvidia_peermem` module loading.
Deployment of NVIDIA Network Operator consists of:
- Deploying NVIDIA Network Operator CRDs found under `./config/crd/bases`:
  - mellanox.com_nicclusterpolicies_crd.yaml
  - mellanox.com_macvlan_crds.yaml
  - k8s.cni.cncf.io-networkattachmentdefinitions-crd.yaml
- Deploying network operator resources by running `make deploy`
- Defining and deploying a NICClusterPolicy custom resource. An example can be found under `./example/crs/mellanox.com_v1alpha1_nicclusterpolicy_cr.yaml`
- Defining and deploying a MacvlanNetwork custom resource. An example can be found under `./example/crs/mellanox.com_v1alpha1_macvlannetwork_cr.yaml`
A deployment example can be found in the example folder.
To build a container image for Network Operator use:
```
make image
```
To build a multi-arch image and publish to a registry use:
```
export REGISTRY=example.com/registry
export IMAGE_NAME=network-operator
export VERSION=v1.1.1
make image-build-multiarch image-push-multiarch
```
Driver containers are essentially containers that have or yield kernel modules compatible with the underlying kernel. An initialization script loads the modules when the container is run (in privileged mode), making them available to the kernel.
While this approach may seem odd, it provides a way to deliver drivers to immutable systems.
The Mellanox OFED driver container supports customization of its behaviour via environment variables. This is regarded as advanced functionality and should generally not be needed.
See MOFED Driver Container Environment Variables for details.
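As a sketch of how such variables can be passed, assuming the `ofedDriver` sub-state accepts a standard Kubernetes `env` list (the variable name below is a hypothetical placeholder; see the referenced documentation for supported variables):

```yaml
# NicClusterPolicy fragment (sketch) -- environment variables for the OFED driver container;
# EXAMPLE_MOFED_ENV_VAR is a hypothetical placeholder, not a documented variable
spec:
  ofedDriver:
    env:
      - name: EXAMPLE_MOFED_ENV_VAR
        value: "1"
```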
Check the Upgrade section in the Helm Chart documentation for details.
In case users would like to use the NVIDIA maintenance operator to manage node operations, e.g. SR-IOV node draining, there is an option to disable the SR-IOV operator's internal drain controller and enable the network operator drain controller, which utilizes the maintenance operator. To do so, the following environment variables are required by the network operator pod:
```
# enable drain controller requestor
DRAIN_CONTROLLER_ENABLED="true"
# drain controller requestor ID to be used in nodeMaintenance objects
DRAIN_CONTROLLER_REQUESTOR_ID="nvidia.network-operator-drain-controller"
# k8s namespace to be used for generated nodeMaintenance objects
DRAIN_CONTROLLER_REQUESTOR_NAMESPACE=default
# k8s namespace used for generated SRIOV node state objects
DRAIN_CONTROLLER_SRIOV_NODE_STATE_NAMESPACE=nvidia-network-operator
```
The above environment variables can be set via values.yaml in case the network-operator is provisioned by Helm:
```yaml
# k8s namespace to be used for created nodeMaintenance objects
nodeMaintenanceNamespace: default
# enable drain controller requestor
useDrainControllerRequestor: false
# drain controller requestor ID to be used in nodeMaintenance objects
drainControllerRequestorID: "nvidia.network-operator-drain-controller"
```
Using requestor mode supports a `shared-requestor` flow where multiple operators can coordinate node maintenance operations.
Assumptions:
- Requestors use the same nodeMaintenance object name
- During `DrainComplete` completion:
  - Non-owning operators remove themselves from the nodeMaintenance `spec.AdditionalRequestors` list using a patch with optimistic lock
- The owning operator handles client-side creation and deletion of the nodeMaintenance object
Note:
- owning operator: the operator that managed to create the `NodeMaintenance` object. For a given `NodeMaintenance` object (whose name is the same for all cooperating operators in shared-requestor mode), it is the operator whose requestor ID is set under `spec.requestorID`.
- non-owning operator: the operator that did not create the `NodeMaintenance` object. For a given `NodeMaintenance` object, it is the operator whose requestor ID is present under `spec.AdditionalRequestors`.
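To make the ownership model concrete, here is a hypothetical sketch of a shared NodeMaintenance object; the apiVersion, object name, namespace, nodeName, field casing and the additional requestor ID are assumptions, only the `spec.requestorID` / `spec.AdditionalRequestors` semantics follow from the description above:

```yaml
# Hypothetical NodeMaintenance object in shared-requestor mode (sketch)
apiVersion: maintenance.nvidia.com/v1alpha1   # assumed API group/version
kind: NodeMaintenance
metadata:
  name: node-maintenance-worker-1     # same name used by all cooperating operators
  namespace: default
spec:
  nodeName: worker-1                  # illustrative node
  # the owning operator: the one that created this object
  requestorID: nvidia.network-operator-drain-controller
  # non-owning operators add/remove themselves here (patch with optimistic lock)
  additionalRequestors:
    - example.other-operator-requestor   # illustrative requestor ID of another operator
```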
In most cases, the Network Operator will be deployed together with the related configurations for the various sub-components it deploys, e.g. the NVIDIA Kubernetes IPAM plugin, the RDMA shared device plugin or the SR-IOV Network device plugin.
Configuration is specified either via Helm values when installing the NVIDIA network operator, or directly when creating a NicClusterPolicy CR. These configurations eventually trigger the creation of a ConfigMap object in K8s.
Note: It is the responsibility of the user to delete any existing configurations (ConfigMaps) if they were already created by the Network Operator, as well as to delete their own configuration when it is no longer required.
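For illustration, a hedged sketch of passing sub-component configuration via Helm values instead of directly in the NicClusterPolicy CR; the key names below are assumptions about the chart layout, not verified chart values:

```yaml
# values.yaml fragment (sketch) -- key names are assumed, not verified against the chart
rdmaSharedDevicePlugin:
  deploy: true
  resources:
    - name: rdma_shared_device_a   # illustrative resource name
      ifNames: [ens2f0]            # illustrative interface selector
```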