Add support for distributed SR-SIM #239

Merged
carlmontanari merged 7 commits into srl-labs:main from bayars:sr_sim_support
Jan 27, 2026

bayars commented Jan 25, 2026

Hey!

First of all, I haven't fully tested this yet. I still need to test with an SR-SIM image and configure the nodes, and I will let you know once that is done. I am opening this MR to discuss my implementation approach.

This MR adds support for deploying distributed chassis-based systems (like Nokia SR-SIM SR-7, SR-14s) that require multiple containers sharing a network namespace via Docker's network-mode: container: directive.

In containerlab (Docker environment), distributed chassis systems like Nokia SR-SIM SR-7 require multiple containers (CPM-A, CPM-B, IOM slots) that share the same Linux network namespace.

The current implementation does not work for these systems: clabernetes deploys each node in its own Pod, which breaks the distributed SR-SIM structure:

  1. Kubernetes Pods have isolated network namespaces
  2. network-mode: container: only works within the same Pod
  3. Containers cannot share a network namespace across Pod boundaries

This MR implements automatic node grouping based on network-mode directives:

  1. Detection: Parse network-mode: container: to identify groups
  2. Grouping: Nodes referencing the same primary are grouped together
  3. Single Sub-Topology: All grouped nodes are placed in one containerlab sub-topology
  4. Single Pod: The launcher pod runs containerlab with all grouped nodes, enabling network namespace sharing
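
A minimal sketch of the detection and grouping steps above. It assumes each node's network-mode value is already available as a plain string; `groupByNetworkMode` is a hypothetical helper for illustration, not the function added in this MR.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// containerPrefix marks a network-mode value that points at another node.
const containerPrefix = "container:"

// groupByNetworkMode maps each primary node name to the sorted names of the
// nodes that reference it via "container:<primary>". The input maps a node
// name to its network-mode value ("" when unset).
// Hypothetical helper, not the MR's implementation.
func groupByNetworkMode(networkModes map[string]string) map[string][]string {
	groups := map[string][]string{}

	for nodeName, mode := range networkModes {
		if !strings.HasPrefix(mode, containerPrefix) {
			continue // not a "container:" reference, node stays on its own
		}

		primary := strings.TrimPrefix(mode, containerPrefix)
		if primary == "" {
			continue // malformed value, ignored in this sketch
		}

		groups[primary] = append(groups[primary], nodeName)
	}

	// Sort members so the result does not depend on map iteration order.
	for primary := range groups {
		sort.Strings(groups[primary])
	}

	return groups
}

func main() {
	groups := groupByNetworkMode(map[string]string{
		"srsim-a":         "",
		"srsim-b":         "container:srsim-a",
		"srsim-iom1":      "container:srsim-a",
		"external-router": "",
	})

	// The secondaries grouped under srsim-a.
	fmt.Println(groups["srsim-a"])
}
```

Every primary that appears as a key, together with its members, would then become one sub-topology and one launcher Pod.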

VXLAN Service Creation

Clabernetes uses VXLAN tunnels to connect nodes across different Pods. With this change:

  1. Link between nodes in the same group: Local containerlab link (no VXLAN)
  2. Link between a node in a group and an external node: VXLAN tunnel to the external node's service
  3. VXLAN service name: Uses the primary node's name (e.g., topology-srsim-a-vx)

Example:
links:
- endpoints: ["srsim-a:1/1/c1/1", "srsim-b:1/1/c1/1"] # Local link (same group)
- endpoints: ["srsim-iom1:1/1/c2/1", "external-router:e1-1"] # VXLAN tunnel

The tunnel from external-router to srsim-iom1 resolves to srsim-a's VXLAN service since srsim-iom1 is grouped with srsim-a.
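
The resolution rule can be sketched roughly as follows. The `secondaryNodes` map (secondary name to primary name) mirrors the parameter of the same name in this MR, but the helper functions are hypothetical, and the `<topology>-<primary>-vx` pattern is only inferred from the example name above.

```go
package main

import "fmt"

// resolvePrimary returns the primary a node is grouped under, or the node
// itself when it is not a secondary. secondaryNodes maps secondary -> primary.
// Hypothetical helper for illustration.
func resolvePrimary(nodeName string, secondaryNodes map[string]string) string {
	if primary, ok := secondaryNodes[nodeName]; ok {
		return primary
	}

	return nodeName
}

// isLocalLink reports whether both link ends resolve to the same primary,
// meaning they share a Pod and need no VXLAN tunnel.
func isLocalLink(nodeA, nodeB string, secondaryNodes map[string]string) bool {
	return resolvePrimary(nodeA, secondaryNodes) == resolvePrimary(nodeB, secondaryNodes)
}

// vxlanServiceName builds the service name a tunnel towards remoteNode should
// target. The naming pattern is an assumption based on the example name.
func vxlanServiceName(topologyName, remoteNode string, secondaryNodes map[string]string) string {
	return fmt.Sprintf("%s-%s-vx", topologyName, resolvePrimary(remoteNode, secondaryNodes))
}

func main() {
	secondaryNodes := map[string]string{
		"srsim-b":    "srsim-a",
		"srsim-iom1": "srsim-a",
	}

	fmt.Println(isLocalLink("srsim-a", "srsim-b", secondaryNodes))
	fmt.Println(isLocalLink("srsim-iom1", "external-router", secondaryNodes))
	fmt.Println(vxlanServiceName("topology", "srsim-iom1", secondaryNodes))
}
```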

Limitations:

  1. All grouped nodes run on the same Kubernetes node: Since they're in the same Pod, they cannot be distributed across cluster nodes. This may impact resource availability for large chassis emulations.
  2. Resource limits apply to the entire group: CPU/memory limits are set on the Pod (primary node name), shared by all containers in the group.
  3. Single point of failure: If the launcher Pod crashes, all grouped nodes go down together.
  4. No independent scaling: Secondary nodes cannot be scaled independently of the primary.
  5. Circular dependencies not supported: network-mode references must form a tree (secondaries pointing to one primary), not cycles.
  6. Primary node must exist: If network-mode: container:foo references a node foo that doesn't exist in the topology, the behavior is undefined.
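
Limitations 5 and 6 could be caught up front with a check along these lines. This is only a sketch of one possible pre-flight validation and is not part of this MR; `validateSecondaries` and its inputs are hypothetical. It is also stricter than a general tree, since it rejects any primary that is itself a secondary.

```go
package main

import (
	"fmt"
	"sort"
)

// validateSecondaries checks a secondary -> primary map against the list of
// node names in the topology. It returns an error when a primary does not
// exist, or when a primary is itself a secondary (a chain, a cycle, or a
// self-reference). Hypothetical helper, not part of the MR.
func validateSecondaries(nodeNames []string, secondaryNodes map[string]string) error {
	known := make(map[string]struct{}, len(nodeNames))
	for _, name := range nodeNames {
		known[name] = struct{}{}
	}

	// Iterate in sorted order so the first reported error is deterministic.
	secondaries := make([]string, 0, len(secondaryNodes))
	for secondary := range secondaryNodes {
		secondaries = append(secondaries, secondary)
	}
	sort.Strings(secondaries)

	for _, secondary := range secondaries {
		primary := secondaryNodes[secondary]

		if _, ok := known[primary]; !ok {
			return fmt.Errorf("node %q references unknown primary %q", secondary, primary)
		}

		if _, ok := secondaryNodes[primary]; ok {
			return fmt.Errorf("node %q references %q, which is itself a secondary", secondary, primary)
		}
	}

	return nil
}

func main() {
	nodes := []string{"srsim-a", "srsim-b", "srsim-iom1"}

	// Valid: both secondaries point at an existing primary.
	fmt.Println(validateSecondaries(nodes, map[string]string{
		"srsim-b":    "srsim-a",
		"srsim-iom1": "srsim-a",
	}))

	// Invalid: the referenced primary is not in the topology.
	fmt.Println(validateSecondaries(nodes, map[string]string{"srsim-b": "foo"}))
}
```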

@bayars bayars marked this pull request as draft January 25, 2026 02:18
@bayars bayars marked this pull request as ready for review January 25, 2026 17:04
Comment on lines +571 to 583
// For grouped nodes (distributed systems like SR-SIM), it creates a single sub-topology
// containing all nodes in the group so they can be deployed in the same pod and share
// the network namespace.
// The secondaryNodes map is used to resolve tunnel destinations - if a remote node is a
// secondary, the tunnel should point to its primary's service instead.
func (p *containerlabDefinitionProcessor) processConfigForNodeGroup(
	containerlabConfig *clabernetesutilcontainerlab.Config,
	nodeName string,
	primaryNodeName string,
	group *nodeGroup,
	secondaryNodes map[string]string,
	defaultsYAML []byte,
	removeTopologyPrefix bool,
) error {
Contributor Author

Originally, I added all of my changes inside this function, but the lint failed with:

controllers/topology/definitioncontainerlab.go:423:1: cognitive complexity 34 of func `(*containerlabDefinitionProcessor).processConfigForNodeGroup` is high (> 30) (gocognit)

So I split the functionality into separate functions.

I also noticed that some of the existing functions are similar to each other. I might do more refactoring along these lines.

Contributor

yeah, this stuff has gotten a bit out of hand (not you, just in general I mean!) -- I think the general flow in this mr looks chill to me though!

bayars commented Jan 25, 2026

I have now tested this with all of the sr-sim labs in containerlab. The clabverter seems to work with it too.

Let me know if you have more corner cases in mind or would prefer a different approach. I can take another look at the code.

carlmontanari left a comment

LGTM, @hellt you wanna do a quick looksie too before we merge since this is a pretty big one?!

func buildGroupNodesList(
	primaryNodeName string,
	group *nodeGroup,
) (groupNodeNames []string, groupNodesSet map[string]struct{}) {
Contributor

we've got a basic set implementation in util already. may shave some lines and just be more consistent. https://github.com/bayars/clabernetes/blob/853328b8b1d42b14cbca76fa5d714be77e1e299b/util/sets.go#L14

Contributor Author

Hey, thank you

I updated to use util.StringSet instead of map[string]struct{}.

I also changed buildGroupNodesList to use NewStringSetWithValues(groupNodeNames...) which replaces the manual map creation and loop with a single line.
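
To illustrate the kind of simplification described here, the sketch below uses a small stand-in set type. It is not the clabernetes util package: `stringSet`, `newStringSetWithValues`, and `contains` are hypothetical names that only loosely mirror the `util.StringSet` and `NewStringSetWithValues` mentioned above.

```go
package main

import "fmt"

// stringSet is a stand-in for a basic string set, defined locally so the
// example is self-contained. It is not the util.StringSet type.
type stringSet map[string]struct{}

// newStringSetWithValues builds a set from the given values in one call,
// replacing a manual "make the map, then loop and insert" sequence.
func newStringSetWithValues(values ...string) stringSet {
	set := make(stringSet, len(values))
	for _, value := range values {
		set[value] = struct{}{}
	}

	return set
}

// contains reports whether value is a member of the set.
func (s stringSet) contains(value string) bool {
	_, ok := s[value]

	return ok
}

func main() {
	groupNodeNames := []string{"srsim-a", "srsim-b", "srsim-iom1"}

	// One line instead of the manual map creation and loop.
	groupNodesSet := newStringSetWithValues(groupNodeNames...)

	fmt.Println(groupNodesSet.contains("srsim-iom1"))
	fmt.Println(groupNodesSet.contains("external-router"))
}
```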

@carlmontanari

LGTM, thanks for all the work @bayars 🔥

@carlmontanari carlmontanari merged commit 8fe093b into srl-labs:main Jan 27, 2026
4 checks passed