@hnattamaisub

Problem description:
1) In some corner cases, zebra reported “Failed to enqueue dataplane install” for routes whose NHG was INVALID.
2) Our NHG logic checks whether all nexthop entries are valid, but it does not check whether the NHG itself is valid, which in turn causes route install failures in zebra.

Fix:
Updated the NHG logic to also check the validity of the NHG itself during selection.
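
To illustrate the gap described in the problem statement, here is a minimal sketch using simplified stand-in types. Everything named `sketch_*` or `SKETCH_*` is hypothetical and is not zebra's actual structure, flag, or function; the sketch also assumes that the per-nexthop check passes when at least one nexthop is active. It only shows why a per-nexthop check can pass while the group-level flag says the NHG is unusable.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, defined for this sketch only. */
#define SKETCH_NHG_VALID      (1u << 0)
#define SKETCH_NEXTHOP_ACTIVE (1u << 0)

struct sketch_nexthop {
	uint32_t flags;
};

struct sketch_nhg {
	uint32_t flags; /* 0x0 models an INVALID group */
	size_t nexthop_count;
	const struct sketch_nexthop *nexthops;
};

/* Per-nexthop check only: true if any member nexthop is active. */
bool sketch_nhg_has_active_nexthop(const struct sketch_nhg *nhg)
{
	for (size_t i = 0; i < nhg->nexthop_count; i++) {
		if (nhg->nexthops[i].flags & SKETCH_NEXTHOP_ACTIVE)
			return true;
	}
	return false;
}

/* Combined check: the group itself must be valid as well. */
bool sketch_nhg_usable(const struct sketch_nhg *nhg)
{
	if (nhg == NULL)
		return false;
	if (!(nhg->flags & SKETCH_NHG_VALID))
		return false;
	return sketch_nhg_has_active_nexthop(nhg);
}
```

With a group whose flags are 0x0 but whose nexthops still carry a stale active bit, the first function returns true while the second returns false, which is the case the fix is meant to catch.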

Contributor

@mjstapp mjstapp left a comment

can you say some more about the cases where this would happen - what's the order of events that get zebra into this condition?

@hnattamaisub
Author

can you say some more about the cases where this would happen - what's the order of events that get zebra into this condition?

Sure, I'll explain what was happening.
Trigger: multiple sequential interface flaps between VRRP nodes in a multi-node topology running VRRP and OSPF.
Issue: OSPF routes between some nodes were affected, and pings failed because the routes were lost.

1) When a nexthop becomes invalid (e.g., interface down), NHGs are NOT deleted immediately. Instead, they are marked as INVALID (flags=0x0) and cleaned up later.
During route churn, zebra's RIB may temporarily hold multiple route entries for the same prefix (some with valid NHGs, some with invalid NHGs). This is EXPECTED behavior given our code design.

2) If zebra processes the route (rib_process) during that window and uses an entry with an invalid NHG, the dataplane install enqueue fails.

Note: this is a rare corner case that is not hit frequently, but the check makes sure zebra selects a VALID NHG during such flap events, so the installation succeeds and the issue scenario is avoided.
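
The sequence above can be sketched as follows. This is a simplified model, not zebra's real selection code: the `demo_*` types and functions and the `DEMO_NHG_VALID` bit are hypothetical, and selection here is reduced to "lowest distance wins, first entry on ties". It only shows how skipping entries with an invalid NHG lets another entry for the same prefix be chosen.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_NHG_VALID (1u << 0) /* hypothetical flag bit */

struct demo_nhg {
	uint32_t id;
	uint32_t flags; /* 0x0 models the INVALID state */
};

struct demo_route_entry {
	const struct demo_nhg *nhg;
	uint8_t distance;
};

/* Models an interface-down event: the NHG is kept but its flag is cleared. */
void demo_nhg_mark_invalid(struct demo_nhg *nhg)
{
	nhg->flags &= ~DEMO_NHG_VALID;
}

/*
 * Pick a route entry for one prefix, skipping entries whose NHG is
 * missing or invalid. Returns NULL when no usable entry exists.
 */
const struct demo_route_entry *
demo_select(const struct demo_route_entry *entries, size_t count)
{
	const struct demo_route_entry *best = NULL;

	for (size_t i = 0; i < count; i++) {
		const struct demo_route_entry *re = &entries[i];

		if (re->nhg == NULL || !(re->nhg->flags & DEMO_NHG_VALID))
			continue; /* invalid NHG: never a candidate */
		if (best == NULL || re->distance < best->distance)
			best = re;
	}
	return best;
}
```

In this model, once the first entry's NHG is marked invalid the second entry is selected instead, and when every NHG is invalid nothing is selected rather than an install being attempted with an unusable group.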

Please let me know if my understanding or approach is wrong, or if more details are needed. Thanks.

Contributor

@mjstapp mjstapp left a comment

thanks for the explanation - that makes sense to me.

@donaldsharp
Member

without this fix what state are we in?

@hnattamaisub
Author

without this fix what state are we in?

Without this fix, zebra reports "Failed to enqueue dataplane install" errors and OSPF routes are not installed on some nodes, which breaks traffic forwarding after an interface down/up trigger (specifically, pings fail).

@hnattamaisub
Author

ci:rerun

@hnattamaisub hnattamaisub force-pushed the vrrp branch 2 times, most recently from e04ab3f to 16eeea1 Compare November 27, 2025 07:55
Problem description:
1) In some corner cases, zebra reported “Failed to enqueue dataplane
install” for routes whose NHG was INVALID.
2) Our NHG logic checks whether all nexthop entries are valid, but it
does not check whether the NHG itself is valid, which in turn causes
route install failures in zebra.

Fix:
Updated the NHG logic to also check the validity of the NHG itself
during selection.

Signed-off-by: harini <[email protected]>