zebra: nhg code selection does not check invalid status #20127
mjstapp
left a comment
can you say some more about the cases where this would happen - what's the order of events that get zebra into this condition?
Sure, here is more detail on what was happening.
1) When a nexthop becomes invalid (e.g., interface down), NHGs are NOT deleted immediately. Instead, they are marked as INVALID (flags=0x0) and cleaned up later.
2) If zebra processes a route (rib_process) during that window, it can pick up the invalid NHG, and the dataplane install enqueue fails.
Note: this is a rare corner case that is not hit frequently, but the check makes sure zebra selects a VALID NHG during such flap events, so the installation succeeds and the issue is avoided.
Please let me know if my understanding or approach is wrong, or if more details are needed. Thanks.
mjstapp
left a comment
thanks for the explanation - that makes sense to me.
Without this fix, what state are we in?
Without this fix, we see "Failed to enqueue dataplane install" errors in zebra, and OSPF routes are not installed on some nodes. This causes traffic forwarding issues (specifically, ping failures) after an interface down/up trigger.
ci:rerun
e04ab3f to
16eeea1
Compare
Problem description:
1) In some corner cases, zebra hit “Failed to enqueue dataplane install” for routes because the NHG in use was INVALID.
2) The NHG selection logic checks whether each nexthop entry is valid, but it does not check whether the NHG itself is valid, which in turn causes route install failures in zebra.
Fix:
Check the NHG's own validity during selection.
Signed-off-by: harini <[email protected]>
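To make the described gap concrete, here is a minimal, self-contained C sketch of the idea. It is not the patch from this PR and does not use zebra's real types: `struct nhg_sketch`, `struct nexthop_sketch`, the `NHG_FLAG_*` bits, and every function name below are hypothetical stand-ins chosen for illustration. The only detail taken from the discussion is that an invalid group lingers with flags=0x0 until it is cleaned up.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; flags == 0x0 means "invalid". */
#define NHG_FLAG_VALID     (1u << 0)
#define NHG_FLAG_INSTALLED (1u << 1)

/* Hypothetical stand-in for a single nexthop entry. */
struct nexthop_sketch {
	bool active;
};

/* Hypothetical stand-in for a nexthop group (NHG). */
struct nhg_sketch {
	uint32_t id;
	uint32_t flags;
	const struct nexthop_sketch *nexthops;
	size_t nexthop_count;
};

/* True only if the group has at least one nexthop and all are active. */
static bool nhg_all_nexthops_active(const struct nhg_sketch *nhg)
{
	if (nhg->nexthop_count == 0)
		return false;
	for (size_t i = 0; i < nhg->nexthop_count; i++) {
		if (!nhg->nexthops[i].active)
			return false;
	}
	return true;
}

/* Behaviour described as the problem: only the member entries are checked. */
static bool nhg_usable_before_fix(const struct nhg_sketch *nhg)
{
	return nhg != NULL && nhg_all_nexthops_active(nhg);
}

/* Behaviour described as the fix: the group's own VALID flag is checked too. */
static bool nhg_usable_after_fix(const struct nhg_sketch *nhg)
{
	return nhg != NULL && (nhg->flags & NHG_FLAG_VALID) != 0 &&
	       nhg_all_nexthops_active(nhg);
}

/* Return the first candidate that passes the stricter check, or NULL. */
static const struct nhg_sketch *
nhg_select(const struct nhg_sketch *const *candidates, size_t count)
{
	for (size_t i = 0; i < count; i++) {
		if (nhg_usable_after_fix(candidates[i]))
			return candidates[i];
	}
	return NULL;
}
```

In this sketch, a group left over from an interface flap with flags=0x0 passes the entry-only check but is rejected by the stricter one, so selection either falls through to a valid group or returns NULL rather than handing a stale group to the install path.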