Description
What would you like to be added:
When nodemanager encounters an error accessing the local metadata service it panics:
`cloud-provider-azure/pkg/nodemanager/nodemanager.go`, lines 400 to 401 at commit 0ad85b3:

```go
// Instead of just logging the error, panic and node manager can restart
utilruntime.Must(fmt.Errorf("failed to initialize node %s at cloudprovider: %w", node.Name, err))
```
It does this intentionally, so this is technically not a bug: the panic forces the process to exit so that initialization is retried when the node manager restarts.
This is just my supposition, but I suspect it panics because the way the code is structured would make adding a retry here complex. This code is called directly by an informer handler in response to a Node being added, and these handler functions are not typically intended to be used this way. Consequently, there are a number of error paths in this code that are currently only logged and never retried.
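To illustrate the structural problem, here is a minimal stdlib sketch (not the actual client-go informer API; the names `addFunc`, `dispatch`, and `initNode` are mine) of why a handler with this shape cannot propagate errors for retry: the dispatcher sees no return value, so the handler must deal with failure in place, by logging, crashing, or dropping it.

```go
package main

import (
	"fmt"
	"log"
)

// addFunc mirrors the shape of an informer's add handler: it receives
// the object and returns nothing, so the caller cannot observe failure.
type addFunc func(obj string)

// dispatch stands in for the informer delivering events. It has no
// notion of handler error, so a failed initialization cannot be requeued.
func dispatch(events []string, handler addFunc) {
	for _, e := range events {
		handler(e) // fire-and-forget: no error returned, no retry
	}
}

func main() {
	initNode := func(name string) error {
		return fmt.Errorf("metadata service unreachable for %s", name)
	}
	dispatch([]string{"node-0"}, func(name string) {
		if err := initNode(name); err != nil {
			// The only in-place options: log (error is dropped) or crash.
			log.Printf("failed to initialize node %s: %v", name, err)
		}
	})
}
```

Restructuring for retries would mean moving the real work off the handler and onto a workqueue that can requeue failures, which is the larger task alluded to below.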
There is likely a task here to restructure this code to be more robust to errors, but that is not what I'm asking for. I would simply like it to log the error and exit rather than panic.
Why is this needed:
This is really a cosmetic request; the current behaviour is correct. However, panics are ugly and extremely uncommon, so we have a generic task in OpenShift CI that looks for them. They're treated as release blockers by default whenever they occur, so it's a bit of a fuss to convince folks to ignore one.
Separately, as noted I think this code could do with some restructuring.