feat: add nginx image version validation during agent connections #3928
base: main
Conversation
Implement version validation by ensuring the pod image matches the image in the deployment/daemonset spec to prevent configuration from being sent to nginx data plane pods still running the previous image version during upgrades.
Codecov Report: ❌ Patch coverage is …
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3928      +/-   ##
==========================================
- Coverage   86.79%   86.63%   -0.16%
==========================================
  Files         128      128
  Lines       16503    16593      +90
  Branches       62       62
==========================================
+ Hits        14323    14375      +52
- Misses       2000     2028      +28
- Partials      180      190      +10

☔ View full report in Codecov by Sentry.
@@ -184,6 +192,13 @@ func (cs *commandService) Subscribe(in pb.CommandService_SubscribeServer) error
		}

		cs.logger.V(1).Info("Sending configuration to agent", "requestType", msg.Type)

		if err := cs.validateAndHandleVersionMismatch(conn.PodName, conn.Parent); err != nil {
By handling the error above, outside of the loop, is it even possible to hit this?
@@ -264,6 +279,11 @@ func (cs *commandService) setInitialConfig(
	deployment.FileLock.Lock()
	defer deployment.FileLock.Unlock()

	if err := cs.validateAndHandleVersionMismatch(conn.PodName, conn.Parent); err != nil {
Same question, since this gets called above before setInitialConfig.
// validatePodImageVersion checks if the pod's nginx container image version matches the expected version
// from its deployment. Returns an error if versions don't match.
func (cs *commandService) validatePodImageVersion(podName string, deploymentNSName types.NamespacedName) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
Let's reuse the context from the parent call instead of Background. Also, 30s seems like a long time for a timeout. I think elsewhere we've used 10s (though it's also a bit arbitrary).
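A minimal sketch of that change, assuming the parent call's context gets threaded through as a parameter (the exact plumbing is up to this PR):

// Hypothetical revision: derive the deadline from the caller's context
// instead of context.Background(), and shorten the timeout to 10s.
func (cs *commandService) validatePodImageVersion(
	ctx context.Context,
	podName string,
	deploymentNSName types.NamespacedName,
) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()

	// ... same pod lookup and image comparison as in the PR,
	// now bounded by the caller's lifetime as well as the deadline ...
	return nil
}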
	// Get all pods and find the one with the matching name
	var pods v1.PodList
	if err := cs.k8sReader.List(ctx, &pods); err != nil {
We're already doing all of this when we call getPodOwner, so couldn't we just grab the image at the same time that we're doing that?
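A rough sketch of that suggestion; getPodOwner's real signature isn't visible in this diff, so the combined helper below is hypothetical (including the assumption that the container is named "nginx"):

// Hypothetical combined lookup: walk the pod list once and return both
// the pod's owner and the image of its nginx container.
func (cs *commandService) getPodOwnerAndImage(
	ctx context.Context,
	podName string,
) (types.NamespacedName, string, error) {
	var pods v1.PodList
	if err := cs.k8sReader.List(ctx, &pods); err != nil {
		return types.NamespacedName{}, "", fmt.Errorf("listing pods: %w", err)
	}

	for i := range pods.Items {
		pod := &pods.Items[i]
		if pod.Name != podName {
			continue
		}

		if len(pod.OwnerReferences) == 0 {
			return types.NamespacedName{}, "", fmt.Errorf("pod %q has no owner", podName)
		}
		owner := types.NamespacedName{
			Namespace: pod.Namespace,
			Name:      pod.OwnerReferences[0].Name,
		}

		for _, c := range pod.Spec.Containers {
			if c.Name == "nginx" { // container name is an assumption
				return owner, c.Image, nil
			}
		}
		return owner, "", fmt.Errorf("pod %q has no nginx container", podName)
	}

	return types.NamespacedName{}, "", fmt.Errorf("pod %q not found", podName)
}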
	return nil
}

// getExpectedNginxImage retrieves the expected nginx container image from the deployment or daemonset.
This doesn't seem right, because the deployment won't be updated yet with the new image name until the control plane patches it, which happens after that initial config is sent. We should be checking the image value in the NginxProxy resource, and if that's empty, using our default value.
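For illustration, a sketch of that fallback; the field layout here is simplified and hypothetical, not the real NginxProxy API:

// Hypothetical, simplified image fields; the actual NginxProxy spec may
// nest these differently.
type nginxImageSpec struct {
	Repository string
	Tag        string
}

// expectedNginxImage prefers the image configured on the NginxProxy
// resource and falls back to the control plane's default when unset.
func expectedNginxImage(img *nginxImageSpec, defaultImage string) string {
	if img == nil || img.Repository == "" {
		return defaultImage
	}
	tag := img.Tag
	if tag == "" {
		tag = "latest" // assumed default tag when only a repository is set
	}
	return img.Repository + ":" + tag
}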
Oh right, yeah, I wasn't thinking - I forgot that the existing nginx pods will try to connect before the control plane gets to reprovisioning; I was only thinking about the deployment itself scaling.
In that case, I'm not really sure how to approach this. It feels like a chicken-and-egg scenario - the agent in the existing running NGINX pods is potentially going to register before we've even processed the NginxProxy resource, right?
Maybe we could store the EffectiveNginxProxy for each Gateway on its associated Deployment object in the DeploymentStore. Then when we get an agent connection, do a lookup on the store for its config.
We already do a lookup into the deployment store in Subscribe(), so we're already waiting until that's been populated by the controller before we send anything to agent.
If there is no EffectiveNginxProxy set when we get the Deployment, then we assume the default value.
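A sketch of that proposal with invented names (the real DeploymentStore types and the EffectiveNginxProxy fields aren't shown in this diff):

// Hypothetical shape for the proposal: the controller records the
// Gateway's effective NginxProxy image on the stored Deployment entry,
// and Subscribe reads it back when an agent connects.
type storedDeployment struct {
	// effectiveNginxImage is empty until the controller has processed
	// the Gateway's EffectiveNginxProxy.
	effectiveNginxImage string
}

// expectedImage resolves what the pods should be running: the image
// captured from the EffectiveNginxProxy, otherwise the default.
func (d *storedDeployment) expectedImage(defaultImage string) string {
	if d.effectiveNginxImage == "" {
		return defaultImage
	}
	return d.effectiveNginxImage
}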
Proposed changes
Problem: During an NGF upgrade, the new version of the control plane will send a configuration to the old version of the nginx data plane, before the nginx data plane is updated to the new version. This can cause incompatibility issues for a brief period, which could lead to disruptions.
Solution: Implement version validation by ensuring the pod image matches the image in the current deployment/daemonset spec to prevent configuration from being sent to nginx data plane pods still running the previous image version during upgrades.
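At its core the validation is a straight image comparison; a minimal sketch with illustrative names:

// validateImageVersion refuses config pushes while the pod still runs
// an image other than the expected one.
func validateImageVersion(podImage, expectedImage string) error {
	if podImage != expectedImage {
		return fmt.Errorf(
			"pod image %q does not match expected image %q; skipping config until the pod is upgraded",
			podImage, expectedImage,
		)
	}
	return nil
}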
Testing: Manually tested upgrading in a cluster and verified that we don't send config to pods still running the previous image version.
Closes #3867
Checklist
Before creating a PR, run through this checklist and mark each as complete.
Release notes
If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.