
Conversation

ciarams87
Contributor

Proposed changes

Problem: During an NGF upgrade, the new version of the control plane sends configuration to the old version of the nginx data plane before the data plane has been updated to the new version. This can cause incompatibility issues for a brief period, which could lead to disruptions.

Solution: Implement version validation by ensuring the pod's image matches the image in the current deployment/daemonset spec, so that configuration is not sent to nginx data plane pods still running the previous image version during upgrades.
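Roughly, the intended check boils down to the comparison sketched below. This is an illustration only; the PR's actual validateAndHandleVersionMismatch resolves the pod image and the expected image via the Kubernetes API, which is omitted here, and the helper name is hypothetical.

```go
package agent

import "fmt"

// checkNginxImageVersion is an illustrative sketch of the proposed gate: if
// the connecting pod's nginx image does not match the image recorded in its
// Deployment/DaemonSet spec, return an error so no config is sent to it yet.
// The name and signature are hypothetical; the PR resolves both values from
// the cluster before making this comparison.
func checkNginxImageVersion(podImage, expectedImage string) error {
	if podImage != expectedImage {
		return fmt.Errorf(
			"nginx image mismatch: pod is running %q but %q is expected; waiting for the pod to be upgraded",
			podImage, expectedImage,
		)
	}
	return nil
}
```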

Testing: Manually tested an upgrade in a cluster and verified that we don't send config to pods still running the previous image version.

Closes #3867

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

Added nginx image version validation during agent connections to prevent newer configuration from being sent to pods still running previous image versions during upgrades.

Implement version validation by ensuring the pod image matches the
image in the deployment/daemonset spec to prevent configuration
from being sent to nginx data plane pods still running the previous
image version during upgrades.
@ciarams87 ciarams87 marked this pull request as ready for review September 17, 2025 09:01
@ciarams87 ciarams87 requested a review from a team as a code owner September 17, 2025 09:01
@github-actions github-actions bot added the bug Something isn't working label Sep 17, 2025

codecov bot commented Sep 17, 2025

Codecov Report

❌ Patch coverage is 60.00000% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.63%. Comparing base (84a517f) to head (5cbcc17).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/controller/nginx/agent/command.go | 60.00% | 26 Missing and 10 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3928      +/-   ##
==========================================
- Coverage   86.79%   86.63%   -0.16%     
==========================================
  Files         128      128              
  Lines       16503    16593      +90     
  Branches       62       62              
==========================================
+ Hits        14323    14375      +52     
- Misses       2000     2028      +28     
- Partials      180      190      +10     


@@ -184,6 +192,13 @@ func (cs *commandService) Subscribe(in pb.CommandService_SubscribeServer) error
}

cs.logger.V(1).Info("Sending configuration to agent", "requestType", msg.Type)

if err := cs.validateAndHandleVersionMismatch(conn.PodName, conn.Parent); err != nil {
Collaborator


By handling the error above, outside of the loop, is it even possible to hit this?

@@ -264,6 +279,11 @@ func (cs *commandService) setInitialConfig(
deployment.FileLock.Lock()
defer deployment.FileLock.Unlock()

if err := cs.validateAndHandleVersionMismatch(conn.PodName, conn.Parent); err != nil {
Collaborator


Same question, since this gets called above before setInitialConfig.

// validatePodImageVersion checks if the pod's nginx container image version matches the expected version
// from its deployment. Returns an error if versions don't match.
func (cs *commandService) validatePodImageVersion(podName string, deploymentNSName types.NamespacedName) error {
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
Collaborator


Let's reuse the context from the parent call instead of Background. Also, 30s seems like a long time for a timeout. I think elsewhere we've used 10s (though it's also a bit arbitrary).
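Something along these lines, assuming the parent call's context is plumbed through to this helper (the 10s value is only the reviewer's suggestion, not an established constant):

```go
// Sketch of the suggestion: derive the lookup context from the caller's
// context rather than context.Background(), with a shorter timeout.
// parentCtx is assumed to be the context available in the calling code path.
ctx, cancel := context.WithTimeout(parentCtx, 10*time.Second)
defer cancel()
```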


// Get all pods and find the one with the matching name
var pods v1.PodList
if err := cs.k8sReader.List(ctx, &pods); err != nil {
Collaborator


We're already doing all of this when we call getPodOwner, so couldn't we just grab the image at the same time that we're doing that?
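For instance, once the pod object has been located (as getPodOwner already does), the image could be read off the same object instead of listing pods again. A sketch, with a hypothetical helper name:

```go
import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nginxImageFromPod is an illustrative helper: given the pod that getPodOwner
// has already located, return its nginx container image so the version check
// can reuse that lookup instead of a second List call. Assumes the data plane
// container is named "nginx".
func nginxImageFromPod(pod *v1.Pod) (string, error) {
	for _, c := range pod.Spec.Containers {
		if c.Name == "nginx" {
			return c.Image, nil
		}
	}
	return "", fmt.Errorf("pod %s/%s has no nginx container", pod.Namespace, pod.Name)
}
```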

return nil
}

// getExpectedNginxImage retrieves the expected nginx container image from the deployment or daemonset.
Collaborator


This doesn't seem right, because the deployment won't be updated yet with the new image name until the control plane patches it, which happens after that initial config is sent. We should be checking the image value in the NginxProxy resource, and if that's empty, using our default value.
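In other words, the expected image would come from the NginxProxy settings rather than the Deployment spec, roughly like this (a sketch; how the image is surfaced on the resource is simplified, and defaultNginxImage stands in for whatever default NGF ships with):

```go
// expectedNginxImage is an illustrative sketch: use the image from the
// effective NginxProxy configuration if one is set, otherwise fall back to
// the built-in default. nginxProxyImage stands in for however the image is
// actually exposed on that resource.
func expectedNginxImage(nginxProxyImage, defaultNginxImage string) string {
	if nginxProxyImage == "" {
		return defaultNginxImage
	}
	return nginxProxyImage
}
```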

Contributor Author


Oh right, yeah, I wasn't thinking - I forgot that the existing nginx pods will try to connect before the control plane gets to reprovisioning; I was only thinking about the deployment itself scaling.

Contributor Author


In that case, I'm not really sure how to approach this. It feels like a chicken-and-egg scenario - the agent in the existing running NGINX pods could register before we've even processed the NginxProxy resource, right?

Collaborator


Maybe we could store the EffectiveNginxProxy for each Gateway on its associated Deployment object in the DeploymentStore. Then when we get an agent connection, do a lookup on the store for its config.

We already do a lookup into the deployment store in Subscribe(), so we're already waiting until that's been populated by the controller before we send anything to agent.

If there is no EffectiveNginxProxy set when we get the Deployment, then we assume the default value.
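A rough sketch of that idea (the store shape and accessor here are assumptions, not the existing DeploymentStore API):

```go
import "k8s.io/apimachinery/pkg/types"

// Sketch: the controller records the expected nginx image (derived from the
// Gateway's EffectiveNginxProxy, or the default) on the stored Deployment
// object; Subscribe() then reads it from the store it already consults before
// sending anything to agent.
type storedDeployment struct {
	expectedImage string
	// ... existing fields (config, file lock, etc.)
}

func expectedImageFor(
	store map[types.NamespacedName]*storedDeployment,
	parent types.NamespacedName,
	defaultImage string,
) string {
	dep, ok := store[parent]
	if !ok || dep.expectedImage == "" {
		return defaultImage
	}
	return dep.expectedImage
}
```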

Labels
bug Something isn't working release-notes
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

Don't send nginx config when image versions mismatch
2 participants