
Conversation

ciarams87
Contributor

Proposed changes

Problem: During an NGF upgrade, the new version of the control plane sends configuration to the old version of the nginx data plane before the data plane has been updated to the new version. This can cause incompatibility issues for a brief period, which could lead to disruptions.

Solution: Implement version validation by ensuring the pod's image matches the image in the current deployment/daemonset spec, so that configuration is not sent to nginx data plane pods still running the previous image version during upgrades.
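Roughly, the intended check boils down to the comparison sketched below. This is an illustration only; the PR's actual validateAndHandleVersionMismatch resolves the pod image and the expected image via the Kubernetes API, which is omitted here, and the helper name is hypothetical.

```go
package agent

import "fmt"

// checkNginxImageVersion is an illustrative sketch of the proposed gate: if
// the connecting pod's nginx image does not match the image recorded in its
// Deployment/DaemonSet spec, return an error so no config is sent to it yet.
// The name and signature are hypothetical; the PR resolves both values from
// the cluster before making this comparison.
func checkNginxImageVersion(podImage, expectedImage string) error {
	if podImage != expectedImage {
		return fmt.Errorf(
			"nginx image mismatch: pod is running %q but %q is expected; waiting for the pod to be upgraded",
			podImage, expectedImage,
		)
	}
	return nil
}
```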

Testing: Manually tested an upgrade in a cluster and verified that we don't send config to pods still running the previous image version.

Closes #3867

Checklist

Before creating a PR, run through this checklist and mark each as complete.

  • I have read the CONTRIBUTING doc
  • I have added tests that prove my fix is effective or that my feature works
  • I have checked that all unit tests pass after adding my changes
  • I have updated necessary documentation
  • I have rebased my branch onto main
  • I will ensure my PR is targeting the main branch and pulling from my branch from my own fork

Release notes

If this PR introduces a change that affects users and needs to be mentioned in the release notes,
please add a brief note that summarizes the change.

Added nginx image version validation during agent connections to prevent newer configuration from being sent to pods still running previous image versions during upgrades.

Implement version validation by ensuring the pod image matches the
image in the deployment/daemonset spec to prevent configuration
from being sent to nginx data plane pods still running the previous
image version during upgrades.
@ciarams87 ciarams87 marked this pull request as ready for review September 17, 2025 09:01
@ciarams87 ciarams87 requested a review from a team as a code owner September 17, 2025 09:01
@github-actions github-actions bot added the bug Something isn't working label Sep 17, 2025

codecov bot commented Sep 17, 2025

Codecov Report

❌ Patch coverage is 60.00000% with 36 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.63%. Comparing base (84a517f) to head (5cbcc17).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| internal/controller/nginx/agent/command.go | 60.00% | 26 Missing and 10 partials ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3928      +/-   ##
==========================================
- Coverage   86.79%   86.63%   -0.16%     
==========================================
  Files         128      128              
  Lines       16503    16593      +90     
  Branches       62       62              
==========================================
+ Hits        14323    14375      +52     
- Misses       2000     2028      +28     
- Partials      180      190      +10     


@@ -184,6 +192,13 @@ func (cs *commandService) Subscribe(in pb.CommandService_SubscribeServer) error
}

cs.logger.V(1).Info("Sending configuration to agent", "requestType", msg.Type)

if err := cs.validateAndHandleVersionMismatch(conn.PodName, conn.Parent); err != nil {
Collaborator


By handling the error above, outside of the loop, is it even possible to hit this?

@@ -264,6 +279,11 @@ func (cs *commandService) setInitialConfig(
deployment.FileLock.Lock()
defer deployment.FileLock.Unlock()

if err := cs.validateAndHandleVersionMismatch(conn.PodName, conn.Parent); err != nil {
Collaborator


Same question, since this gets called above before setInitialConfig.

// validatePodImageVersion checks if the pod's nginx container image version matches the expected version
// from its deployment. Returns an error if versions don't match.
func (cs *commandService) validatePodImageVersion(podName string, deploymentNSName types.NamespacedName) error {
ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
Collaborator


Let's reuse the context from the parent call instead of Background. Also, 30s seems like a long time for a timeout. I think elsewhere we've used 10s (though it's also a bit arbitrary).
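Something along these lines, assuming the parent call's context is plumbed through to this helper (the 10s value is only the reviewer's suggestion, not an established constant):

```go
// Sketch of the suggestion: derive the lookup context from the caller's
// context rather than context.Background(), with a shorter timeout.
// parentCtx is assumed to be the context available in the calling code path.
ctx, cancel := context.WithTimeout(parentCtx, 10*time.Second)
defer cancel()
```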


// Get all pods and find the one with the matching name
var pods v1.PodList
if err := cs.k8sReader.List(ctx, &pods); err != nil {
Collaborator


We're already doing all of this when we call getPodOwner, so couldn't we just grab the image at the same time that we're doing that?
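For instance, once the pod object has been located (as getPodOwner already does), the image could be read off the same object instead of listing pods again. A sketch, with a hypothetical helper name:

```go
import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// nginxImageFromPod is an illustrative helper: given the pod that getPodOwner
// has already located, return its nginx container image so the version check
// can reuse that lookup instead of a second List call. Assumes the data plane
// container is named "nginx".
func nginxImageFromPod(pod *v1.Pod) (string, error) {
	for _, c := range pod.Spec.Containers {
		if c.Name == "nginx" {
			return c.Image, nil
		}
	}
	return "", fmt.Errorf("pod %s/%s has no nginx container", pod.Namespace, pod.Name)
}
```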

return nil
}

// getExpectedNginxImage retrieves the expected nginx container image from the deployment or daemonset.
Collaborator


This doesn't seem right, because the deployment won't be updated yet with the new image name until the control plane patches it, which happens after that initial config is sent. We should be checking the image value in the NginxProxy resource, and if that's empty, using our default value.
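In other words, the expected image would come from the NginxProxy settings rather than the Deployment spec, roughly like this (a sketch; how the image is surfaced on the resource is simplified, and defaultNginxImage stands in for whatever default NGF ships with):

```go
// expectedNginxImage is an illustrative sketch: use the image from the
// effective NginxProxy configuration if one is set, otherwise fall back to
// the built-in default. nginxProxyImage stands in for however the image is
// actually exposed on that resource.
func expectedNginxImage(nginxProxyImage, defaultNginxImage string) string {
	if nginxProxyImage == "" {
		return defaultNginxImage
	}
	return nginxProxyImage
}
```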

Contributor Author


Oh right, yeah, I wasn't thinking - I forgot that the existing nginx pods will try to connect before the control plane gets to reprovisioning; I was only thinking about the deployment itself scaling.

Contributor Author


In that case, I'm not really sure how to approach this. It feels like a chicken-and-egg scenario - the agent in the existing running NGINX pods could register before we've even processed the NginxProxy resource, right?

Collaborator


Maybe we could store the EffectiveNginxProxy for each Gateway on its associated Deployment object in the DeploymentStore. Then when we get an agent connection, do a lookup on the store for its config.

We already do a lookup into the deployment store in Subscribe(), so we're already waiting until that's been populated by the controller before we send anything to agent.

If there is no EffectiveNginxProxy set when we get the Deployment, then we assume the default value.
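A rough sketch of that idea (the store shape and accessor here are assumptions, not the existing DeploymentStore API):

```go
import "k8s.io/apimachinery/pkg/types"

// Sketch: the controller records the expected nginx image (derived from the
// Gateway's EffectiveNginxProxy, or the default) on the stored Deployment
// object; Subscribe() then reads it from the store it already consults before
// sending anything to agent.
type storedDeployment struct {
	expectedImage string
	// ... existing fields (config, file lock, etc.)
}

func expectedImageFor(
	store map[types.NamespacedName]*storedDeployment,
	parent types.NamespacedName,
	defaultImage string,
) string {
	dep, ok := store[parent]
	if !ok || dep.expectedImage == "" {
		return defaultImage
	}
	return dep.expectedImage
}
```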

Labels
bug Something isn't working release-notes
Projects
Status: 🆕 New
Development

Successfully merging this pull request may close these issues.

Don't send nginx config when image versions mismatch
2 participants