@DavidZagury
Contributor

…r failures

Add enhanced error logging to diagnose vlanintf_validator failures when flushing neighbor entries for deleted VLAN interface IPs.

These minimal logging additions only activate on errors, providing zero overhead during normal operation while enabling root cause analysis when the validator fails.

This could help in the investigation of sonic-net/sonic-buildimage#23680

What I did

Added diagnostic logging to vlanintf_validator to identify the root cause when neighbor flush operations fail during VLAN interface IP removal. Previously, a failure produced only a generic error message from the Change Applier, with no record of the command that failed or its output, making it impossible to diagnose the actual failure reason.

How I did it

  1. Enhanced command_wrapper() function to:

    • Capture stdout and stderr from subprocess calls by setting capture_output=True
    • Log the exact command, return code, stdout, and stderr when commands fail
    • Only log when errors occur (returncode != 0), maintaining zero overhead during normal operation
  2. Added error logging in vlanintf_validator() to:

    • Log which specific interface and IP address failed during neighbor flush
    • Include the return code for correlation with command_wrapper logs
  3. Updated unit tests in service_validator_test.py to reflect the original command format
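The steps above can be illustrated with a minimal sketch. This is not the code from the PR diff: it assumes the wrapper receives the command as a list of arguments, uses Python's standard `logging` module as a stand-in for the syslog logger used by GenericConfigUpdater, and introduces a hypothetical helper `flush_neighbors()` to stand for the per-IP flush step inside `vlanintf_validator()`. The message formats mirror the sample log lines in this description.

```python
import logging
import subprocess

# Assumption: a plain stdlib logger stands in for the module's syslog logger.
logger = logging.getLogger("service_validator_sketch")


def command_wrapper(cmd):
    """Run cmd (a list of arguments) and return its exit code.

    Output is captured on every call, but nothing is logged unless
    the command exits with a non-zero return code.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        logger.error("Command failed: '%s', returncode: %d",
                     " ".join(cmd), result.returncode)
        if result.stdout:
            logger.error("stdout: %s", result.stdout.rstrip())
        if result.stderr:
            logger.error("stderr: %s", result.stderr.rstrip())
    return result.returncode


def flush_neighbors(ifname, ip_prefix):
    """Hypothetical helper: flush neighbor entries for one interface/IP pair.

    Logs the interface, the IP, and the return code on failure so the
    entry can be correlated with the command_wrapper log lines.
    """
    rc = command_wrapper(["ip", "neigh", "flush", "dev", ifname, ip_prefix])
    if rc != 0:
        logger.error(
            "vlanintf_validator: Failed to flush neighbors for %s %s, returncode=%d",
            ifname, ip_prefix, rc)
        return False
    return True
```

In this sketch a successful command emits no log records at all, which matches the "only log when errors occur" behavior described above.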

How to verify it

  1. Normal operation (no errors):

    • Apply a VLAN interface configuration change
    • Verify no additional log messages appear when operation succeeds
  2. Error scenario:

    • Trigger a vlanintf_validator failure (e.g., by manipulating interface state during config update)
    • Verify detailed error logs appear showing:
      • The exact command that failed
      • The return code
      • stdout/stderr output from the command

Previous command output (if the output of a command-line utility has changed)

2025 Oct 25 18:04:21.266397 r-sn4700-72 ERR GenericConfigUpdater: Change Applier: service invoked: generic_config_updater.services_validator.vlanintf_validator failed with ret=False

New command output (if the output of a command-line utility has changed)

2025 Oct 25 18:04:21.266397 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: Command failed: 'ip neigh flush dev Vlan1000 192.168.0.1/21', returncode: 2
2025 Oct 25 18:04:21.266398 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: stderr: Cannot find device "Vlan1000"
2025 Oct 25 18:04:21.266399 r-sn4700-72 ERR GenericConfigUpdater: Service Validator: vlanintf_validator: Failed to flush neighbors for Vlan1000 192.168.0.1/21, returncode=2
2025 Oct 25 18:04:21.266400 r-sn4700-72 ERR GenericConfigUpdater: Change Applier: service invoked: generic_config_updater.services_validator.vlanintf_validator failed with ret=False

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Collaborator

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
