Module graceful shutdown support #567
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
sonic_platform_base/module_base.py
Outdated
    # gnoi reboot pipe related
    GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe"
    GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe"
    GNOI_PORT = 50052
The default port is 8080; please read it from Redis, similar to https://github.com/sonic-net/sonic-utilities/blob/c78e0f73fece3fb1c6fb07718a64eddd337dae23/scripts/reboot_smartswitch_helper#L41C1-L45C2
Not required anymore. Cleaned it.
sonic_platform_base/module_base.py
Outdated
    GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe"
    GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe"
    GNOI_PORT = 50052
    GNOI_RESPONSE_TIMEOUT = 60  # seconds
Please read the timeout from platform.json similar to https://github.com/sonic-net/sonic-utilities/blob/c78e0f73fece3fb1c6fb07718a64eddd337dae23/scripts/reboot_smartswitch_helper#L109C7-L109C52
Done
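A minimal sketch of what reading the timeouts from platform.json with a constant fallback could look like. The function name, the `dpu_<op>_timeout` key naming, and the default values (taken from the "typical production values" in the PR description) are assumptions for illustration, not the code merged in this PR.

```python
import json
import os

# Fallback values; assumed from the figures quoted in the PR description.
DEFAULT_TIMEOUTS = {"startup": 180, "shutdown": 180, "reboot": 120}


def load_transition_timeouts(platform_json_path, defaults=DEFAULT_TIMEOUTS):
    """Return per-operation timeouts in seconds, falling back to defaults."""
    timeouts = dict(defaults)
    if not os.path.exists(platform_json_path):
        return timeouts
    try:
        with open(platform_json_path) as f:
            data = json.load(f)
    except (OSError, ValueError):
        # Unreadable or malformed file: keep the constants.
        return timeouts
    if not isinstance(data, dict):
        return timeouts
    for op in timeouts:
        # Hypothetical key naming; the real platform.json schema may differ.
        value = data.get(f"dpu_{op}_timeout")
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and value > 0:
            timeouts[op] = int(value)
    return timeouts
```

A caller would pass the platform-specific path, for example `/usr/share/sonic/device/{plat}/platform.json` with `{plat}` filled in.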
sonic_platform_base/module_base.py
Outdated
    This method performs the following steps:
    1. Sends a JSON-formatted reboot request to the gNOI reboot daemon via a named pipe.
    2. Waits for a response on a designated response pipe, with a timeout of 60 seconds.
Update the comment to reflect the timeout read from platform.json.
sonic_platform_base/module_base.py
Outdated
        """
        raise NotImplementedError

    def pre_shutdown_hook(self):
Where is this invoked?
Refactored; this is no longer valid.
/azp run
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.
sonic_platform_base/module_base.py
Outdated
    subtype = device_info.get_device_subtype()
    if subtype == "SmartSwitch" and not is_dpu():
        self.graceful_shutdown_handler()
    # Proceed to set the admin state using the platform-specific implementation
How is this supposed to work? super here will call set_admin_state of the base class, not of the derived one.
In the refactored implementation, the platform will call graceful_shutdown_handler().
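The reply suggests that the platform's own set_admin_state() invokes the shared handler, which avoids the super() problem raised above. A rough sketch of that arrangement, using stand-in classes (`VendorModule`, the `events` list, and the handler body are hypothetical and only illustrate the call order):

```python
class ModuleBase:
    """Stand-in for the shared base class, not the real sonic_platform_base code."""

    def graceful_shutdown_handler(self):
        # Hypothetical body: the real handler would wait for the DPU to shut down.
        self.events.append("graceful_shutdown")
        return True

    def set_admin_state(self, up):
        # The base class leaves the admin-state change to the platform.
        raise NotImplementedError


class VendorModule(ModuleBase):
    """Hypothetical platform implementation."""

    def __init__(self):
        self.events = []

    def set_admin_state(self, up):
        if not up:
            # The derived class calls the inherited handler before powering
            # down, so no super().set_admin_state() call is needed.
            self.graceful_shutdown_handler()
        self.events.append("admin_up" if up else "admin_down")
        return True
```

Because the derived class drives the sequence, method resolution always starts from the platform implementation and the base class never has to reach "down" into it.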
Co-authored-by: Copilot <[email protected]>
Fixing test failures
Pull Request Overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
tests/module_base_test.py
Outdated
    mb.ModuleBase._TRANSITION_TIMEOUTS_CACHE = None
    with patch("os.path.exists", return_value=False):
        d = Dummy()
        assert d._load_transition_timeouts()["reboot"] == 240
Copilot
AI
Oct 23, 2025
Magic number 240 should use the constant from ModuleBase._TRANSITION_TIMEOUT_DEFAULTS["reboot"] to ensure test stays in sync with actual default values and avoid hardcoding the same value in multiple places.
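The suggestion could look roughly like the following. The `ModuleBase` class here is a stand-in with only the pieces the assertion needs; the value 240 is copied from the assertion under review, and the real class in module_base.py may be structured differently.

```python
class ModuleBase:
    """Stand-in with just enough surface for the test, not the real class."""

    _TRANSITION_TIMEOUT_DEFAULTS = {"reboot": 240}

    def _load_transition_timeouts(self):
        # With no platform.json available, fall back to the defaults.
        return dict(self._TRANSITION_TIMEOUT_DEFAULTS)


def test_reboot_default_matches_constant():
    d = ModuleBase()
    # Compare against the constant instead of repeating the literal value.
    expected = ModuleBase._TRANSITION_TIMEOUT_DEFAULTS["reboot"]
    assert d._load_transition_timeouts()["reboot"] == expected
```

If the default changes later, only the constant needs updating and the test follows automatically.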
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
        return admin_state_success

    # Admin DOWN: Perform graceful shutdown first
    module_name = self.get_name()
For uniformity, invoke set_module_state_transition here itself before calling graceful shutdown handler
    PCIE_DETACH_INFO_TABLE_KEY = PCIE_DETACH_INFO_TABLE+"|"+pcie_string
    if not self.state_db_connector:
        self.state_db_connector = swsscommon.swsscommon.DBConnector("STATE_DB", 0)
    db = self._state_db_connector
Just delete lines 345 and 346 and replace state_db_connector with _state_db_connector, or better, keep the prior name state_db_connector() so that unnecessary code changes are avoided.
    self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "bus_info", pcie_string)
    self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "dpu_state", operation)
    # Set the PCI detach info for detaching operation
    db.set(db.STATE_DB, PCIE_DETACH_INFO_TABLE_KEY, {
hset is the right way to set the keys in the STATE_DB, please avoid unrelated changes.
    self.state_db_connector.delete(PCIE_DETACH_INFO_TABLE_KEY)
    # Delete the entire entry for attaching operation
    if hasattr(db, 'delete'):
        db.delete(db.STATE_DB, PCIE_DETACH_INFO_TABLE_KEY, "bus_info")
Line 351 is correct, as the delete removes the entire PCIE_DETACH_INFO_TABLE_KEY entry.
    # Atomically set transition state (handles race conditions with locking)
    # Note: This is safe to call even if caller already set transition state,
    # as the function is idempotent and will not overwrite existing valid transitions
    self.set_module_state_transition(db, module_name, "shutdown")
As mentioned above, this should be set before calling the graceful handler as we are clearing the flag outside itself
    try:
        oper = self.get_oper_status()
        if oper and str(oper).lower() == "offline":
            if not self.clear_module_state_transition(db, module_name):
This seems to be duplicate code, as the caller clears the state in both the False and True cases. Could you please re-check?
    # This handles cases where multiple agents might be waiting
    if self.is_module_state_transition_timed_out(db, module_name, shutdown_timeout):
        # Clear only if we can confirm it's actually timed out
        if not self.clear_module_state_transition(db, module_name):
Same here, caller is also clearing the state upon returning, this seems to be duplicate step

    # Final timeout check before clearing - use recorded start time, not our local wait time
    if self.is_module_state_transition_timed_out(db, module_name, shutdown_timeout):
        if not self.clear_module_state_transition(db, module_name):
same here, seems to be duplicate
    if up:
        # Admin UP: Set transition state to 'startup' before admin state change
        module_name = self.get_name()
        self.set_module_state_transition(self._state_db_connector, module_name, "startup")
Should we be aborting the operation if the set fails according to https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md HLD "Scenario 1"?
Also, we need to set this transition state before pre-shutdown, because if setting state fails, we need to abort the operation. Should we leave set/clear transitions to the caller to correctly do pre-shutdown and post-startup sequence?
    if t0.tzinfo is None:
        t0 = t0.replace(tzinfo=timezone.utc)

    age = (datetime.now(timezone.utc) - t0).total_seconds()
@rameshraghupathy could you address comment from Qi
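For context, the snippet under review computes the age of a transition from its recorded start time, treating a naive timestamp as UTC. A self-contained sketch of a timed-out check built on the same idea follows. The function name, the dict-shaped `entry`, and the choice to return False when the start time is missing or unparsable are assumptions, not the behavior of the merged helper.

```python
from datetime import datetime, timezone


def is_transition_timed_out(entry, timeout_secs, now=None):
    """Return True if an in-progress transition is older than timeout_secs."""
    if str(entry.get("state_transition_in_progress", "")).lower() != "true":
        return False
    raw = entry.get("transition_start_time")
    if not raw:
        return False  # assumed policy: no start time means no timeout decision
    try:
        t0 = datetime.fromisoformat(raw)
    except ValueError:
        return False  # assumed policy for unparsable timestamps
    if t0.tzinfo is None:
        # Treat naive timestamps as UTC, as in the reviewed snippet.
        t0 = t0.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    age = (now - t0).total_seconds()
    return age > timeout_secs
```

Passing `now` explicitly keeps the check deterministic in unit tests.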
Provide support for SmartSwitch DPU module graceful shutdown.
Description
Single source of truth for transitions
All components now use `sonic_platform_base.module_base.ModuleBase` helpers:
- `set_module_state_transition(db, name, transition_type)`
- `clear_module_state_transition(db, name)`
- `get_module_state_transition(db, name) -> dict`
- `is_module_state_transition_timed_out(db, name, timeout_secs) -> bool`
Eliminates duplicated logic and race-prone direct Redis writes.
Correct table everywhere
- `CHASSIS_MODULE_TABLE` (replaces `CHASSIS_MODULE_INFO_TABLE`).

Ownership & lifecycle
The initiator of an operation (`startup`/`shutdown`/`reboot`) sets:
- `state_transition_in_progress=True`
- `transition_type=<op>`
- `transition_start_time=<utc-iso8601>`
The platform (`set_admin_state()`) is responsible for clearing:
- `state_transition_in_progress=False`
- `transition_end_time=<epoch>` (or similar end stamp).
CLI pre-clears only when a prior transition is timed out.
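The ownership split above can be sketched with a plain dict standing in for the Redis table. The function names and the dict-backed `db` are assumptions for illustration; the real helpers take a DB connector and may store values differently.

```python
import time
from datetime import datetime, timezone

TABLE = "CHASSIS_MODULE_TABLE"


def set_transition(db, name, transition_type):
    """Initiator side: record that an operation has started."""
    entry = db.setdefault(f"{TABLE}|{name}", {})
    entry.update({
        "state_transition_in_progress": "True",
        "transition_type": transition_type,
        "transition_start_time": datetime.now(timezone.utc).isoformat(),
    })


def clear_transition(db, name):
    """Platform side: mark the operation finished and add an end stamp."""
    entry = db.setdefault(f"{TABLE}|{name}", {})
    entry["state_transition_in_progress"] = "False"
    entry["transition_end_time"] = str(int(time.time()))


# Usage: the initiator sets the flag, the platform clears it on completion.
db = {}
set_transition(db, "DPU0", "shutdown")
clear_transition(db, "DPU0")
```

Keeping the clear in exactly one place (the platform) is what lets other agents, such as the CLI, limit themselves to clearing only transitions that have timed out.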

Timeouts & policy
- Platform JSON path only: `/usr/share/sonic/device/{plat}/platform.json`; else constants.
- Typical production values used: `startup: 180s`, `shutdown: 180s` (≈ `graceful_wait 60s + power 120s`), `reboot: 120s`.
- Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and is implemented inside the platform `set_admin_state()`, not in ModuleBase.

Boot behavior
- `chassisd` on start: `set_initial_dpu_admin_state()`, which marks transitions via ModuleBase before calling the platform `set_admin_state()`.

gNOI shutdown daemon
- Listens on `CHASSIS_MODULE_TABLE` and triggers only when `state_transition_in_progress=True` and `transition_type=shutdown`.
- Never clears the flag (ownership stays with the platform).
- Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
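The daemon's trigger condition can be expressed as a small read-only predicate. This is a sketch over a dict-shaped table entry; the function name and the string encoding of the flag are assumptions.

```python
def should_trigger_shutdown(entry):
    """Return True only for an in-progress shutdown transition.

    The entry is never modified here: clearing the flag stays with the platform.
    """
    in_progress = str(entry.get("state_transition_in_progress", "")).lower() == "true"
    return in_progress and entry.get("transition_type") == "shutdown"
```

For example, an entry with `transition_type=startup` or with the flag already cleared would be ignored by the daemon.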

CLI (`config chassis modules …`)
- `is_module_state_transition_timed_out()` → auto-clear then proceed.
- `startup`/`shutdown`; platform clears on completion.

Redis robustness
- `hset(mapping=...)` usage.

Race reduction & consistency
- `transition_start_time`; clears may add an end stamp.

Change scope
HLD: sonic-net/SONiC#1991
sonic-host-services: sonic-net/sonic-host-services#255
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
sonic-utilities: sonic-net/sonic-utilities#4031
How Has This Been Tested?
Issue the "config chassis modules shutdown DPUx" command
Verify that the DPU module is gracefully shut down by checking the logs in /var/log/syslog on both the NPU and the DPU