-
Notifications
You must be signed in to change notification settings - Fork 132
Module graceful shutdown support #255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Module graceful shutdown support #255
Conversation
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
Do you mind pasting the steps and output for testing (commands) in the PR description |
scripts/gnoi-reboot-daemon
Outdated
| try: | ||
| msg = json.loads(line) | ||
| dpu_ip = msg["dpu_ip"] | ||
| port = msg.get("port", 50052) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
port information?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam No longer exists
scripts/gnoi-reboot-daemon
Outdated
| msg = json.loads(line) | ||
| dpu_ip = msg["dpu_ip"] | ||
| port = msg.get("port", 50052) | ||
| method = msg.get("method", 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HALT method is 3
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam No longer this is valid and we use HALT method 3
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
scripts/check_platform.sh
Outdated
| if [[ "$subtype" == "SmartSwitch" && "$is_dpu" != "True" ]]; then | ||
| exit 0 | ||
| else | ||
| echo "gnoi-reboot-daemon is intended for SmartSwitch platforms only." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be debug level log, we don't want to see this on other platforms unnecessarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vvolam Not valid anymore
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
46fc748 to
8d829cc
Compare
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 7 out of 7 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for addressing! LGTM.
Co-authored-by: Copilot <[email protected]>
|
/azp run |
Co-authored-by: Copilot <[email protected]>
|
Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command. |
|
/azp run |
|
Azure Pipelines successfully started running 1 pipeline(s). |
Provide support for SmartSwitch DPU module graceful shutdown.
Description:
Single source of truth for transitions
All components now use
sonic_platform_base.module_base.ModuleBasehelpers:set_module_state_transition(db, name, transition_type)clear_module_state_transition(db, name)get_module_state_transition(db, name) -> dictis_module_state_transition_timed_out(db, name, timeout_secs) -> boolEliminates duplicated logic and race-prone direct Redis writes.
Correct table everywhere
CHASSIS_MODULE_TABLE(replacesCHASSIS_MODULE_INFO_TABLE).Ownership & lifecycle
The initiator of an operation (
startup/shutdown/reboot) sets:state_transition_in_progress=Truetransition_type=<op>transition_start_time=<utc-iso8601>The platform (
set_admin_state()) is responsible for clearing:state_transition_in_progress=Falsetransition_end_time=<epoch>(or similar end stamp).CLI pre-clears only when a prior transition is timed out.
Timeouts & policy
Platform JSON path only:
/usr/share/sonic/device/{plat}/platform.json; else constants.Typical production values used:
startup: 180s,shutdown: 180s(≈graceful_wait 60s + power 120s),reboot: 120s.Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform
set_admin_state()—not in ModuleBase.Boot behavior
chassisdon start:set_initial_dpu_admin_state()which marks transitions via ModuleBase before calling platformset_admin_state().gNOI shutdown daemon
Listens on
CHASSIS_MODULE_TABLEand triggers only when:state_transition_in_progress=Trueandtransition_type=shutdown.Never clears the flag (ownership stays with the platform).
Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).
CLI (
config chassis modules …)is_module_state_transition_timed_out()→ auto-clear then proceed.startup/shutdown; platform clears on completion.Redis robustness
hset(mapping=...)usage.Race reduction & consistency
transition_start_time; clears may add an end stamp.Change scope
HLD: # 1991 sonic-net/SONiC#1991
sonic-platform-common: #567 sonic-net/sonic-platform-common#567
sonic-utilities: sonic-net/sonic-utilities#4031
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
How to verify it
Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU