Skip to content

Conversation

@rameshraghupathy
Copy link
Contributor

@rameshraghupathy rameshraghupathy commented May 14, 2025

Provide support for SmartSwitch DPU module graceful shutdown.

Description

  • Single source of truth for transitions

    • All components now use sonic_platform_base.module_base.ModuleBase helpers:

      • set_module_state_transition(db, name, transition_type)
      • clear_module_state_transition(db, name)
      • get_module_state_transition(db, name) -> dict
      • is_module_state_transition_timed_out(db, name, timeout_secs) -> bool
    • Eliminates duplicated logic and race-prone direct Redis writes.

  • Correct table everywhere

    • Standardized on CHASSIS_MODULE_TABLE (replaces CHASSIS_MODULE_INFO_TABLE).
    • HLD mismatch addressed in code (HLD fix tracked separately).
  • Ownership & lifecycle

    • The initiator of an operation (startup/shutdown/reboot) sets:

      • state_transition_in_progress=True
      • transition_type=<op>
      • transition_start_time=<utc-iso8601>
    • The platform (set_admin_state()) is responsible for clearing:

      • state_transition_in_progress=False
      • optionally transition_end_time=<epoch> (or similar end stamp).
    • CLI pre-clears only when a prior transition is timed out.

  • Timeouts & policy

    • Platform JSON path only: /usr/share/sonic/device/{plat}/platform.json; else constants.

    • Typical production values used:

      • startup: 180s, shutdown: 180s (≈ graceful_wait 60s + power 120s), reboot: 120s.
    • Graceful wait (e.g., waiting for “Graceful shutdown complete”) is a platform policy and implemented inside platform set_admin_state()—not in ModuleBase.

  • Boot behavior

    • chassisd on start:

      1. Clears stale flags once (centralized sweep).
      2. Runs set_initial_dpu_admin_state() which marks transitions via ModuleBase before calling platform set_admin_state().
      3. Leaves clearing to the platform or to well-defined status transitions (ONLINE/OFFLINE) where appropriate.
  • gNOI shutdown daemon

    • Listens on CHASSIS_MODULE_TABLE and triggers only when:

      • state_transition_in_progress=True and transition_type=shutdown.
    • Never clears the flag (ownership stays with the platform).

    • Bounded RPC timeouts and robust Redis access (swsssdk/swsscommon).

  • CLI (config chassis modules …)

    • Uses ModuleBase APIs for all set/get/timeout checks.
    • If a previous transition is stuck, is_module_state_transition_timed_out() → auto-clear then proceed.
    • Sets transition at the start of startup/shutdown; platform clears on completion.
    • Fabric card flow retained; edits are surgical.
  • Redis robustness

    • Helpers handle both stacks (swsssdk/swsscommon); no hset(mapping=...) usage.
    • Consistent HGETALL/HSET paths; resilient to connector differences.
  • Race reduction & consistency

    • Centralized writes prevent multi-writer races.
    • All transition writes include transition_start_time; clears may add an end stamp.
    • Existing PCI/file-lock logic left intact; unrelated behavior unchanged.
  • Change scope

    • Minimal, targeted diffs.
    • No background tasks added, no broad refactors beyond transition handling.
    • Behavior changes are limited to making transition semantics correct and uniform across repos.

HLD: # 1991 sonic-net/SONiC#1991
sonic-host-services: #255 sonic-net/sonic-host-services#255
sonic-platform-daemons: sonic-net/sonic-platform-daemons#667
sonic-utilities: sonic-net/sonic-utilities#4031

How Has This Been Tested?

Issue the "config chassis modules shutdown DPUx" command
Verify the DPU module is gracefully shut by checking the logs in /var/log/syslog on both NPU and DPU

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

# gnoi reboot pipe related
GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe"
GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe"
GNOI_PORT = 50052
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required anymore. Cleaned it.

GNOI_REBOOT_PIPE_PATH = "/host/gnoi_reboot.pipe"
GNOI_REBOOT_RESPONSE_PIPE_PATH = "/host/gnoi_reboot_response.pipe"
GNOI_PORT = 50052
GNOI_RESPONSE_TIMEOUT = 60 # seconds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

This method performs the following steps:
1. Sends a JSON-formatted reboot request to the gNOI reboot daemon via a named pipe.
2. Waits for a response on a designated response pipe, with a timeout of 60 seconds.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update the comment accordingly to platform.json timeout

"""
raise NotImplementedError

def pre_shutdown_hook(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is this invoked?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

refactored and not valid anymore

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines will not run the associated pipelines, because the pull request was updated after the run command was issued. Review the pull request again and issue a new run command.

subtype = device_info.get_device_subtype()
if subtype == "SmartSwitch" and not is_dpu():
self.graceful_shutdown_handler()
# Proceed to set the admin state using the platform-specific implementation
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is this supposed to work? super here will call set_admin_state of the base class, not of the derived one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the refactored implementation the platform will graceful_shutdown_handler()

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

Fixing test failures
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@vvolam vvolam requested a review from Copilot October 23, 2025 18:18
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

mb.ModuleBase._TRANSITION_TIMEOUTS_CACHE = None
with patch("os.path.exists", return_value=False):
d = Dummy()
assert d._load_transition_timeouts()["reboot"] == 240
Copy link

Copilot AI Oct 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Magic number 240 should use the constant from ModuleBase._TRANSITION_TIMEOUT_DEFAULTS["reboot"] to ensure test stays in sync with actual default values and avoid hardcoding the same value in multiple places.

Copilot uses AI. Check for mistakes.
@mssonicbld
Copy link
Collaborator

/azp run

1 similar comment
@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@mssonicbld
Copy link
Collaborator

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

return admin_state_success

# Admin DOWN: Perform graceful shutdown first
module_name = self.get_name()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For uniformity, invoke set_module_state_transition here itself before calling graceful shutdown handler

PCIE_DETACH_INFO_TABLE_KEY = PCIE_DETACH_INFO_TABLE+"|"+pcie_string
if not self.state_db_connector:
self.state_db_connector = swsscommon.swsscommon.DBConnector("STATE_DB", 0)
db = self._state_db_connector
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just delete line 345 and 346 and simply replace state_db_connector with _state_db_connector or better just use prior name of state_db_connector(), so unnecessary code changes can be avoided.

self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "bus_info", pcie_string)
self.state_db_connector.hset(PCIE_DETACH_INFO_TABLE_KEY, "dpu_state", operation)
# Set the PCI detach info for detaching operation
db.set(db.STATE_DB, PCIE_DETACH_INFO_TABLE_KEY, {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hset is the right way to set the keys in the STATE_DB, please avoid unrelated changes.

self.state_db_connector.delete(PCIE_DETACH_INFO_TABLE_KEY)
# Delete the entire entry for attaching operation
if hasattr(db, 'delete'):
db.delete(db.STATE_DB, PCIE_DETACH_INFO_TABLE_KEY, "bus_info")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 351 is correct, as the delete deletes entire PCIE_DETACH_INFO_TABLE_KEY

# Atomically set transition state (handles race conditions with locking)
# Note: This is safe to call even if caller already set transition state,
# as the function is idempotent and will not overwrite existing valid transitions
self.set_module_state_transition(db, module_name, "shutdown")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As mentioned above, this should be set before calling the graceful handler as we are clearing the flag outside itself

try:
oper = self.get_oper_status()
if oper and str(oper).lower() == "offline":
if not self.clear_module_state_transition(db, module_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be duplicate code as the caller is clearing in False and True case, could you please re-check?

# This handles cases where multiple agents might be waiting
if self.is_module_state_transition_timed_out(db, module_name, shutdown_timeout):
# Clear only if we can confirm it's actually timed out
if not self.clear_module_state_transition(db, module_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, caller is also clearing the state upon returning, this seems to be duplicate step


# Final timeout check before clearing - use recorded start time, not our local wait time
if self.is_module_state_transition_timed_out(db, module_name, shutdown_timeout):
if not self.clear_module_state_transition(db, module_name):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, seems to be duplicate

if up:
# Admin UP: Set transition state to 'startup' before admin state change
module_name = self.get_name()
self.set_module_state_transition(self._state_db_connector, module_name, "startup")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we be aborting the operation if the set fails according to https://github.com/sonic-net/SONiC/blob/master/doc/smart-switch/graceful-shutdown/graceful-shutdown.md HLD "Scenario 1"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we need to set this transition state before pre-shutdown, because if setting state fails, we need to abort the operation. Should we leave set/clear transitions to the caller to correctly do pre-shutdown and post-startup sequence?

if t0.tzinfo is None:
t0 = t0.replace(tzinfo=timezone.utc)

age = (datetime.now(timezone.utc) - t0).total_seconds()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rameshraghupathy could you address comment from Qi

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants