Open
63 commits
4e3a096
Did the instrumentation for gnoi-reboot.service
rameshraghupathy May 13, 2025
4a7e6bf
Modified based on the Redis based IPC
rameshraghupathy May 21, 2025
c2f9cb8
Modified based on the Redis based IPC
rameshraghupathy May 21, 2025
db7848f
made check_platform.sh executable
rameshraghupathy May 21, 2025
f946e72
Did some cleanup
rameshraghupathy May 21, 2025
4434463
Draft version. Need to test again
rameshraghupathy Jul 7, 2025
91897ed
Fixing test failure
rameshraghupathy Jul 10, 2025
118a27a
Working on coverage
rameshraghupathy Jul 10, 2025
1654d44
Working on coverage
rameshraghupathy Jul 10, 2025
b1ca2a3
Merge branch 'sonic-net:master' into graceful-shutdown
rameshraghupathy Aug 12, 2025
f6936e5
refactored based on the revised HLD
rameshraghupathy Aug 12, 2025
4b709ea
refactored based on the revised HLD
rameshraghupathy Aug 14, 2025
d510290
Fixing ut
rameshraghupathy Aug 20, 2025
dfa9761
Fixing ut
rameshraghupathy Aug 20, 2025
380b5f9
Improving coverage
rameshraghupathy Aug 20, 2025
62450d6
Refactored for graceful shutdown
rameshraghupathy Aug 24, 2025
a7f1a39
Refactored for graceful shutdown
rameshraghupathy Aug 25, 2025
f45358a
Fixing ut
rameshraghupathy Aug 26, 2025
14f20e6
Fixing ut
rameshraghupathy Aug 26, 2025
8d647fa
Fixing ut
rameshraghupathy Aug 26, 2025
e2c2a71
Fixing ut
rameshraghupathy Aug 26, 2025
ada6883
Fixing ut
rameshraghupathy Aug 26, 2025
ca6d463
Fixing ut
rameshraghupathy Aug 26, 2025
e2bbe5f
Fixing ut
rameshraghupathy Aug 26, 2025
28bc69b
Fixing ut
rameshraghupathy Aug 26, 2025
29183bd
Fixing ut
rameshraghupathy Aug 26, 2025
e228ffb
workign on coverage
rameshraghupathy Aug 26, 2025
37d73ce
workign on coverage
rameshraghupathy Aug 26, 2025
601cb90
workign on coverage
rameshraghupathy Aug 26, 2025
dfda223
workign on coverage
rameshraghupathy Aug 26, 2025
fb51c33
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 8, 2025
4650d23
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 8, 2025
dece2a0
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 9, 2025
6a8524f
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 9, 2025
a381400
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 9, 2025
da39422
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
d5ab77b
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
78de30a
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
39db631
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
ee497b9
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
e5558b6
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
05571bb
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
7285eda
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
2009207
Refactored for graceful shutdown, fixing UT - Final round of tweaks
rameshraghupathy Sep 10, 2025
2470888
Addressed copilot PR comments
rameshraghupathy Sep 15, 2025
c62e79f
Made the timeout logic common
rameshraghupathy Sep 20, 2025
2106099
working on coverage
rameshraghupathy Sep 20, 2025
ffe85ec
working on coverage
rameshraghupathy Sep 20, 2025
22654c8
working on coverage
rameshraghupathy Sep 20, 2025
cac4b67
Addressed PR comments
rameshraghupathy Sep 26, 2025
6d46f60
Addressed review comments related to refactoring
rameshraghupathy Oct 1, 2025
4b092dc
Fixing test failures
rameshraghupathy Oct 1, 2025
b0bfd18
Fixing test failures
rameshraghupathy Oct 1, 2025
aeac810
Addressed review comments related to refactoring
rameshraghupathy Oct 1, 2025
5c98c46
Addressing review comments
rameshraghupathy Oct 21, 2025
8d829cc
Addressing review comments
rameshraghupathy Oct 21, 2025
942874c
Addressing review comments
rameshraghupathy Oct 21, 2025
d1533a8
Addressing review comments
rameshraghupathy Oct 21, 2025
8454a37
Addressing review comments
rameshraghupathy Oct 21, 2025
7e3bf57
Addressing review comments
rameshraghupathy Oct 21, 2025
3c93891
Addressing review comments
rameshraghupathy Oct 21, 2025
b1f6139
Update scripts/wait-for-sonic-core.sh
rameshraghupathy Oct 21, 2025
6a76f95
Update scripts/wait-for-sonic-core.sh
rameshraghupathy Oct 21, 2025
1 change: 1 addition & 0 deletions data/debian/rules
@@ -20,5 +20,6 @@ override_dh_installsystemd:
	dh_installsystemd --no-start --name=procdockerstatsd
	dh_installsystemd --no-start --name=determine-reboot-cause
	dh_installsystemd --no-start --name=process-reboot-cause
	dh_installsystemd --no-start --name=gnoi-shutdown
	dh_installsystemd $(HOST_SERVICE_OPTS) --name=sonic-hostservice

16 changes: 16 additions & 0 deletions data/debian/sonic-host-services-data.gnoi-shutdown.service
@@ -0,0 +1,16 @@
[Unit]
Description=gNOI based DPU Graceful Shutdown Daemon
Requires=database.service
Wants=network-online.target
After=network-online.target database.service

[Service]
Type=simple
ExecStartPre=/usr/local/bin/check_platform.py
ExecStartPre=/usr/local/bin/wait-for-sonic-core.sh
ExecStart=/usr/local/bin/gnoi-shutdown-daemon
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
36 changes: 36 additions & 0 deletions scripts/check_platform.py
@@ -0,0 +1,36 @@
#!/usr/bin/env python3
"""
Check if the current platform is a SmartSwitch NPU (not DPU).
Exit 0 if SmartSwitch NPU, exit 1 otherwise.
"""
import sys
import subprocess

def main():
    try:
        # Get subtype from config
        result = subprocess.run(
            ['sonic-cfggen', '-d', '-v', 'DEVICE_METADATA.localhost.subtype'],
            capture_output=True,
            text=True,
            timeout=5
        )
        subtype = result.stdout.strip()

        # Check if DPU
        try:
            from utilities_common.chassis import is_dpu
            is_dpu_platform = is_dpu()
        except Exception:
            is_dpu_platform = False

        # Check if SmartSwitch NPU (not DPU)
        if subtype == "SmartSwitch" and not is_dpu_platform:
            sys.exit(0)
        else:
            sys.exit(1)
    except Exception:
        sys.exit(1)

if __name__ == "__main__":
    main()
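
The exit-code decision in check_platform.py can be separated from the subprocess and import calls so it is easy to unit test. The following is only a sketch of that idea, not code from this PR; the helper names `is_smartswitch_npu` and `exit_code_for` are hypothetical.

```python
def is_smartswitch_npu(subtype: str, is_dpu_platform: bool) -> bool:
    """Return True only for a SmartSwitch host that is not itself a DPU."""
    # Strip whitespace because command output usually ends with a newline.
    return subtype.strip() == "SmartSwitch" and not is_dpu_platform


def exit_code_for(subtype: str, is_dpu_platform: bool) -> int:
    """Map the decision to the exit status that ExecStartPre would see."""
    return 0 if is_smartswitch_npu(subtype, is_dpu_platform) else 1
```

With this split, `main()` would only gather the two inputs and call `sys.exit(exit_code_for(subtype, is_dpu_platform))`.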
331 changes: 331 additions & 0 deletions scripts/gnoi_shutdown_daemon.py
@@ -0,0 +1,331 @@
#!/usr/bin/env python3
"""
gnoi-shutdown-daemon

Listens for CHASSIS_MODULE_TABLE state changes in STATE_DB and, when a
SmartSwitch DPU module enters a "shutdown" transition, issues a gNOI Reboot
(method HALT) toward that DPU and polls RebootStatus until complete or timeout.

Additionally, a lightweight background thread periodically enforces timeout
clearing of stuck transitions (startup/shutdown/reboot) using ModuleBase’s
common APIs, so all code paths (CLI, chassisd, platform, gNOI) benefit.
"""

import json
import time
import subprocess
import socket
import os
import threading

REBOOT_RPC_TIMEOUT_SEC = 60 # gNOI System.Reboot call timeout
STATUS_POLL_TIMEOUT_SEC = 60 # overall time - polling RebootStatus
STATUS_POLL_INTERVAL_SEC = 5 # delay between polls
STATUS_RPC_TIMEOUT_SEC = 10 # per RebootStatus RPC timeout
REBOOT_METHOD_HALT = 3 # gNOI System.Reboot method: HALT

from swsscommon.swsscommon import SonicV2Connector
from sonic_py_common import syslogger
# Centralized transition API on ModuleBase
from sonic_platform_base.module_base import ModuleBase

_v2 = None
SYSLOG_IDENTIFIER = "gnoi-shutdown-daemon"
logger = syslogger.SysLogger(SYSLOG_IDENTIFIER)

# ##########
# helper
# ##########
def is_tcp_open(host: str, port: int, timeout: float = None) -> bool:
    """Fast reachability test for <host,port>. No side effects."""
    if timeout is None:
        timeout = float(os.getenv("GNOI_DIAL_TIMEOUT", "1.0"))
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# ##########
# DB helpers
# ##########

def _get_dbid_state(db) -> int:
    """Resolve STATE_DB numeric ID across connector implementations."""
    try:
        return db.get_dbid(db.STATE_DB)
    except Exception:
        # Default STATE_DB index in SONiC redis instances
        return 6

def _get_pubsub(db):
    """Return a pubsub object for keyspace notifications.

    Prefer a direct pubsub() if the connector exposes one; otherwise,
    fall back to the raw redis client's pubsub().
    """
    try:
        return db.pubsub()  # some connectors expose pubsub()
    except AttributeError:
        client = db.get_redis_client(db.STATE_DB)
        return client.pubsub()

def _cfg_get_entry(table, key):
    """Read CONFIG_DB row via unix-socket V2 API and normalize to str."""
    global _v2
    if _v2 is None:
        from swsscommon import swsscommon
        _v2 = swsscommon.SonicV2Connector(use_unix_socket_path=True)
        _v2.connect(_v2.CONFIG_DB)
    raw = _v2.get_all(_v2.CONFIG_DB, f"{table}|{key}") or {}

    def _s(x):
        return x.decode("utf-8", "ignore") if isinstance(x, (bytes, bytearray)) else x

    return {_s(k): _s(v) for k, v in raw.items()}

# ############
# gNOI helpers
# ############

def execute_gnoi_command(command_args, timeout_sec=REBOOT_RPC_TIMEOUT_SEC):
    """Run gnoi_client with a timeout; return (rc, stdout, stderr)."""
    try:
        result = subprocess.run(command_args, capture_output=True, text=True, timeout=timeout_sec)
        return result.returncode, result.stdout.strip(), result.stderr.strip()
    except subprocess.TimeoutExpired as e:
        return -1, "", f"Command timed out after {int(e.timeout)}s."
    except Exception as e:
        return -2, "", f"Command failed: {e}"

def get_dpu_ip(dpu_name: str):
    entry = _cfg_get_entry("DHCP_SERVER_IPV4_PORT", f"bridge-midplane|{dpu_name.lower()}")
    return entry.get("ips@")

def get_dpu_gnmi_port(dpu_name: str):
    variants = [dpu_name, dpu_name.lower(), dpu_name.upper()]
    for k in variants:
        entry = _cfg_get_entry("DPU_PORT", k)
        if entry and entry.get("gnmi_port"):
            return str(entry.get("gnmi_port"))
    return "8080"

# ###############
# Timeout Enforcer
# ###############
class TimeoutEnforcer(threading.Thread):
    """
    Periodically enforces CHASSIS_MODULE_TABLE transition timeouts for all modules.
    Uses ModuleBase’s common helpers so all code paths benefit (CLI, chassisd, platform, gNOI).
    """
    def __init__(self, db, module_base: ModuleBase, interval_sec: int = 5):
        super().__init__(daemon=True, name="timeout-enforcer")
        self._db = db
        self._mb = module_base
        self._interval = max(1, int(interval_sec))
        # Deliberately not named `_stop`: that attribute would shadow the
        # internal threading.Thread._stop() method, which some CPython
        # versions call during join()/is_alive().
        self._stop_event = threading.Event()

    def stop(self):
        self._stop_event.set()

    def _list_modules(self):
        """Discover module names by scanning CHASSIS_MODULE_TABLE keys."""
        try:
            client = self._db.get_redis_client(self._db.STATE_DB)
            keys = client.keys("CHASSIS_MODULE_TABLE|*")
            out = []
            for k in keys or []:
                if isinstance(k, (bytes, bytearray)):
                    k = k.decode("utf-8", "ignore")
                _, _, name = k.partition("|")
                if name:
                    out.append(name)
            return sorted(out)
        except Exception:
            return []

    def run(self):
        while not self._stop_event.is_set():
            try:
                for name in self._list_modules():
                    try:
                        entry = self._mb.get_module_state_transition(self._db, name) or {}
                        inprog = str(entry.get("state_transition_in_progress", "")).lower() in ("1", "true", "yes", "on")
                        if not inprog:
                            continue
                        op = entry.get("transition_type", "startup")
                        timeouts = self._mb._load_transition_timeouts()
                        # Fallback safely to defaults if key missing/unknown
                        timeout_sec = int(timeouts.get(op, ModuleBase._TRANSITION_TIMEOUT_DEFAULTS.get(op, 300)))
                        if self._mb.is_module_state_transition_timed_out(self._db, name, timeout_sec):
                            success = self._mb.clear_module_state_transition(self._db, name)
                            if success:
                                logger.log_info(f"Cleared transition after timeout for {name}")
                            else:
                                logger.log_warning(f"Failed to clear transition timeout for {name}")
                    except Exception as e:
                        # Keep the loop resilient; log at debug level to limit noise
                        logger.log_debug(f"Timeout enforce error for {name}: {e}")
            except Exception as e:
                logger.log_debug(f"TimeoutEnforcer loop error: {e}")
            self._stop_event.wait(self._interval)

# ###############
# gNOI Reboot Handler
# ###############
class GnoiRebootHandler:
    """
    Handles gNOI reboot operations for DPU modules, including sending reboot commands
    and polling for status completion.
    """
    def __init__(self, db, module_base: ModuleBase):
        self._db = db
        self._mb = module_base

    def handle_transition(self, dpu_name: str, transition_type: str) -> bool:
        """
        Handle a shutdown or reboot transition for a DPU module.
        Returns True if the operation completed successfully, False otherwise.
        """
        try:
            dpu_ip = get_dpu_ip(dpu_name)
            port = get_dpu_gnmi_port(dpu_name)
            if not dpu_ip:
                raise RuntimeError("DPU IP not found")
        except Exception as e:
            logger.log_error(f"Error getting DPU IP or port for {dpu_name}: {e}")
            return False

        # skip if TCP is not reachable
        if not is_tcp_open(dpu_ip, int(port)):
            logger.log_info(f"Skipping {dpu_name}: {dpu_ip}:{port} unreachable (offline/down)")
            return False

        # Send Reboot HALT
        if not self._send_reboot_command(dpu_name, dpu_ip, port):
            return False

        # Poll RebootStatus
        reboot_successful = self._poll_reboot_status(dpu_name, dpu_ip, port)

        if reboot_successful:
            self._handle_successful_reboot(dpu_name, transition_type)
        else:
            logger.log_warning(f"Status polling of halting the services on DPU timed out for {dpu_name}.")

        return reboot_successful

    def _send_reboot_command(self, dpu_name: str, dpu_ip: str, port: str) -> bool:
        """Send gNOI Reboot HALT command to the DPU."""
        logger.log_notice(f"Issuing gNOI Reboot to {dpu_ip}:{port}")
        reboot_cmd = [
            "docker", "exec", "gnmi", "gnoi_client",
            f"-target={dpu_ip}:{port}",
            "-logtostderr", "-notls",
            "-module", "System",
            "-rpc", "Reboot",
            "-jsonin", json.dumps({"method": REBOOT_METHOD_HALT, "message": "Triggered by SmartSwitch graceful shutdown"})
        ]
        rc, out, err = execute_gnoi_command(reboot_cmd, timeout_sec=REBOOT_RPC_TIMEOUT_SEC)
        if rc != 0:
            logger.log_error(f"gNOI Reboot command failed for {dpu_name}: {err or out}")
            return False
        return True

    def _poll_reboot_status(self, dpu_name: str, dpu_ip: str, port: str) -> bool:
        """Poll RebootStatus until completion or timeout."""
        logger.log_notice(
            f"Polling RebootStatus for {dpu_name} at {dpu_ip}:{port} "
            f"(timeout {STATUS_POLL_TIMEOUT_SEC}s, interval {STATUS_POLL_INTERVAL_SEC}s)"
        )
        deadline = time.monotonic() + STATUS_POLL_TIMEOUT_SEC
        status_cmd = [
            "docker", "exec", "gnmi", "gnoi_client",
            f"-target={dpu_ip}:{port}",
            "-logtostderr", "-notls",
            "-module", "System",
            "-rpc", "RebootStatus"
        ]
        while time.monotonic() < deadline:
            rc_s, out_s, err_s = execute_gnoi_command(status_cmd, timeout_sec=STATUS_RPC_TIMEOUT_SEC)
            if rc_s == 0 and out_s and ("reboot complete" in out_s.lower()):
                return True
            time.sleep(STATUS_POLL_INTERVAL_SEC)
        return False

    def _handle_successful_reboot(self, dpu_name: str, transition_type: str):
        """Handle successful reboot completion, including clearing transition flags if needed."""
        if transition_type == "reboot":
            success = self._mb.clear_module_state_transition(self._db, dpu_name)
            if success:
                logger.log_info(f"Cleared transition for {dpu_name}")
            else:
                logger.log_warning(f"Failed to clear transition for {dpu_name}")
        logger.log_info(f"Halting the services on DPU is successful for {dpu_name}.")

# #########
# Main loop
# #########

def main():
    # Connect for STATE_DB pubsub + reads
    db = SonicV2Connector()
    db.connect(db.STATE_DB)

    # Centralized transition reader
    module_base = ModuleBase()

    # gNOI reboot handler
    reboot_handler = GnoiRebootHandler(db, module_base)

    pubsub = _get_pubsub(db)
    state_dbid = _get_dbid_state(db)

    # Listen to keyspace notifications for CHASSIS_MODULE_TABLE keys
    topic = f"__keyspace@{state_dbid}__:CHASSIS_MODULE_TABLE|*"
    pubsub.psubscribe(topic)

    logger.log_info("gnoi-shutdown-daemon started and listening for shutdown events.")

    # Start background timeout enforcement so stuck transitions auto-clear
    enforcer = TimeoutEnforcer(db, module_base, interval_sec=5)
    enforcer.start()

    while True:
Review comment (Contributor):
Feels like this loop is large and can create issues for debugging and maintaining. Maybe some logic can be extracted out. For example, the gnoi reboot->poll reboot status can be extracted into a class and condensed into a single function. But it is up to you.

Reply (Contributor Author):
@hdwhdw Fixed

        message = pubsub.get_message()
        if message and message.get("type") == "pmessage":
            channel = message.get("channel", "")
            # Defensive: normalize to str in case the redis client returns bytes
            if isinstance(channel, (bytes, bytearray)):
                channel = channel.decode("utf-8", "ignore")
            # channel format: "__keyspace@N__:CHASSIS_MODULE_TABLE|DPU0"
            key = channel.split(":", 1)[-1] if ":" in channel else channel

            if not key.startswith("CHASSIS_MODULE_TABLE|"):
                time.sleep(1)
                continue

            # Extract module name
            try:
                dpu_name = key.split("|", 1)[1]
            except IndexError:
                time.sleep(1)
                continue

            # Read state via centralized API
            try:
                entry = module_base.get_module_state_transition(db, dpu_name) or {}
            except Exception as e:
                logger.log_error(f"Failed reading transition state for {dpu_name}: {e}")
                time.sleep(1)
                continue

            transition_type = entry.get("transition_type")
            if entry.get("state_transition_in_progress", "False") == "True" and transition_type in ("shutdown", "reboot"):
                logger.log_info(f"{transition_type} request detected for {dpu_name}. Initiating gNOI reboot.")
                reboot_handler.handle_transition(dpu_name, transition_type)

            # NOTE:
            # For shutdown transitions, the platform clears the transition flag.
            # For reboot transitions, the daemon clears it upon successful completion.
            # The TimeoutEnforcer thread clears any stuck transitions that exceed timeout.

        time.sleep(1)

if __name__ == "__main__":
    main()
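
The main loop's key parsing, its "should we act" check, and the deadline-based polling used for RebootStatus can each be expressed as small functions that run without Redis or a DPU. The sketch below is illustrative only and not part of this PR; `module_from_channel`, `needs_gnoi_halt`, and `poll_until` are hypothetical names, and the injectable `clock`/`sleep` parameters are an assumption added to make the polling loop testable.

```python
import time
from typing import Callable, Optional

TABLE_PREFIX = "CHASSIS_MODULE_TABLE|"
GNOI_TRANSITIONS = ("shutdown", "reboot")


def module_from_channel(channel) -> Optional[str]:
    """Extract the module name from '__keyspace@N__:CHASSIS_MODULE_TABLE|DPU0'."""
    if isinstance(channel, (bytes, bytearray)):
        channel = channel.decode("utf-8", "ignore")
    # Everything after the first ':' is the Redis key that changed.
    key = channel.split(":", 1)[-1]
    if not key.startswith(TABLE_PREFIX):
        return None
    return key[len(TABLE_PREFIX):] or None


def needs_gnoi_halt(entry: dict) -> bool:
    """Mirror the daemon's check: in-progress shutdown or reboot only."""
    in_progress = entry.get("state_transition_in_progress", "False") == "True"
    return in_progress and entry.get("transition_type") in GNOI_TRANSITIONS


def poll_until(check: Callable[[], bool], timeout_sec: float, interval_sec: float,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() every interval_sec until it returns True or the deadline passes."""
    deadline = clock() + timeout_sec
    while clock() < deadline:
        if check():
            return True
        sleep(interval_sec)
    return False
```

Using a monotonic clock for the deadline, as the daemon does, keeps the timeout unaffected by wall-clock adjustments on the switch.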
