25 changes: 25 additions & 0 deletions scripts/fast-reboot
@@ -24,6 +24,7 @@ DEVPATH="/usr/share/sonic/device"
PLATFORM=$(sonic-cfggen -H -v DEVICE_METADATA.localhost.platform)
PLATFORM_PLUGIN="${REBOOT_TYPE}_plugin"
LOG_SSD_HEALTH="/usr/local/bin/log_ssd_health"
SSD_UTIL="/usr/local/bin/ssdutil"
Contributor commented:
Should we backport these changes?

PLATFORM_FWUTIL_AU_REBOOT_HANDLE="platform_fw_au_reboot_handle"
PLATFORM_REBOOT_PRE_CHECK="platform_reboot_pre_check"
SSD_FW_UPDATE="ssd-fw-upgrade"
@@ -53,6 +54,7 @@ EXIT_TEAMD_RETRY_COUNT_FAILURE=23
EXIT_NO_MIRROR_SESSION_ACLS=24
EXIT_PFC_STORM_DETECTED=25
EXIT_LEFTOVER_CPA_TUNNEL=30
EXIT_SDD_HEALTH_FAILURE=40

function error()
{
@@ -557,6 +559,29 @@ function reboot_pre_check()

    check_db_integrity

    # Check SSD health
    if [ -x "${SSD_UTIL}" ]; then
        debug "Checking ssd health before ${REBOOT_TYPE}..."
        health_line=$(${SSD_UTIL} | grep -E "Health\s*:\s*[0-9]+\.?[0-9]*%" || true)
Contributor commented:
  1. What is the runtime of this new utility? Could it considerably extend the overall warm-reboot runtime?
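
One way to answer the runtime question is to time the call directly on a target device. The sketch below is only an illustration and is not part of this PR: `measure_ms` is a hypothetical helper, and it assumes GNU `date` with nanosecond support (`%N`).

```shell
#!/bin/bash
# Hypothetical helper (not part of fast-reboot): print how long a command
# takes to run, in milliseconds. Assumes GNU date with nanosecond support (%N).
measure_ms() {
    local start end
    start=$(date +%s%N)
    "$@" > /dev/null 2>&1 || true   # discard output and exit status; only the time matters
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
}

# Example on a device where the utility exists (path taken from the diff):
#   measure_ms /usr/local/bin/ssdutil
```

Running it several times on each supported platform would show whether the added check is noticeable relative to the overall warm-reboot time.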


        if [ -z "$health_line" ]; then
            debug "Warning: Health line not found in ${SSD_UTIL} output."
        else
            # Extract the integer part of X from "Health : X%"
            health_value=$(echo "$health_line" | sed -n 's/[^0-9]*\([0-9]\+\).*/\1/p')

            # Check that a health value was actually extracted
            if [ -z "$health_value" ]; then
                debug "Warning: Could not find health percentage."
            elif [ "$health_value" -gt 0 ]; then
                debug "SSD Health is $health_value% — OK."
            else
                error "Warning: Health is $health_value% — Possible drive failure!"
                exit "${EXIT_SDD_HEALTH_FAILURE}"
Contributor commented:
From Boyang: this is a very basic, minimal check that can be done to prevent the issue.

The core idea is to collect debug information first, and then implement a solution that actually prevents these issues.
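
The minimal check discussed here can also be exercised off-device by factoring the parsing and the decision into small functions. This is a sketch under stated assumptions, not the code in this PR: `parse_ssd_health` and `classify_ssd_health` are hypothetical names, and the input is assumed to contain a line matching the `Health : X%` pattern that the diff greps for.

```shell
#!/bin/bash
# Hypothetical helper (not part of the PR): read ssdutil-style text on stdin
# and print the integer part of the first "Health : X%" value, or nothing
# if no such line is present.
parse_ssd_health() {
    sed -n -E 's/^.*Health[[:space:]]*:[[:space:]]*([0-9]+)(\.[0-9]*)?%.*$/\1/p' | head -n 1
}

# Hypothetical helper: classify a parsed value the same way the diff does.
# Empty -> unknown, greater than zero -> ok, otherwise -> failed.
classify_ssd_health() {
    local value="$1"
    if [ -z "$value" ]; then
        echo "unknown"
    elif [ "$value" -gt 0 ]; then
        echo "ok"
    else
        echo "failed"
    fi
}

# Example with a made-up input line (not captured from a real device):
#   classify_ssd_health "$(echo 'Health : 95.5%' | parse_ssd_health)"
```

As in the diff, only the integer part is compared, so a fractional reading below 1% would be treated as a failure.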

            fi
        fi
    fi

    # Make sure /host has enough space for warm reboot temp files
    avail=$(df -k /host | tail -1 | awk '{ print $4 }')
    if [[ ${avail} -lt ${MIN_HD_SPACE_NEEDED} ]]; then