Add ssd health pre-check for warm-reboot #4086
base: master
Conversation
/azp run
Azure Pipelines successfully started running 1 pipeline(s).
        debug "SSD Health is $health_value% — OK."
    else
        error "Warning: Health is $health_value% — Possible drive failure!"
        exit "${EXIT_SDD_HEALTH_FAILURE}"
From Boyang: this is a very basic, minimal check that can be done to prevent the issue.
The core idea is to collect debug information first, and then implement a solution that really prevents these issues.
# Check SSD health
if [ -x "${SSD_UTIL}" ]; then
    debug "Checking ssd health before ${REBOOT_TYPE}..."
    health_line=$(${SSD_UTIL} | grep -E "Health\s*:\s*[0-9]+\.?[0-9]*%" || true)
- What is the runtime of this new utility? Could it extend the overall warm-reboot runtime considerably?
PLATFORM=$(sonic-cfggen -H -v DEVICE_METADATA.localhost.platform)
PLATFORM_PLUGIN="${REBOOT_TYPE}_plugin"
LOG_SSD_HEALTH="/usr/local/bin/log_ssd_health"
SSD_UTIL="/usr/local/bin/ssdutil"
Should we backport these changes?
What I did
Check SSD health using ssdutil before warm-reboot
How I did it
Check the health of the SSD based on the output of ssdutil. Stop the warm-reboot early if the reported health value is 0.
How to verify it
The check is skipped if the ssdutil command returns an error.
The added lines correctly parse ssdutil output in the format "Health : X%" or "Health : X.Y%".
The check blocks fast-reboot/warm-reboot if the extracted health value is 0.
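The three behaviours listed above can be exercised in isolation with a small sketch like the one below. It is not the code from this PR: the function names `parse_ssd_health` and `ssd_health_ok` are hypothetical, the sample ssdutil output is made up for illustration, and the pattern uses the POSIX class `[[:space:]]` in place of `\s`.

```shell
#!/bin/bash
# Hypothetical helper (not from the PR): extract the numeric health value
# from ssdutil-style output passed in as a string.
# Prints the number (e.g. "97" or "97.5"), or nothing if no Health line matches.
parse_ssd_health() {
    local output="$1"
    local health_line
    # "|| true" keeps a non-match from aborting a script that runs under "set -e".
    health_line=$(printf '%s\n' "$output" \
        | grep -E "Health[[:space:]]*:[[:space:]]*[0-9]+\.?[0-9]*%" || true)
    [ -n "$health_line" ] || return 0
    # Keep only the number that precedes the percent sign (first match only).
    printf '%s\n' "$health_line" | head -n 1 \
        | sed -E 's/.*Health[[:space:]]*:[[:space:]]*([0-9]+\.?[0-9]*)%.*/\1/'
}

# Hypothetical helper: returns 0 if the reboot may proceed, 1 if health is zero.
ssd_health_ok() {
    local health_value="$1"
    # Unknown health (ssdutil failed or printed no Health line): skip the check.
    [ -n "$health_value" ] || return 0
    # Numeric comparison via awk so that "0" and "0.0" are both treated as zero.
    awk -v h="$health_value" 'BEGIN { exit (h + 0 == 0) ? 1 : 0 }'
}

# Made-up sample output, for illustration only.
sample="Device Model : ExampleSSD
Health       : 97.5%
Temperature  : 40C"

health_value=$(parse_ssd_health "$sample")
if ssd_health_ok "$health_value"; then
    echo "health=${health_value:-unknown}: reboot may proceed"
else
    echo "health=${health_value}: reboot would be blocked"
fi
```

Keeping the parse step separate from the decision makes it straightforward to feed captured ssdutil output from different platforms into the parser without touching the reboot path.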