gboudreau commented Aug 11, 2025

The exit status of smartctl is a bitmask, and bit 6 (value 64) is set when the device's SMART error log contains entries. An exit code of 64 therefore does not mean the disk is about to fail; it only indicates that errors occurred in the past and were logged in the drive's SMART data.

This patch masks out bit 6 when checking smartctl's exit code during device detection, so the collector can keep working with such a drive.

Note: Masking with 0xBF (AKA 0b10111111, AKA 191) ignores bit 6. That trick comes from here.


Tested with:

docker build -f docker/Dockerfile.collector . -t scrutiny-dev && \
docker run --rm --name=scrutiny-dev \
  -e COLLECTOR_API_ENDPOINT=http://192.168.155.88:8087 \
  -e COLLECTOR_COMMANDS_METRICS_SCAN_ARGS='-xv 188,raw16 --scan --json -n standby' \
  -e COLLECTOR_COMMANDS_METRICS_INFO_ARGS='-xv 188,raw16 --info --json -n standby' \
  -e COLLECTOR_COMMANDS_METRICS_SMART_ARGS='-xv 188,raw16 --xall --json -n standby' \
  --cap-add SYS_RAWIO --mount type=tmpfs,destination=/tmp \
  --device /dev/sda \
  --device /dev/sdg \
  --device /dev/sdd \
  -it scrutiny-dev
docker exec -it scrutiny-dev bash -c '. /env.sh; /opt/scrutiny/bin/scrutiny-collector-metrics run'

Result with this fix:

2025/08/11 23:23:25 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                                dev-0.8.1

INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: smartctl -xv 188,raw16 --scan --json -n standby  type=metrics
INFO[0000] Executing command: smartctl -xv 188,raw16 --info --json -n standby /dev/sda  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl -xv 188,raw16 --info --json -n standby /dev/sdd  type=metrics
WARN[0002] Successfully retrieved device information for sdd, but received exit code 64, which is a non-fatal exit code. Continuing.  type=metrics
INFO[0002] Generating WWN                                type=metrics
INFO[0002] Executing command: smartctl -xv 188,raw16 --info --json -n standby /dev/sdg  type=metrics
WARN[0002] Successfully retrieved device information for sdg, but received exit code 64, which is a non-fatal exit code. Continuing.  type=metrics
INFO[0002] Generating WWN                                type=metrics
INFO[0002] Sending detected devices to API, for filtering & validation  type=metrics
INFO[0002] Collecting smartctl results for sda           type=metrics
INFO[0002] Executing command: smartctl -xv 188,raw16 --xall --json -n standby --device sat /dev/sda  type=metrics
INFO[0003] Publishing smartctl results for 0x5000cca295db7b43  type=metrics
INFO[0004] Collecting smartctl results for sdd           type=metrics
INFO[0004] Executing command: smartctl -xv 188,raw16 --xall --json -n standby --device sat /dev/sdd  type=metrics
ERRO[0006] smartctl returned an error code (64) while processing sdd  type=metrics
ERRO[0006] smartctl detected a error log with errors     type=metrics
INFO[0006] Publishing smartctl results for 0x5000039fe3d56526  type=metrics
INFO[0007] Collecting smartctl results for sdg           type=metrics
INFO[0007] Executing command: smartctl -xv 188,raw16 --xall --json -n standby --device sat /dev/sdg  type=metrics
ERRO[0007] smartctl returned an error code (64) while processing sdg  type=metrics
ERRO[0007] smartctl detected a error log with errors     type=metrics
INFO[0007] Publishing smartctl results for 0x5000cca261c4e988  type=metrics
INFO[0008] Main: Completed                               type=metrics

Without this patch, the collector was NOT sending anything to the scrutiny frontend for the affected drives, so they appeared red because no data had been received for more than a month. Here is the log from the same test with the original code:

2025/08/11 23:28:08 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                                dev-0.8.1

INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: smartctl -xv 188,raw16 --scan --json -n standby  type=metrics
INFO[0000] Executing command: smartctl -xv 188,raw16 --info --json -n standby /dev/sda  type=metrics
INFO[0000] Generating WWN                                type=metrics
INFO[0000] Executing command: smartctl -xv 188,raw16 --info --json -n standby /dev/sdd  type=metrics
ERRO[0002] Could not retrieve device information for sdd: exit status 64  type=metrics
INFO[0002] Executing command: smartctl -xv 188,raw16 --info --json -n standby /dev/sdg  type=metrics
ERRO[0002] Could not retrieve device information for sdg: exit status 64  type=metrics
INFO[0002] Sending detected devices to API, for filtering & validation  type=metrics
INFO[0002] Collecting smartctl results for sda           type=metrics
INFO[0002] Executing command: smartctl -xv 188,raw16 --xall --json -n standby --device sat /dev/sda  type=metrics
INFO[0003] Publishing smartctl results for 0x5000cca295db7b43  type=metrics
INFO[0004] Main: Completed                               type=metrics

Owner

AnalogJ commented Oct 19, 2025

Can we re-use func (c *BaseCollector) LogSmartctlExitCode(exitCode int) in some way?
