Description
Snakemake version
v9.4.1
Describe the bug
I’m unable to keep Snakemake workflows running for extended periods (e.g., over a day). Snakemake crashes with the following error related to sacct:
The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-07-07T00:00 --endtime now --name 05d9d4f4-c734-4fdf-a795-28ed0e2f6b6b
Error message: sacct: error: _open_persist_conn: failed to open persistent connection to host:slurm:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused
When this happens, I can manually restart the workflow, but the issue recurs after about a day. It looks like an intermittent loss of connectivity between sacct and the Slurm accounting database daemon (slurmdbd).
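One crude mitigation would be to retry the sacct query a few times before treating it as fatal, so that a brief slurmdbd outage does not take down the whole run. A minimal sketch (the wrapper name, retry count, and sleep interval are arbitrary choices of mine, and it would only help if it shadowed the real sacct on PATH, which is admittedly hacky):

#!/bin/bash
# sacct_retry.sh -- hypothetical wrapper that retries sacct on transient
# failures instead of failing immediately. Retry count and sleep interval
# are arbitrary, untuned values.
max_tries=5
for ((i = 1; i <= max_tries; i++)); do
    # Forward all arguments unchanged; silence stderr so transient
    # "Connection refused" noise does not reach the caller's log.
    if output=$(sacct "$@" 2>/dev/null); then
        echo "${output}"
        exit 0
    fi
    sleep 30  # give slurmdbd time to come back before the next attempt
done
echo "sacct failed after ${max_tries} attempts" >&2
exit 1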
My workflow command:
snakemake --configfile {config_path} --use-conda --profile ./slurm -j 60 --keep-going --cores 24 --latency-wait 60 --executor slurm --rerun-triggers mtime --resources load=60
Additional context
I notice this is similar to the old bug in #2411, but that issue seems to have been marked resolved.
Following advice from our cluster admin (who mentioned squeue is more stable), I wrote a custom slurm_status.sh script using squeue and scontrol to replace sacct:
#!/bin/bash
jobid="$1"

# Define the state mapping as an associative array
declare -A STATE_MAP=(
    ["BOOT_FAIL"]="failed"
    ["CANCELLED"]="failed"
    ["COMPLETED"]="success"
    ["CONFIGURING"]="running"
    ["COMPLETING"]="running"
    ["DEADLINE"]="failed"
    ["FAILED"]="failed"
    ["NODE_FAIL"]="failed"
    ["OUT_OF_MEMORY"]="failed"
    ["PENDING"]="running"
    ["PREEMPTED"]="failed"
    ["RUNNING"]="running"
    ["RESIZING"]="running"
    ["SUSPENDED"]="running"
    ["TIMEOUT"]="failed"
    ["UNKNOWN"]="running"       # Assume UNKNOWN means still active for monitoring purposes
    ["REQUEUED"]="running"      # Job is being requeued, so still active
    ["REQUEUE_HOLD"]="running"  # Held but being requeued, still active
    ["SPECIAL_EXIT"]="failed"   # Non-standard exit, usually implies failure
    ["STOPPED"]="running"       # Job is stopped but resources are retained, still active
    ["SIGNALING"]="running"     # Job is being signaled, still active
    ["STAGE_OUT"]="running"     # Job is staging out data, still active
)

# First, check the active queue with squeue.
#   -h            : no header
#   -j "${jobid}" : specify job ID
#   -o "%T"       : output only the job state
# stderr is silenced because squeue complains about job IDs that have
# already left the queue, which would pollute the Snakemake log.
status=$(squeue -h -j "${jobid}" -o "%T" 2>/dev/null)

if [ -n "${status}" ]; then
    # Job is still in the active queue.
    # Convert the state to uppercase for consistent mapping.
    status_upper=$(echo "${status}" | tr '[:lower:]' '[:upper:]')
    if [[ -v STATE_MAP["${status_upper}"] ]]; then
        echo "${STATE_MAP["${status_upper}"]}"
    else
        # A squeue state that isn't in our map is treated as unknown/running.
        # This handles new or less common transient states.
        echo "running"
    fi
else
    # squeue returned nothing, so the job is no longer in the active queue:
    # it has either completed, failed, or been purged.
    # Use scontrol to get the final state and exit code.
    # Redirect stderr to /dev/null to handle jobs that are too old for scontrol.
    scontrol_output=$(scontrol show job "${jobid}" 2>/dev/null)
    if [ -z "${scontrol_output}" ]; then
        # Fallback: scontrol has no record either (the job is very old or purged),
        # so we have to assume success as a last resort. This is an unavoidable
        # limitation without sacct, but the failure window is now much smaller.
        echo "success"
    else
        # Extract the state and exit code from the scontrol output.
        scontrol_state=$(echo "${scontrol_output}" | grep -oP 'JobState=\K[A-Z_]+')
        scontrol_exit_code=$(echo "${scontrol_output}" | grep -o 'ExitCode=[0-9]*:[0-9]*' | cut -d= -f2)
        if [[ -v STATE_MAP["${scontrol_state}"] ]]; then
            echo "${STATE_MAP["${scontrol_state}"]}"
        elif [ "${scontrol_exit_code}" == "0:0" ]; then
            # Exit code 0:0 means the job was successful.
            echo "success"
        else
            # Non-zero exit code means the job failed.
            echo "failed"
        fi
    fi
fi
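For a quick sanity check outside of Snakemake (the job ID below is hypothetical):

chmod +x slurm_status.sh
./slurm_status.sh 1234567   # prints one of: running, success, failed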
However, I hit a wall when I realized that --cluster-status is no longer recognized in v9.4.1:
snakemake: error: unrecognized arguments: --cluster --cluster-status ./slurm_status.sh
Is there a recommended way to work around this issue in newer Snakemake versions?
If not, should I downgrade to a version that still supports --cluster-status (roughly the invocation sketched below)?
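For reference, this is roughly the pre-v8 invocation I have in mind; the sbatch resource flags are placeholders for whatever a profile would normally supply, and --parsable is there because, as far as I understand, the status script receives whatever the submission command prints, so sbatch needs to emit a bare job ID:

# Hypothetical pre-v8 (e.g., 7.x) invocation; sbatch options are illustrative.
snakemake \
    --cluster "sbatch --parsable --cpus-per-task={threads} --mem={resources.mem_mb}" \
    --cluster-status ./slurm_status.sh \
    --jobs 60 --keep-going --latency-wait 60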
Any guidance would be appreciated, thank you!