
failed to open persistent connection to host:slurm:6819: Connection refused #331


Description


Snakemake version
v9.4.1

Describe the bug
I’m unable to keep Snakemake workflows running for extended periods (e.g., over a day). Snakemake crashes with the following error related to sacct:

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-07-07T00:00 --endtime now --name 05d9d4f4-c734-4fdf-a795-28ed0e2f6b6b
Error message: sacct: error: _open_persist_conn: failed to open persistent connection to host:slurm:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

When this happens, I can manually restart the workflow, but the issue recurs after about a day. It seems like an intermittent loss of connectivity with sacct.
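
As a quick sanity check, a probe along these lines could confirm whether the slurmdbd connection itself is being refused intermittently. This is only a rough sketch: the polling interval and log path are arbitrary, it assumes GNU date, and it simply reuses the same sacct flags shown in the error above:

#!/bin/bash
# Rough sketch: poll sacct every 60 seconds with the same flags Snakemake
# uses, and log a line whenever the query fails (e.g. slurmdbd refuses the
# connection). Interval and log path are arbitrary illustration choices.
while true; do
    start=$(date -d '1 hour ago' +%Y-%m-%dT%H:%M)
    if ! sacct -X --parsable2 --clusters all --noheader \
            --format=JobIdRaw,State \
            --starttime "${start}" --endtime now >/dev/null 2>&1; then
        echo "$(date +%Y-%m-%dT%H:%M:%S) sacct query failed" >> sacct_probe.log
    fi
    sleep 60
done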

My workflow command:

snakemake --configfile {config_path} --use-conda --profile ./slurm -j 60 --keep-going --cores 24 --latency-wait 60 --executor slurm --rerun-triggers mtime --resources load=60

Additional context

I notice this is similar to the old bug reported in issue 2411, but that issue appears to have been marked as resolved.

Following advice from our cluster admin (who mentioned that squeue is more stable), I wrote a custom slurm_status.sh script that uses squeue and scontrol instead of sacct:


#!/bin/bash

jobid="$1"

# Define the state mapping as an associative array
declare -A STATE_MAP=(
    ["BOOT_FAIL"]="failed"
    ["CANCELLED"]="failed"
    ["COMPLETED"]="success"
    ["CONFIGURING"]="running"
    ["COMPLETING"]="running"
    ["DEADLINE"]="failed"
    ["FAILED"]="failed"
    ["NODE_FAIL"]="failed"
    ["OUT_OF_MEMORY"]="failed"
    ["PENDING"]="running"
    ["PREEMPTED"]="failed"
    ["RUNNING"]="running"
    ["RESIZING"]="running"
    ["SUSPENDED"]="running"
    ["TIMEOUT"]="failed"
    ["UNKNOWN"]="running" # Assuming UNKNOWN should be treated as still active/running for monitoring
    ["REQUEUED"]="running" # Job is being requeued, so still active
    ["REQUEUE_HOLD"]="running" # Held but being requeued, still active
    ["SPECIAL_EXIT"]="failed" # Indicates a non-standard exit, usually implies failure
    ["STOPPED"]="running" # Job is stopped but resources are retained, still active
    ["SIGNALING"]="running" # Job is being signaled, still active
    ["STAGE_OUT"]="running" # Job is staging out data, still active
)

# First, check the active queue with squeue
# -h: no header
# -j "${jobid}": specify job ID
# -o "%T": output only the job state in compact form
# stderr is discarded: squeue prints "Invalid job id specified" once the job has left the queue
status=$(squeue -h -j "${jobid}" -o "%T" 2>/dev/null)

if [ -n "${status}" ]; then
    # Job is still in the active queue
    # Convert status to uppercase for consistent mapping
    status_upper=$(echo "${status}" | tr '[:lower:]' '[:upper:]')
    
    if [[ -v STATE_MAP["${status_upper}"] ]]; then
        echo "${STATE_MAP["${status_upper}"]}"
    else
        # If a state from squeue isn't in our map, treat it as unknown/running
        # This handles new states or less common transient states
        echo "running" 
    fi
else
    # If squeue is empty, the job is no longer in the active queue.
    # It has either completed, failed, or been purged.
    # Use scontrol to get the final state and exit code.
    # Redirect stderr to /dev/null to handle cases where the job is too old for scontrol.
    scontrol_output=$(scontrol show job "${jobid}" 2>/dev/null)
    
    if [ -z "${scontrol_output}" ]; then
        # Fallback: If scontrol also has no record (job is very old or purged),
        # we have to assume success as a last resort. This is an unavoidable
        # limitation without sacct, but the failure window is now much smaller.
        echo "success"
    else
        # Extract the state and exit code from scontrol output
        scontrol_state=$(echo "${scontrol_output}" | grep -oP 'JobState=\K[A-Z_]+')
        scontrol_exit_code=$(echo "${scontrol_output}" | grep -o 'ExitCode=[0-9]*:[0-9]*' | cut -d= -f2)

        if [[ -v STATE_MAP["${scontrol_state}"] ]]; then
            echo "${STATE_MAP["${scontrol_state}"]}"
        elif [ "${scontrol_exit_code}" == "0:0" ]; then
            # Exit code is 0, the job was successful.
            echo "success"
        else
            # Exit code is non-zero, the job failed.
            echo "failed"
        fi
    fi
fi
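
For reference, the script follows the old --cluster-status calling convention: it takes a single SLURM job ID as its only argument and prints exactly one of running, success, or failed. For example (the job ID is illustrative):

./slurm_status.sh 4213051
running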

However, I hit a wall when I realized --cluster-status appears to be deprecated or unsupported in v9.4.1:

snakemake: error: unrecognized arguments: --cluster --cluster-status ./slurm_status.sh
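
For completeness, the invocation I attempted looked roughly like this (the sbatch template is abbreviated and purely illustrative; the point is the --cluster and --cluster-status arguments, which match the old generic cluster syntax):

snakemake --configfile {config_path} --use-conda -j 60 --keep-going \
    --cluster "sbatch --parsable --cpus-per-task={threads} --mem={resources.mem_mb}" \
    --cluster-status ./slurm_status.sh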

Is there a recommended way in newer Snakemake versions to work around this issue?

If not:

Should I downgrade to a version that supports --cluster-status?

Any guidance would be appreciated. Thank you!
