
failed to open persistent connection to host:slurm:6819: Connection refused #331


Description


Snakemake version
v9.4.1

Describe the bug
I’m unable to keep Snakemake workflows running for extended periods (e.g., over a day). Snakemake crashes with the following error related to sacct:

The job status query failed with command: sacct -X --parsable2 --clusters all --noheader --format=JobIdRaw,State --starttime 2025-07-07T00:00 --endtime now --name 05d9d4f4-c734-4fdf-a795-28ed0e2f6b6b
Error message: sacct: error: _open_persist_conn: failed to open persistent connection to host:slurm:6819: Connection refused
sacct: error: Sending PersistInit msg: Connection refused
sacct: error: Problem talking to the database: Connection refused

When this happens, I can manually restart the workflow, but the issue recurs after about a day. It seems like an intermittent loss of connectivity with sacct.
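
As a quick sanity check, a probe along these lines could confirm whether the slurmdbd connection itself is being refused intermittently. This is only a rough sketch: the polling interval and log path are arbitrary, it assumes GNU date, and it simply reuses the same sacct flags shown in the error above:

#!/bin/bash
# Rough sketch: poll sacct every 60 seconds with the same flags Snakemake
# uses, and log a line whenever the query fails (e.g. slurmdbd refuses the
# connection). Interval and log path are arbitrary illustration choices.
while true; do
    start=$(date -d '1 hour ago' +%Y-%m-%dT%H:%M)
    if ! sacct -X --parsable2 --clusters all --noheader \
            --format=JobIdRaw,State \
            --starttime "${start}" --endtime now >/dev/null 2>&1; then
        echo "$(date +%Y-%m-%dT%H:%M:%S) sacct query failed" >> sacct_probe.log
    fi
    sleep 60
done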

My workflow command:

snakemake --configfile {config_path} --use-conda --profile ./slurm -j 60 --keep-going --cores 24 --latency-wait 60 --executor slurm --rerun-triggers mtime --resources load=60

Additional context

I notice this is similar to the old bug reported in issue 2411, but that issue appears to have been marked as resolved.

Following advice from our cluster admin (who mentioned that squeue is more stable), I wrote a custom slurm_status.sh script that uses squeue and scontrol instead of sacct:


#!/bin/bash

jobid="$1"

# Define the state mapping as an associative array
declare -A STATE_MAP=(
    ["BOOT_FAIL"]="failed"
    ["CANCELLED"]="failed"
    ["COMPLETED"]="success"
    ["CONFIGURING"]="running"
    ["COMPLETING"]="running"
    ["DEADLINE"]="failed"
    ["FAILED"]="failed"
    ["NODE_FAIL"]="failed"
    ["OUT_OF_MEMORY"]="failed"
    ["PENDING"]="running"
    ["PREEMPTED"]="failed"
    ["RUNNING"]="running"
    ["RESIZING"]="running"
    ["SUSPENDED"]="running"
    ["TIMEOUT"]="failed"
    ["UNKNOWN"]="running" # Assuming UNKNOWN should be treated as still active/running for monitoring
    ["REQUEUED"]="running" # Job is being requeued, so still active
    ["REQUEUE_HOLD"]="running" # Held but being requeued, still active
    ["SPECIAL_EXIT"]="failed" # Indicates a non-standard exit, usually implies failure
    ["STOPPED"]="running" # Job is stopped but resources are retained, still active
    ["SIGNALING"]="running" # Job is being signaled, still active
    ["STAGE_OUT"]="running" # Job is staging out data, still active
)

# First, check the active queue with squeue
# -h: no header
# -j "${jobid}": specify job ID
# -o "%T": output only the job state in compact form
# stderr is discarded: squeue prints "Invalid job id specified" once the job has left the queue
status=$(squeue -h -j "${jobid}" -o "%T" 2>/dev/null)

if [ -n "${status}" ]; then
    # Job is still in the active queue
    # Convert status to uppercase for consistent mapping
    status_upper=$(echo "${status}" | tr '[:lower:]' '[:upper:]')
    
    if [[ -v STATE_MAP["${status_upper}"] ]]; then
        echo "${STATE_MAP["${status_upper}"]}"
    else
        # If a state from squeue isn't in our map, treat it as unknown/running
        # This handles new states or less common transient states
        echo "running" 
    fi
else
    # If squeue is empty, the job is no longer in the active queue.
    # It has either completed, failed, or been purged.
    # Use scontrol to get the final state and exit code.
    # Redirect stderr to /dev/null to handle cases where the job is too old for scontrol.
    scontrol_output=$(scontrol show job "${jobid}" 2>/dev/null)
    
    if [ -z "${scontrol_output}" ]; then
        # Fallback: If scontrol also has no record (job is very old or purged),
        # we have to assume success as a last resort. This is an unavoidable
        # limitation without sacct, but the failure window is now much smaller.
        echo "success"
    else
        # Extract the state and exit code from scontrol output
        scontrol_state=$(echo "${scontrol_output}" | grep -oP 'JobState=\K[A-Z_]+')
        scontrol_exit_code=$(echo "${scontrol_output}" | grep -o 'ExitCode=[0-9]*:[0-9]*' | cut -d= -f2)

        if [[ -v STATE_MAP["${scontrol_state}"] ]]; then
            echo "${STATE_MAP["${scontrol_state}"]}"
        elif [ "${scontrol_exit_code}" == "0:0" ]; then
            # Exit code is 0, the job was successful.
            echo "success"
        else
            # Exit code is non-zero, the job failed.
            echo "failed"
        fi
    fi
fi
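
For reference, the script follows the old --cluster-status calling convention: it takes a single SLURM job ID as its only argument and prints exactly one of running, success, or failed. For example (the job ID is illustrative):

./slurm_status.sh 4213051
running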

However, I hit a wall when I realized --cluster-status appears to be deprecated or unsupported in v9.4.1:

snakemake: error: unrecognized arguments: --cluster --cluster-status ./slurm_status.sh
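
For completeness, the invocation I attempted looked roughly like this (the sbatch template is abbreviated and purely illustrative; the point is the --cluster and --cluster-status arguments, which match the old generic cluster syntax):

snakemake --configfile {config_path} --use-conda -j 60 --keep-going \
    --cluster "sbatch --parsable --cpus-per-task={threads} --mem={resources.mem_mb}" \
    --cluster-status ./slurm_status.sh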

Is there a recommended way in newer Snakemake versions to work around this issue?

If not:

Should I downgrade to a version that supports --cluster-status?

Any guidance would be appreciated. Thank you!
