HPC jobs with clusterutil
High-performance computing (HPC) clusters are networks of powerful servers that can run code "massively parallel", i.e. split the computational work across the many CPUs available in those servers. Each server is usually called a "node" and has a certain number of CPU cores and a certain amount of RAM (memory) to work with.
An individual unit of computational work (usually a script to be executed) is called a job. A special piece of software called the "scheduler" handles the distribution of jobs across the available nodes. Clusters often also have a special node called the "head node", which hosts the scheduler software and serves as an entry point for users to log into. For that reason, it is important to never run intensive code directly on the head node, as it can potentially slow down or disrupt the entire cluster and affect the many other users on it.
To parallelize a computationally intensive task, it is usually split into smaller chunks which can then be run independently as separate jobs on the cluster. For example, aligning a FASTQ file containing 1 billion reads could take a very long time. So instead one might split that file into 1000 smaller files of 1 million reads each and run separate alignment tasks for each of those 1000 files in parallel. Each alignment "job" would be submitted to the scheduler, which enters it into a waiting queue until a slot on a worker node becomes available, where it is then executed. After all jobs have completed, one might then re-assemble the 1000 separate results into a final combined output file.
For the scheduler to know how many jobs can fit on a given node, it needs information on how many CPU cores and how much memory a given job can be expected to use. For example, the alignment jobs above may only need 1 CPU core and 1GB of RAM each, while a job that trains a large machine learning model may require 64GB of RAM. For that reason, each job submission is usually accompanied by a description of its resource requirements. If one of your jobs exceeds the requested resource allotment, the scheduler may terminate it prematurely to ensure that other jobs are not affected by it.
Many users generally share the same queue, so you may have to wait for other users' jobs to be processed on a first-come-first-served basis, although schedulers may be configured to use a more sophisticated prioritization scheme than that. For example, they may prioritize shorter jobs over longer-running ones. Therefore, in addition to CPU and memory resources, one must also requisition the required run time. As with the physical resources, if the job exceeds that time limit, it may be terminated.
- Never run resource-intensive code directly on the head node, as this may interfere with the operation of the entire cluster. Always use proper job submissions for these.
- Always requisition the proper amount of resources for your code. See below on how to determine the amount of resources you need.
- Don't over- or under-requisition resources. If you request more resources than you need, your jobs will hog CPU slots or memory that other users could have used instead. Similarly, if you under-requisition, your jobs will consume resources that were requested by others and can cause their jobs to fail.
- To figure out exactly how much you need to request, you can use the profiler tool. See the relevant section below.
- Don't spam the queue. Try not to submit more jobs than there are slots available on the cluster at a given time, as this means long wait times for other users.
- If you find yourself with an excessive number of jobs, consider consolidating them and/or choosing larger chunk sizes.
Clusterutil is a set of tools that provides a unified user interface for job submission across different HPC scheduler systems. When running jobs on our Galen or DC clusters, which run the Slurm and PBS-Torque schedulers, respectively, clusterutil means you only have to remember a single set of commands instead of two. It also provides an abstraction layer for pipelines such as Pacybara.
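To illustrate the difference, a one-core, 2GB, 2h job might be submitted roughly as follows on the two native schedulers versus via clusterutil. The native flags and the placeholder script name myExampleScript.sh below are generic illustrations for comparison, not commands from this guide:
#native Slurm submission (Galen) -- generic example
sbatch --job-name=exampleJob --cpus-per-task=1 --mem=2G --time=2:00:00 myExampleScript.sh
#native PBS-Torque submission (DC) -- generic example
qsub -N exampleJob -l nodes=1:ppn=1,mem=2gb,walltime=2:00:00 myExampleScript.sh
#with clusterutil, one command works on both clusters (explained below)
submitjob.sh -n exampleJob -c 1 -m 2G -t 2:00:00 -- myExampleScript.sh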
If you have already followed the new user setup guide for Galen, you will have installed clusterutil already. Otherwise, to install clusterutil in your Galen or DC home directory:
#go to a directory to which you'd like to download the clusterutil repository, e.g. "Downloads"
# (if you don't have such a directory yet, create it first)
mkdir -p Downloads
cd Downloads/
#clone the clusterutil repo
git clone https://github.com/jweile/clusterUtil.git
#change to repo folder
cd clusterUtil
#run the installer
./install.sh
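Once the installer has finished, you can check that the clusterutil scripts are available (this assumes install.sh puts them on your PATH; you may need to open a new shell or log in again first):
#verify that the submitjob script is found and prints its help text
submitjob.sh --help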
To submit a job with clusterutil you can use the submitjob.sh script. For example, to submit a job that executes the script myExampleScript.R under the name "exampleJob", requisitioning 1 CPU core, 2GB of RAM and a 2h runtime, with output to be written to a file called output.log and error messages to be written to errors.log:
submitjob.sh -n exampleJob -c 1 -m 2G -t 2:00:00 -l output.log -e errors.log -- myExampleScript.R
You can also request a specific queue. For example, on Galen we have a queue called guru, which gives Roth Lab members priority access to nodes that were funded directly by our lab:
submitjob.sh -n exampleJob -c 1 -m 2G -t 2:00:00 -l output.log -e errors.log -q guru -- myExampleScript.R
Or if your job requires a particular conda environment, you can have clusterutil activate that environment for your job:
submitjob.sh -n exampleJob -c 1 -m 2G -t 2:00:00 -l output.log -e errors.log --conda pacybara -- myExampleScript.R
To get more information on the full syntax, use submitjob.sh --help:
submitjob.sh v0.0.1
by Jochen Weile <[email protected]> 2021
Submits a new slurm job
Usage: submitjob.sh [-n|--name <JOBNAME>] [-t|--time <WALLTIME>]
[-c|cpus <NUMCPUS>] [-m|--mem <MEMORY>] [-l|--log <LOGFILE>]
[-e|--err <ERROR_LOGFILE>] [--conda <ENV>] [--] <CMD>
-n|--name : Job name. Defaults to jweile_<TIMESTAMP>_<RANDOMSTRING>
-t|--time : Maximum (wall-)runtime for this job in format HH:MM:SS.
Defaults to 01:00:00
-c|--cpus : Number of CPUs required for this job. Defaults to 1
-m|--mem : Maximum amount of RAM consumed by this job. Defaults to 1G
-l|--log : Output file, to which STDOUT will be written. Defaults to
<JOBNAME>.out in log directory /home/rothlab/jweile/slurmlogs.
-e|--err : Error file, to which STDERR will be written. Defaults to
<JOBNAME>.err in log directory /home/rothlab/jweile/slurmlogs.
-b|--blacklist : Comma-separated black-list of nodes not to use. If none
is provided, all nodes are allowed.
-q|--queue : Which queue to use. Defaults to default queue
--conda : activate given conda environment for job
--skipValidation : skip conda environment activation (faster submission
but will lead to failed jobs if environment isn't valid)
-- : Indicates end of options, indicating that all following
arguments are part of the job command
<CMD> : The command to execute
IMPORTANT NOTE: The first occurrence of a positional parameter will be
interpreted as the beginning of the job command, such that all options
(starting with "--") will be considered part of the command, rather than
as options for submitjob.sh itself.
The script will output the job ID of the newly submitted job.
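Since the job ID is printed to STDOUT, it can be captured from within a script. A minimal sketch, assuming (as in the pipeline example further below) that the job ID is the last whitespace-separated word of the output:
#submit a job and capture the submission message
RETVAL=$(submitjob.sh -n exampleJob -c 1 -m 2G -t 2:00:00 -- myExampleScript.R)
#extract the job ID (the last word of the message)
JOBID=${RETVAL##* }
echo "Submitted job $JOBID"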
You can see the status of your jobs via qstat (or squeue in the case of Slurm/Galen). By default all jobs of all users will be shown, but you can restrict the output to your own using -u $USER (where $USER is your username):
qstat -u $USER
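On Galen (Slurm), the equivalent command is:
squeue -u $USER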
The output will show a table listing the job ID, name and state for all of your jobs (as well as other information such as runtime or which node each is running on). States may be 'PD' or 'W' for pending or waiting (depending on the system) or 'R' for running. On Galen, job submissions sometimes fail, in which case their state shows as "launch failed requeued held". To deal with these problems automatically you can use waitForJobs.sh.
As the name suggests, waitForJobs.sh waits for all of your jobs that are currently in the queue to be completed. It also monitors their health, so if they go into an error state such as "launch failed requeued held", it will automatically release or restart them as needed. As this process will run for as long as you still have jobs in the queue, it is best practice to submit waitForJobs.sh as a job itself:
submitjob.sh -n jobmonitor -c 1 -m 1G -t 48:00:00 -- waitForJobs.sh
The fact that this script runs until all your jobs are completed can also be used as a convenient synchronization mechanism within scripts. For example, say we want to split a FASTQ file into small chunks, run alignments on those chunks in parallel and then combine the results again. We have to wait until all the alignments have completed before we can combine their results:
#Let's assume we have an input FASTQ file and a BWA library we want to align it to
INFASTQ=Undetermined_S000_R1.fastq.gz
LIBRARY=/path/to/bwa/library
#create a directory for all our fastq chunk files
mkdir -p chunkdir/
#derive a file name prefix for the chunks from the input file name
OUTPREFIX=$(basename ${INFASTQ%.fastq.gz})
#decompress fastq file and split it into "small" chunks of 200K lines each
#(200K is divisible by 4, so splits only occur after a multiple of 4 lines, which each represent one read.)
gzip -cd "$INFASTQ"|split --lines=200000 --suffix-length=3 --additional-suffix=.fastq - "chunkdir/$OUTPREFIX"
#make a list of all the chunk files we just generated
CHUNKS=$(ls chunkdir/"${OUTPREFIX}"*.fastq)
#start a list to collect job names in
JOBS=""
#iterate over chunks
for CHUNK in $CHUNKS; do
#derive an output file name for our chunk
ALNFILE="${CHUNK%.fastq}.sam"
#submit alignment job
RETVAL=$(submitjob.sh --cpus 8 --mem 16GB --time 1:00:00 -- bwa mem -t 8 "$LIBRARY" "$CHUNK" -o "$ALNFILE")
#capture job id
JOBID=${RETVAL##* }
#and append it to our list of job ids
JOBS="${JOBS:+$JOBS,}$JOBID"
done
#wait for list of jobs to complete (and monitor their health)
waitForJobs.sh --verbose $JOBS
#after all jobs are done, we collect all the results
OUTCHUNKS=$(ls chunkdir/"${OUTPREFIX}"*.sam)
#and merge them into a single output file
samtools merge "${OUTPREFIX}.bam" $OUTCHUNKS
#and delete our chunks, since we don't need them anymore
rm -r chunkdir
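If splitting produces far more chunks than there are free slots on the cluster (see the best-practice notes above), you can consolidate several chunks into a single job. Below is a hypothetical variation of the submission loop above that writes a small wrapper script per batch of 10 chunks and submits that instead; the batch size and resource values are illustrative only:
#collect the chunks into a bash array and process them in batches of 10
CHUNKARRAY=( chunkdir/"${OUTPREFIX}"*.fastq )
BATCHSIZE=10
JOBS=""
for ((i=0; i<${#CHUNKARRAY[@]}; i+=BATCHSIZE)); do
  #write a wrapper script that aligns each chunk in this batch sequentially
  BATCHSCRIPT="chunkdir/batch_${i}.sh"
  echo '#!/bin/bash' > "$BATCHSCRIPT"
  for CHUNK in "${CHUNKARRAY[@]:i:BATCHSIZE}"; do
    echo "bwa mem -t 8 '$LIBRARY' '$CHUNK' -o '${CHUNK%.fastq}.sam'" >> "$BATCHSCRIPT"
  done
  chmod +x "$BATCHSCRIPT"
  #submit one job per batch and record its job ID
  RETVAL=$(submitjob.sh -n "batch$i" -c 8 -m 16G -t 4:00:00 -- "$BATCHSCRIPT")
  JOBS="${JOBS:+$JOBS,}${RETVAL##* }"
done
#wait for the batch jobs as before
waitForJobs.sh --verbose $JOBS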
If you accidentally submitted a whole bunch of jobs (or submitted them with the wrong parameters, etc.), you can kill all of your jobs en masse using the command deleteallmyjobs.
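For reference, on a Slurm system such as Galen the same effect can be achieved with the native scancel command (a general Slurm command, not part of clusterutil):
#cancel all jobs belonging to the current user
scancel -u $USER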
To determine the time, CPU and memory requirements for a job, you can use the profiler utility that ships with clusterutil. You will want to run the profiler in an unrestricted environment (that is, not Galen). We recommend connecting to rothseq1 (ssh rothseq1.mshri.on.ca) or devhouse (ssh devhouse.mshri.on.ca) instead. There, simply prepend profiler.sh to the command you'd like to examine. For example:
profiler.sh bwa mem -t 8 "$LIBRARY" "$CHUNK" -o "$ALNFILE"
By default, the profiler writes its output to a file in the ~/slurmlogs/ folder whose name starts with profile_ followed by your username and a timestamp. If you prefer, you can also tell the profiler directly where you'd like it to write the output via the -l option, e.g.
profiler.sh -l myProfile.tsv -- bwa mem -t 8 "$LIBRARY" "$CHUNK" -o "$ALNFILE"
The output is a tab-separated table (.tsv) file with the following columns:
Time CPU(%) ResidentMemory(KB) VirtualMemory(KB) Threads
Note the peak CPU percentage, the peak resident memory, and the total amount of time that the process consumed during its execution (see the sketch after the list below for one way to extract these). Divide the CPU percentage by 100 and round up to the next integer; similarly, round up the resident memory usage to the next gigabyte and the runtime to the next hour. If your process uses resources in a static way (i.e. independent of its inputs), then you can use these values for your resource allocations. For example:
- Peak CPU usage 289% => reserve 3 cores
- Peak Memory 3.2GB => reserve 4GB
- Runtime: 41min => reserve 1h.
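One way to pull these peak values out of the profiler table is a short awk one-liner. A minimal sketch, assuming the first row is the header shown above and that the resident memory column is reported in KB (the Time column is simply echoed as-is):
#report peak CPU%, peak resident memory (in GB) and the last recorded time
awk -F'\t' 'NR>1 { if ($2+0 > cpu) cpu=$2; if ($3+0 > rss) rss=$3; t=$1 }
  END { printf "peak CPU: %s%%\npeak RSS: %.2f GB\nlast time: %s\n", cpu, rss/1048576, t }' myProfile.tsv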
If your process uses resources dynamically relative to input size, repeat the profiling with a number of differently sized inputs until you can identify a pattern (e.g. something like 2GB baseline RAM + 1.2GB RAM per 1GB of input file size).
To help you with this, you can also use the profile visualizer. For example, if your profiler output is in the file myProfile.tsv, you can run
profiler_vis.sh myProfile.tsv
This will create a PDF file with line plots for each resource track. For example:
The figure above shows the profiler visualization for a variant caller job from tileseq_mut. CPU and memory spike early on (during Q-score calibration) to almost 700% CPU (i.e. 7 cores) and 26GB RAM. Both then drop to a lower level before slowly increasing again to 7 cores and 18GB RAM. The overall runtime was just over 4 hours. If this were a static job, one might want to request 7-8 cores, ~28GB RAM, and 5-6h of runtime. However, the job's memory usage actually depends on the size of the input file. In this case it can be approximated as
On Galen, you cannot log into a worker node directly. However, we can ask the scheduler to start an interactive job that runs a shell and connects it to your terminal, which from our perspective is mostly indistinguishable from a real login. Clusterutil offers a script for this: slogin. Due to a quirk with Galen, this can sometimes fail and may need to be attempted a couple of times before it actually succeeds. You can also request a specific node that you would like to "log in" to, e.g.:
slogin galen23
Important: using slogin only requests a single CPU core and 1GB of RAM for your interactive session, so it is not recommended for intensive use.
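For reference, on a Slurm system an interactive shell can also be requested natively with srun (a general Slurm command; slogin presumably wraps something along these lines):
#request a 1-core, 1GB interactive shell on a worker node
srun --cpus-per-task=1 --mem=1G --pty bash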
On the DC cluster, this is easier, as DC doesn't really have an accessible head node per se. DC has 8 worker nodes, dc01, dc02, ..., dc08, all of which you can log into directly (e.g. ssh dc04.ccbr.utoronto.ca).
To see information about the nodes available on the system, you can use listnodes. This will produce a (system-dependent) output listing the available CPU cores, memory and states of all the nodes on the system. You can use this to calculate the total number of accessible CPU cores and thus how many parallel jobs can be considered reasonable. Note that some nodes might be listed multiple times if they are part of multiple queues ("partitions"). There is a total limit of 1000 job submissions at any given time.
Here are the numbers of CPU cores, memory and memory per core for the Galen cluster as of April 2023:
CPU cores per node

| #cores | count |
|---|---|
| 12 | 27 |
| 16 | 1 |
| 24 | 10 |
| 32 | 11 |
| 40 | 12 |
| 48 | 2 |

Total memory per node

| RAM (GB) | count |
|---|---|
| 24 | 15 |
| 56 | 22 |
| 64 | 10 |
| 128 | 6 |
| 256 | 2 |
| 386 | 8 |

Memory per core

| RAM (GB) | count |
|---|---|
| 2 | 10 |
| 2 | 15 |
| 2.4 | 10 |
| 2.6 | 2 |
| 3.2 | 4 |
| 4.7 | 12 |
| 7.5 | 1 |
| 9.6 | 8 |