Submitting Jobs - Batch Scripts

The SLURM sbatch command allows automatic and persistent execution of commands. The list of commands sbatch performs are defined in a job batch (or submission) script, a BaSH shell script with some specialized cluster environment variables and commands. BaSH itself is a general purpose interpreted language which can perform almost any algorithm, but it is usually used to execute more efficient compiled programs designed to run on a cluster. Although many users' scripts are more complicated, typical users' batch scripts perform 3 tasks:

Initialize resources
1. Copy and move files to scratch
2. Run preprocessing
3. Calculate input variables
Execute task
1. Specialized program execution
2. Data collection (this is usually done in parallel with execution)
Finalize resources
1. Run postprocessing
2. Move files to job directory
3. Remove files

Overview

Batch Submission
Example Scripts
Execution and Data Models
- Serial
- GPGPU (CUDA and OpenCL)
- SMP
- Scratch Space
- MPI
- DGAS
Software
Modules
Helpful Scheduler Commands
- squeue
- sacct
- scancel
- scontrol
- srun

Batch Submission

To simplify things, a number of template batch scripts have been created, in Examples Scripts section below, that most users can fill in required commands and resources. In addition to the steps defined above, there is a specialized set of comments, prefixed with #SBATCH, that act as input to the sbatch command options, as shown in the following example:

[username@log001 ~]# cat example.sh #This is our batch script.
#!/bin/bash
#SBATCH --cpus-per-task 2 --nodes 2 

srun hostname
sleep 120
exit 0
[username@log01 ~]# sbatch example.sh --partition computeq #Note that ordering matters here!
sbatch: error: Batch job submission failed: No partition specified or system default partition
[username@master1 ~]# sbatch --partition=acomputeq example.sh
Submitted batch job 114499
[username@log01 ~]# squeue -j 114499 #We can use squeue to see that our script is running.
 JOBID PARTITION     NAME     USER ST TIME NODES NODELIST(REASON)
114499  acomputeq example. username  R 0:06     2       c[077-078]
[username@log001 ~]# cat slurm-114499.out #This is the output of our batch script.
c077
c078

sbatch Options

sbatch options can be viewed by running man sbatch in the cluster terminal. Some of the more common options/switches in sbatch:

Long Form	Short Form	Description
--ntasks N	-n N	Number of cores, N, required by job, assuming C is default (1 CPU per task). The scheduler will distribute all tasks on 1 to N nodes if --nodes is undefined. Up to 1000 can be defined.
--cpus-per-task C	-c C	Number of CPUs per task, C. These are always shared memory CPUs on one node. Up to 40 can be defined, and 1 CPU per task is the default.
--nodes minM --nodes minM-maxM	-N minM -N minM-maxM	Number of nodes, from minM to maxM, to execute tasks on. Three combinations are usually used: "--nodes M --ncpus-per-task C": Execute job on M nodes with C cores per node "--nodes minM-maxM --ntasks N": Execute job with N cores on at least minM nodes up to at most maxM nodes "--nodes minM --ntasks N": Execute job on at least minM nodes with N cores
--partition P	-p P	Execute script on a node in partition P. We have three partitions: "--partition acomputeq": General purpose jobs with up to 192 CPUs per node, 16 nodes, and 768 GB of memory per node. "--partition awholeq": Whole node jobs requiring all 192 CPUs per node, 8 nodes, and 768 GB of memory per node. This partition has a maximum time-limit of 7 days. This partition is also great for testing a job that you suspect might crash a node. "--partition abigmemq": General purpose jobs with up to 192 CPUs per node, 4 nodes, and 1536 GB of memory per node. "--partition agpuq": GPU jobs that can utilize up to 2 GPUs per node, 80 gigabytes of memory per GPU, 192 CPUs per node, 4 nodes, and 768 GB of host memory per node. Please use --gres option below!
--gres=gpu:G		Used in combination with gpuq partition. Use G GPUs per node. Can be 1 or 2.
--mem-per-cpu E		Use E megabytes of memory per cpu. Default is 4096 megabytes. Can include units K (kilobytes), M (megabytes), G (gigabytes), and T (terabytes). For example "--mem-per-cpu 4G" would use 4 gigabytes of memory per CPU.
--array A	--a A	Execute the script many different times with A indexing. SLURM_ARRAY_TASK_ID environment variable provides access to one index. Indexing can be as follows: "--array 1,2,3,4,7-10": Job executes with SLURM_ARRAY_TASK_ID's 1, 2, 3, 4, and 7 through 10. Indexes can be modified individually with "," (comma character) or as a range with "-" (dash or hyphen character). "--array 1,2,3,4,7-10%4": Job executes with SLURM_ARRAY_TASK_ID's 1, 2, 3, 4, and 7 through 10, but the "%" (the modulus or percent character) modifier allows only 4 jobs to execute at a time. "--array 1,2,3,4,7-10:2": Job executes with SLURM_ARRAY_TASK_ID's 1, 2, 3, 4, 7, and 9. Specifically, the ":" (the colon or step character) modifies the range to step by a value. In this case, the range "7-10" is stepped by "2" giving indexes 7 and 9.

Example Scripts

Batch script examples are presented below (coming soon). Using the slurm sbatch command can be used to submit jobs using these scripts, and the prefilled options can be modified to suite your job's need.

Script	Description
serial.sh	Executing a job on 1 node with 1 CPU.
smp.sh	Executing a job on 1 node with many CPUs.
mpi.sh	Executing a job on multiple nodes with many CPUs.
scratch.sh	Executing a job with scratch space.
array.sh	Executing many jobs with indexing.

Execution and Data Models

The HPC's cluster supports a variety of execution and data models. For many users, simply executing a lot of concurrent serial jobs is enough, but some users may require different levels of parallelism. Some common modes of parallelism used on the cluster rely on the distinction of program and memory access locale¹:

	Shared Access Memory	Distributed or Simultaneous Access Memory
Single CPU	Single thread or Serial	Stream processing or General Purpose GPU (GPGPU)
Many CPUs	Single Node, Symmetric Multiprocessing (SMP)	Single Node plus Memory (or Disk) Locale swapping (Scratch Space)
Distributed CPUs	Distributed Global Address Space (D GAS)	Message Passing Interface (MPI)

1. Note that this particular taxonomy is not exhaustive and there are many mixtures of all these techniques.

Serial

The most common jobs on the HPC's cluster. Most users execute individual scripts, with sbatch, for each of these jobs. If a user needs a lot of jobs, a generator script can create and submit many of these jobs. A simple method for generating jobs is shown below:

[username@log01 ~]$ cat generate.sh #The generator script.
#!/bin/bash
input=("bill" "ted" "lisa" "42")
for i in ${input[*]}
do
    cp example.sh example_${i}.sh
    echo "echoMe $i" >> example_${i}.sh
    sbatch example_${i}.sh
done
[username@log01 ~]$ cat example.sh #A base submission script.
#!/bin/bash
#SBATCH --partition acomputeq
#SBATCH --cpus-per-task 1

function echoMe {
    echo "Hello $1 from job $SLURM_JOB_ID"
    sleep 120
    exit 0
}
[username@log01 ~]$ ./generate.sh #Multiple submissions all in one command.
Submitted batch job 114761
Submitted batch job 114762
Submitted batch job 114763
Submitted batch job 114764
[username@log01 ~]$ squeue -u username
 JOBID  PARTITION     NAME     USER ST TIME NODES NODELIST(REASON)
114761  acomputeq example_ username  R 0:07     1            ac10
114762  acomputeq example_ username  R 0:07     1            ac05
114763  acomputeq example_ username  R 0:07     1            ac05
114764  acomputeq example_ username  R 0:07     1            ac09
[username@log01 ~]$ cat slurm-11476* #Concatenated output from all the jobs.
Hello bill from job 114761
Hello ted from job 114762
Hello lisa from job 114763
Hello 42 from job 114764
[username@log01 example]$ cat example_bill.sh #Each derived submission script looks similar to this.
#!/bin/bash
#SBATCH --partition acomputeq
#SBATCH --cpus-per-task 1

function echoMe {
echo "Hello $1 from job $SLURM_JOB_ID"
sleep 120
exit 0
}

echoMe bill

Another way to submit many jobs is to provide your program with indexing, also known as an array job. This isn't always possible if jobs have dependencies on other jobs or require unusual input. Currently the limit for array jobs is indexes from 0 to 32768. An example of an array job based on the one above:

[username@log01 ~]$ cat example_array.sh #Submission script for our array job.
#!/bin/bash
#SBATCH --partition acomputeq
#SBATCH --cpus-per-task 1
#SBATCH --array 0-3

input=("bill" "ted" "lisa" "42")

function echoMe {
    echo "Hello $1 from job $SLURM_JOB_ID"
    sleep 120
    exit 0
}

echoMe ${input[SLURM_ARRAY_TASK_ID]}
[username@log01 ~]$ sbatch example_array.sh #Submitting our array job.
Submitted batch job 114773
[username@log01 ~]$ squeue -u username
   JOBID  PARTITION     NAME     USER ST TIME NODES NODELIST(REASON)
114773_0  acomputeq example_ username  R 0:03     1             ac01
114773_1  acomputeq example_ username  R 0:03     1             ac02
114773_2  acomputeq example_ username  R 0:03     1             ac02
114773_3  acomputeq example_ username  R 0:03     1             ac05
[username@log01 ~]$ cat slurm-114773_* #Output from our array job.
Hello bill from job 114773
Hello ted from job 114774
Hello lisa from job 114775
Hello 42 from job 114776

GPGPU (CUDA and OpenCL)

GPUs utilize many lightweight cores to process data. The paradigm is called stream computing, and it is similar to another paradigm called Single Instruction Multiple Data (SIMD). The idea with SIMD is that a lot of distinct data is processed by a single instruction, creating new data that can be processed further. CUDA and OpenCL extend this with special functions called kernels that act as single instructions. Executing CUDA and OpenCL programs is pretty simple as long as --partition gpuq and --gres gpu:G sbatch options are used. Also, if you use CUDA, make sure to load the appropriate modules in your submission script.

SMP

Sometimes referred to as multi-threading, this type of job is extremely popular on the HPC's cluster. These jobs require the maximum number of CPUs to be known ahead of time. Common implementations are POSIX threads, OpenMP, C++11 threads, BaSH Job Control, Python multiprocessing, and MATLAB Parallel Computing Toolbox. There are many more, but a common requirement is to define sbatch's --cpus-per-task option and passing the number of CPUs to the program being executed. With OpenMP, the environment variable OMP_NUM_THREADS should be defined in the job's submission script:

#This is a submission script excerpt!
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

Other programs may require the user to pass the environment variable SLURM_CPUS_PER_TASK to the program; for example, MATLAB has an internally defined function called maxNumCompThreads which can be defined in your MATLAB script:

%This is a matlab script excerpt!
maxNumCompThreads(str2num(getenv('SLURM_CPUS_PER_TASK')))

Scratch Space

In some cases, the largest nodes on a cluster cannot provide enough memory to load all data for a job. To accommodate more memory, a job can utilize a work directory (/home/scratch/$USER). The general idea is for a program to grab some data, do work, store the modified data to the work directory, and then repeat. This is similar to the page file in Microsoft Windows and the swap partition on Linux or Linux-like operating systems. Like page file and swap partitions, this technique is significantly slower than having all the data in memory. On the other hand, scratch space has two advantages over using swap space on the cluster. First, it allows more efficient algorithms to manage data specific to the program. Second, it allows a program to use much less memory than it would otherwise. To utilize it in this manner, the program must be aware of where the work directory is located. For many programs, this is defined by the TMPDIR (or TMP, TEMP, TMP_DIR, TEMP_DIR, etc...) environment variable:

#This is a submission script excerpt!
export TMPDIR=/scratch/$USER/$SLURM_JOB_ID/
mkdir $TMPDIR

#Run your program here 

rm -r $TMPDIR

In MATLAB, a parcluster member, JobStorageLocation can define a scratch space within the MATLAB script:

%This is a matlab script excerpt!
maxNumCompThreads(str2num(getenv('SLURM_CPUS_PER_TASK')))
cluster=parcluster('local')
cluster.JobStorageLocation = getenv('TMPDIR')

In Java, the tmpdir member from java.io can define the scratch upon execution in your submission script:

#This is a submission script excerpt!
export TMPDIR=/scratch/$USER/$SLURM_JOB_ID/
mkdir $TMPDIR

java -Djava.io.tmpdir=$TMPDIR #... 

rm -r $TMPDIR

Another use case for scratch space is to upload large chunks of data to the cluster that don't need to be backed up. If you have several terabytes of data, like genomic or stock data, that only need to be on the cluster while your job is running, it is best to take the following steps:

Define a directory name, such as MYDATA ("export MYDATA=/scratch/username/directoryName" from the cluster terminal).
Make a directory for your data in the cluster's scratch directory ("mkdir $MYDATA" from the cluster terminal).
Upload your data to your newly created directory in scratch (use WinSCP, MobaXTerm, or "scp -r data username@bigblue.memphis.edu:/scratch/username/directoryName" from your workstation).
Use sbatch to submit your submission scripts (make sure you include the MYDATA export from above in your submission script).
When your jobs complete and you no longer need that data copy the output back to your home directory and delete the data with the directory ("cp -r $MYDATA/myoutput $HOME/myoutput; rm $MYDATA").

MPI

"Message Passing Interface", the name really tells you how this model works. When started, an MPI program defines a master CPU and worker CPUs in a group or many groups. The CPUs in the group starts executing instructions until all CPUs reach finalize. The groups can be synchronized and check for messages from other CPUs. Each CPU can pass messages to other CPUs within a group.

Starting an MPI program can be done by using SLURM's srun command or OpenMPI/MPICH/Intel's mpirun command within a submission script. There are 3 common option combinations for submitting MPI jobs with sbatch:

"--cpus-per-task C --nodes M": Use C CPUs per node on M nodes giving C by M total CPUs.
1. This gives a big block of fixed CPUs across fixed nodes.
2. The advantage is increased speed from CPU-CPU locality and shared memory on single tasks.
3. The disadvantage is potentially increased job pending time and disallowing the scheduler to make space for other jobs.
4. If your job requires symmetry among nodes, this will force it.
"--ntasks N": Use N CPUs on at most N nodes.
1. This can be allocated anywhere on the cluster where CPUs are available.
2. The advantage is quick job submission and allowing the scheduler to make space for other jobs.
3. The disadvantage is potential slowdown due to decreased CPU-CPU locality and inability to used shared memory.
4. This will not guarantee symmetry among nodes.
"--ntasks N --cpus-per-task C": Use C CPUS per task on up to N nodes giving C by N total CPUs.
1. This gives lots of small CPU chunks across many nodes.
2. This is a best of both worlds approach compared to the first 2 option combinations.
3. This can provide symmetry among nodes.

DGAS

This model is just an abstraction that creates a "shared" memory using pinned memory (memory that is bound to a particular processor) and 1-sided communication (MPI uses 2-sided communication: send/receive). UPC, UPC++, Chapel, DEGAS, Co-array Fortran, Global Arrays, and Titanium have this capability. While this topic requires advanced knowledge to implement, utilizing it is almost identical to MPI on the HPC's cluster. An example of software that utilizes this paradigm is NWCHEM.

Software

Most of the software is located in /public/apps on the cluster:

[username@log001 ~]$ ls /public/apps/ #List everything in this directory.
abyss         caffe          gmt         moose_orig quip
adf           cap3           gnuplot     mopac             R
admb          c-blosc        gromacs     mrbayes           raxml
admixtools    cd-hit         gsl         mtmetis           rlwrap
admixture     charm          hdf4        mumps             rosetta
allpaths-lg   cmake          hdf5        namd              sac
amber         cns            hmetis      nccl              salmon
anaconda      comsol         hmmer       ncl               samtools
angsd         connectome     htslib      netcdf            sas
ansys         converge       hybpiper    novoalign         scalapack
apbs          cope           hyphy2      nwchem            scons
arrayfire     curl           imsl        openblas          simulia
atlas         espresso       intel       openbugs          slurmOther
augustus      exabayes       jellyfish   openfoam          snappy
autoconf      exonerate      jmodeltest2 openmm            spades
autodock      faiss          julia       openmpi           sqlite
autodock_vina fastqc         juliaa      OpenSees          sumo
autogrid      fastsimcoal2   lammps      orthograph        superlu_dist
automake      fastText       lapack      pandaseq          tie-array-packed
bamtools      FastTree       lastz       parallelgnu       treemix
bcftools      ffmpeg         latte       paraview          trimal
beagle        fftw           leveldb     parmetis          trimmomatic
beagle-lib    flash-modified libint      partitionfinder   trinity
beast         fox            mafft       petsc             upcxx
bedtools      gamess         matlab      phyml             vcftools
blacs         garli          md++        picard            velvet
blast         gaussian       metis       plink             vmd
blat          gcc            miniconda   pnetcdf           weka
boost         gd             modeller    progressiveCactus yasra
bowtie2       gflags         moe         psi4
bwa           git            molpro      pylith
bzip2         glog           molro       python
cactus        gmp            moose       qt

Modules

We have a variety of modules on the cluster for different software. To see these just use module avail as the following shows:

[username@log001 ~]$ module avail

---------------------------- /cm/local/modulefiles -----------------------------
cluster-tools-dell/8.1 lua/5.3.4
cluster-tools/8.1 module-git
cmd module-info
dot null
freeipmi/1.5.7 openldap
gcc/7.2.0 (L) openmpi/mlnx/gcc/64/3.0.0rc6
ipmitool/1.8.18 shared (L)

---------------------------- /usr/share/modulefiles ----------------------------
DefaultModules (L)

----------------------------- /public/modulefiles ------------------------------
FastTree/2.1.10
OpenSees/6.1.mc
OpenSees/6.1 (D)
R/3.5.1/gcc.8.2.0
R/3.5.2/gcc.8.2.0 (D)
R/3.5.0/gcc.8.2.0
abyss/2.1.0
lines 1-21

Then you can use module load in your submission script:

#This is a submission script excerpt!
module load R
#Run your program here

You can also use module load on the cluster terminal to compile, build, or just to see the effect:

[username@log001 ~]$ module load R

The following have been reloaded with a version change:
1) gcc/7.2.0 => gcc/8.2.0 

[username@log001 ~]$ R --version #Notice that this is the default (D) version from module avail
R version 3.5.2 (2018-12-20) -- "Eggshell Igloo"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit) #R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under the terms of the
GNU General Public License versions 2 or 3.
For more information about these matters see
http://www.gnu.org/licenses/.

Helpful Scheduler Commands

squeue

This command is useful if you want to see the status of your jobs:

[username@log001 ~]$ squeue -u username
JOBID    PARTITION NAME         USER ST TIME NODES NODELIST(REASON)
114773_0  computeq example_ username  R 0:03     1            c025
114773_1  computeq example_ username  R 0:03     1            c037
114773_2  computeq example_ username  R 0:03     1            c055
114773_3  computeq example_ username  R 0:03     1            c075

Typically, you will see one of several states (ST):

BF BOOT_FAIL: Job terminated due to launch failure, typically due to a hardware failure (e.g. unable to boot the node or block and the job can not be requeued).
CA CANCELLED: Job was explicitly cancelled by the user or system administrator. The job may or may not have been initiated.
CD COMPLETED: Job has terminated all processes on all nodes with an exit code of zero.
CF CONFIGURING: Job has been allocated resources, but are waiting for them to become ready for use (e.g. booting).
CG COMPLETING: Job is in the process of completing. Some processes on some nodes may still be active.
DL DEADLINE: Job terminated on deadline.
F FAILED: Job terminated with non-zero exit code or other failure condition.
NF NODE_FAIL: Job terminated due to failure of one or more allocated nodes.
OOM OUT_OF_MEMORY: Job experienced out of memory error.
PD PENDING: Job is awaiting resource allocation.
PR PREEMPTED: Job terminated due to preemption.
R RUNNING: Job currently has an allocation.
RD RESV_DEL_HOLD: Job is held.
RF REQUEUE_FED: Job is being requeued by a federation.
RH REQUEUE_HOLD: Held job is being requeued.
RQ REQUEUED: Completing job is being requeued.
RS RESIZING: Job is about to change size.
RV REVOKED: Sibling was removed from cluster due to other cluster starting the job.
SI SIGNALING: Job is being signaled.
SE SPECIAL_EXIT: The job was requeued in a special state. This state can be set by users, typically in EpilogSlurmctld, if the job has terminated with a particular exit value.
SO STAGE_OUT: Job is staging out files.
ST STOPPED: Job has an allocation, but execution has been stopped with SIGSTOP signal. CPUS have been retained by this job.
S SUSPENDED: Job has an allocation, but execution has been suspended and CPUs have been released for other jobs.
TO TIMEOUT: Job terminated upon reaching its time limit.

More information can be found by running man squeue in the cluster terminal.

sacct

As a user, you would normally see your jobs submitted over the past day:

[jspngler@log001 ~]$ sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
114190_19    10NM_Free+   computeq mlaradjil+         12  COMPLETED      0:0
114190_19.b+      batch            mlaradjil+         12  COMPLETED      0:0
114190_19.e+     extern            mlaradjil+         12  COMPLETED      0:0
114191_15    10NM_Free+   computeq mlaradjil+         12  COMPLETED      0:0
114191_15.b+      batch            mlaradjil+         12  COMPLETED      0:0
114191_15.e+     extern            mlaradjil+         12  COMPLETED      0:0
114761       example_b+   computeq mlaradjil+          1  COMPLETED      0:0
114761.batch      batch            mlaradjil+          1  COMPLETED      0:0
114761.exte+     extern            mlaradjil+          1  COMPLETED      0:0
114762       example_t+   computeq mlaradjil+          1  COMPLETED      0:0
114762.batch      batch            mlaradjil+          1  COMPLETED      0:0
114762.exte+     extern            mlaradjil+          1  COMPLETED      0:0
114763       example_l+   computeq mlaradjil+          1  COMPLETED      0:0
114763.batch      batch            mlaradjil+          1  COMPLETED      0:0
114763.exte+     extern            mlaradjil+          1  COMPLETED      0:0
114764       example_4+   computeq mlaradjil+          1  COMPLETED      0:0
114764.batch      batch            mlaradjil+          1  COMPLETED      0:0
114764.exte+     extern            mlaradjil+          1  COMPLETED      0:0
114773_4     example_a+   computeq mlaradjil+          1  COMPLETED      0:0
114773_4.ba+      batch            mlaradjil+          1  COMPLETED      0:0
114773_4.ex+     extern            mlaradjil+          1  COMPLETED      0:0
114773_0     example_a+   computeq mlaradjil+          1  COMPLETED      0:0
114773_0.ba+      batch            mlaradjil+          1  COMPLETED      0:0
114773_0.ex+ extern                mlaradjil+          1  COMPLETED      0:0
114773_1     example_a+   computeq mlaradjil+          1  COMPLETED      0:0
114773_1.ba+      batch            mlaradjil+          1  COMPLETED      0:0
114773_1.ex+     extern            mlaradjil+          1  COMPLETED      0:0
114773_2     example_a+   computeq mlaradjil+          1  COMPLETED      0:0
114773_2.ba+      batch            mlaradjil+          1  COMPLETED      0:0
114773_2.ex+     extern            mlaradjil+          1  COMPLETED      0:0
114773_3     example_a+   computeq mlaradjil+          1  COMPLETED      0:0
114773_3.ba+      batch            mlaradjil+          1  COMPLETED      0:0
114773_3.ex+     extern            mlaradjil+          1  COMPLETED      0:0
114797             bash   computeq mlaradjil+          1    RUNNING      0:0
114797.exte+     extern            mlaradjil+          1    RUNNING      0:0
114797.0           bash            mlaradjil+          1    RUNNING      0:0

The output can be controlled with other options specifying users, accounts, output, dates, and states. To see all of them, run man sacct in the cluster terminal.

scancel

Cancel a job. Usually scancel jobid will do, but if you want to cancel all your jobs, do scancel -u $USER. To see all the options, run man scancel in the cluster terminal.

scontrol

Output and modify your jobs. "scontrol show job 26943" gives some pretty detailed information about job with jobid 26943. "scontrol update jobId=26943 numtasks=1" would set the number of tasks for jobid 26943 to 1. Most update commands can only be executed on pending jobs or by an administrator. To see all the options, run man scontrol in the cluster terminal.

srun

The SLURM run command is used to run job steps in a batch job. If it is used with a non-parallel program, it will proceed to run it for every task. For example, if --ntasks 4, and srun hostname is in a script, you might see the list of the 4 nodes srun ran on with duplicates if a node allocation has more than one task. To make sure srun only runs once in a step, you can do srun --ntasks 1 hostname. For MPI jobs, srun can replace mpirun in most instances. To see all the options, run man srun in the cluster terminal.