High Performance Computing at the IARC Data Center

  • CyARC Community Cluster (CCC)

Composed of 64-core AMD nodes, each owned by an individual researcher or group, CCC employs unique software developed in-house to build clusters on the fly, tailored for each job. Features include workload management with SLURM, inter-node communication over QDR InfiniBand, dynamically allocated NFS storage over 10 Gb Ethernet, and a flexible queue structure with priority for node owners.
Alternatively, a node may be operated in stand-alone mode, without participation in the community construct.

How it works:

A researcher acquires a 64-core cluster node, which we configure to their specification. Given that most jobs need 64 cores or fewer, the node is useful as-is, and may run in stand-alone mode. If more cores are desired, the node is additionally configured with CCC software, allowing extra cores to be procured in one of three ways:

1) The overnight queues

A researcher's node is reserved for exclusive use during the day, but late in the afternoon it becomes available in two 16-hour night queues for jobs needing more than 64 cores, with a guarantee that the node will be reserved for exclusive use again by morning.

If a researcher launches a job on their own cores before the late-afternoon cutoff, however, no outside jobs are scheduled on those cores during the night until the cores become idle. Furthermore, additional jobs originating on a home node have priority over any waiting queued jobs that originated on other nodes.

In other words, a researcher's node becomes available to others at night only when it is not in use and the researcher has no other jobs waiting in the higher-priority private queues on that node.

The night queues are night4max and night8max, with a maximum of 4 and 8 nodes respectively, with night4max jobs having higher priority than night8max jobs.
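
For example, a batch script (see the Tutorial below) could be submitted ahead of time to the night4max queue; the script name here is just a placeholder:

$ sbatch -p night4max -N4 --ntasks-per-node=60 myjob.sh

The job waits in the queue until resources become available after 4:00 p.m., runs overnight on up to 4 nodes, and must finish before the nodes are returned to their owners at 9:00 a.m.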

2) The weekend queues

Similar to the night queues, the weekend queues allow for longer running jobs. The queues are active from Friday afternoon to Monday morning, enough time for a 64 hour job. The weekend queues are wkend4max and wkend8max, with a maximum of 4 and 8 nodes respectively, with wkend4max jobs having higher priority than wkend8max jobs.
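
A weekend job is submitted the same way, e.g. (again, the script name is a placeholder):

$ sbatch -p wkend8max -N8 --ntasks-per-node=60 mylongjob.sh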

3) The common pool

The third way to get more cores is to use a queue containing additional nodes from a 'common' pool provided by IARC. These nodes are available at any time of day in a variety of queues up to 8 hours long, at night in the 16-hour queues, and on weekends in the 64-hour queues. See below for examples.


Tutorial (under construction)

Compile
MPI is provided by MPICH or OpenMPI, built with the Portland Group compilers. 
MPI compiler commands are mpicc, mpif90, etc., e.g.

$  mpicc -o mpi_wrapper mpi_wrapper.c
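
To check which underlying compiler and MPI library a wrapper invokes, the 
wrapper can report its command line ('-show' is the MPICH form, '--showme' 
the OpenMPI form); Fortran source is compiled the same way (file names below 
are placeholders):

$  mpicc -show
$  mpicc --showme
$  mpif90 -o mpi_prog mpi_prog.f90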

Queues
The queue structure is varied and rich. Each node has its own set of home 
queues, the names of which are specific to the node owner, and usable on any 
node owned by that user. For a machine called 'host64', the queues available 
might be:

host     infinite run time, highest priority queue, max 64 cores on home node
host-1hr 1 hour runtime, 2nd highest priority, max 3 nodes, 60 cores/node
host-2hr 2 hour runtime, 3rd highest priority, max 3 nodes, 60 cores/node
host-4hr 4 hour runtime, 4th highest priority, max 3 nodes, 60 cores/node
host-8hr 8 hour runtime, 5th highest priority, max 3 nodes, 60 cores/node


The overnight queues are available to all. They run jobs from 4:00 p.m. to 9:00 a.m.:
night4max 16 hour runtime, max 4 nodes, 60 cores/node
night8max 16 hour runtime, max 8 nodes, 60 cores/node

The weekend queues run jobs from 4:00 p.m. Friday to 9:00 a.m. Monday:
wkend4max 64 hour runtime, max 4 nodes, 60 cores/node
wkend8max 64 hour runtime, max 8 nodes, 60 cores/node

night4max has higher priority than night8max, and both have lower priority 
than the 8-hour queues; the same holds for the weekend queues. Jobs submitted 
to a night queue during the day are queued, and wait to run until resources 
become available after 4:00 p.m. It is important to estimate your runtime 
accurately, as the available time in the queue is decremented hourly as the 
night progresses, ensuring that jobs finish by 9:00 a.m.
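
One way to give the scheduler an accurate estimate is to request an explicit 
time limit with the standard SLURM '--time' option, shorter than the queue 
maximum (a sketch; the 10-hour value is only an example):

$ sbatch -p night8max --time=10:00:00 myjob.sh

A job requesting a shorter time can still be started later in the night than 
one asking for the full 16 hours.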


Queue commands
To look at queue information, use the sinfo command. You will be shown only 
those queues in which you have permission to run:

node01:$ sinfo
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
node01        up   infinite      1   idle node01
node01-1hr    up    1:00:00      3   idle node[01,03-04]
node01-2hr    up    2:00:00      3   idle node[01,03-04]
node01-4hr    up    4:00:00      3   idle node[01,03-04]
node01-8hr    up    8:00:00      3   idle node[01,03-04]

To see all queues, use the '-a' flag:

node01:$ sinfo -a
PARTITION  AVAIL  TIMELIMIT  NODES  STATE NODELIST
node01        up   infinite      1   idle node01
host          up   infinite      1   idle host64
node01-1hr    up    1:00:00      3   idle node[01,03-04]
node01-2hr    up    2:00:00      3   idle node[01,03-04]
node01-4hr    up    4:00:00      3   idle node[01,03-04]
node01-8hr    up    8:00:00      3   idle node[01,03-04]
host-1hr      up    1:00:00      4   idle host64,node[01,03-04]
host-2hr      up    2:00:00      4   idle host64,node[01,03-04]
host-4hr      up    4:00:00      4   idle host64,node[01,03-04]
host-8hr      up    8:00:00      4   idle host64,node[01,03-04]
etc.

Note:
Under PARTITION, a '*' identifies the default partition, if there is one.
Under STATE, a '*' indicates a node not responding.

To see your active jobs, use the squeue command:

$ squeue
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
           56     night     host     boss PD       0:00      1 (ReqNodeNotAvail)

Once you have the JOBID of your job, you can cancel it with scancel:

$ scancel 56
$ squeue
        JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
        
To see all jobs on the cluster, use 'squeue -a'.

Running your code
First, decide which directories on your node should be exported to other nodes. 
/home, /opt, and /usr are automatically exported. Any other directories to be 
exported are specified with the environment variable CCMOUNTS. For example, to 
have slave nodes NFS mount '/data/mydir' and '/raid0', set CCMOUNTS in either 
the submission shell or the job script:

$ export CCMOUNTS="/data/mydir,/raid0"

where multiple directories are separated by either commas or spaces. Directories 
which may not be exported are /, /dev, /lib, /lib64, /proc, /tmp, and /var.

The CCC software will automatically export the file systems to the nodes 
allocated to your job, create necessary mount points, and un-export/clean-up 
when finished, leaving the slave nodes in their original state.

Note that the exported directories are private to the job and its children, 
and are invisible to other jobs running in the slave node's process space.
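
As a sketch, the same export can be requested from inside a batch script 
(directory and queue names here are just examples):

$ cat myjob.sh
#!/bin/bash
#SBATCH -p host-8hr
# slave nodes will NFS mount these directories before the job runs
export CCMOUNTS="/data/mydir,/raid0"
<rest of job script, as shown below>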


From the command line (see 'man srun'):
$ srun -n64 -p <partition> <command> <command options>
i.e.
$ srun -n64 -p host hostname -I

runs <command> on 64 cores, where -p is the partition (queue) to use. Pick
a partition from the 'sinfo' command, looking for one with your node name 
and appropriate length of time, and an idle NODELIST.

MPI
If <command> is an MPI job, run it in a script (here called 'myjob.sh'), 
with optional prolog and epilog commands (the <angle-bracket> placeholders 
are to be replaced by the user):

$ cat myjob.sh
#!/bin/bash

# optional prolog commands next
<optional commands>

# use srun to generate hostfile for MPI
srun hostname > nodes$$

# run the MPI code
mpirun -np $SLURM_NTASKS -hostfile nodes$$ /path/to/<command>

# remove hostfile
rm nodes$$

# optional epilog commands next
<optional commands>



and run the job with the SLURM allocation command 'salloc':

$ salloc -N4 -n128 --ntasks-per-node=32 -p host-8hr ./myjob.sh

where
-N4                    use 4 nodes
-n128                  128 tasks across the 4 nodes
--ntasks-per-node=32   no more than 32 tasks per node (*)
-p host-8hr            is the partition (queue) to use.

(*) note that the 64-core nodes have only 32 floating point units, so 
if your code is floating point intensive, use only 32 ntasks-per-node


If you cancel the job, or the job is killed for running over time, 
you may have to kill 'mpirun' by hand.
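
A leftover 'mpirun' can be found and killed with standard Linux commands, 
for example:

$ pgrep -u $USER mpirun      # list any leftover mpirun process IDs
$ pkill -u $USER mpirun      # kill them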


Submit batch job (preferred method, see 'man sbatch'):
$ sbatch <shell script>

example shell script, which can have optional prolog/epilog commands as above:

$ cat <shell script> (Note that '#SBATCH' arguments are also 'salloc' arguments)
#!/bin/bash
#SBATCH -N4                      <--- number of nodes
#SBATCH -n128                    <--- number of total execution threads
#SBATCH --ntasks-per-node=32     <--- <= 32 execution threads/node (*)
#SBATCH -p host-8hr              <--- 8 hour queue for my node
#SBATCH --mail-user=my@email.edu <--- get email based on following event:
#SBATCH --mail-type=ALL          <--- BEGIN or END or FAIL or REQUEUE or ALL

# optional prolog commands next
<optional commands>

# use srun to generate hostfile for MPI
srun hostname > nodes$$

# run the MPI code
mpirun -np $SLURM_NTASKS -hostfile nodes$$ /path/to/<command>

# remove hostfile
rm nodes$$

# optional epilog commands next
<optional commands>



(*) -n128 is supposed to dominate, but this script will actually run 
129 execution threads, whereas 'salloc' with the same options runs 128.


If you cancel the job, or the job is killed for running over time, 
you may have to kill 'mpirun' by hand.

To examine the run environment for a job, get its JOBID from 'squeue', and do
$ scontrol show job 368
JobId=368 Name=run128.sh
   UserId=aUser(1000) GroupId=aUser(1000)
   Priority=4294901726 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   RunTime=01:25:16 TimeLimit=04:00:00 TimeMin=N/A
   SubmitTime=2015-03-17T13:12:58 EligibleTime=2015-03-17T13:12:58
   StartTime=2015-03-17T13:12:59 EndTime=2015-03-17T17:12:59
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=host-4hr AllocNode:Sid=host64:52116
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=node0[1,3-4]
   BatchHost=node01
   NumNodes=3 NumCPUs=129 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=43:0:*:* CoreSpec=0
   MinCPUsNode=43 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/aUser/run128.sh
   WorkDir=/home/aUser
   StdErr=/home/aUser/slurm-368.out
   StdIn=/dev/null
   StdOut=/home/aUser/slurm-368.out


  
  • Node Acquisition

A node can be acquired from a vendor ($8K to $12K, depending upon configuration and edu discounts), or we can build one from parts as they become available. Node owners not affiliated with IARC agree to join the community paradigm, forgoing the option to run in stand-alone mode. The community paradigm provisions the node to others when not in use, reciprocally granting the node owner access to additional cores on other nodes.
Contact us for more information.


  • Cloud Integration

IARC-affiliated users without their own node may launch HPC jobs from a properly configured virtual machine running in the IARC cloud. Jobs launched from a virtual machine run at lowest priority, and use the 10 Gb network for communication.
