AI Cloud at AAU user information

This README contains information for getting started with the "AI Cloud" DGX-2 system at AAU, links to additional resources, and examples of daily usage. There are many users on this system, so please consult the section on Fair usage and follow the guidelines. We also have a community site at AAU Yammer where users can share experiences and where we announce workshops, changes, and service work on the system.

Introduction

Working on the AI Cloud is based on a combination of two different mechanisms, Singularity and Slurm, which are used for the following purposes:

  • Singularity is a container framework which provides you with the necessary software environment to run your computational workloads. Different researchers may need widely different software stacks, or different versions of the same software stack, for their work. In order to provide maximum flexibility to you as users and to minimise potential compatibility problems between different software installed on the compute nodes, each user's software environment(s) is defined and provisioned as Singularity containers. You can download pre-defined container images or configure/modify them yourself according to your needs.
    See details on container images from NGC further down.
  • Slurm is a queueing system that manages resource sharing in the AI Cloud. Slurm makes sure that all users get a fair share of the resources and get served in turn. Computational work in the AI Cloud can only be carried out through Slurm. This means you can only run your jobs on the compute nodes by submitting them to the Slurm queueing system. It is also through Slurm that you request the amount of resources your job requires, such as the amount of RAM, number of CPU cores, number of GPUs etc.
    See how to get started with Slurm further down.

Getting started

An alternative workshop-style introduction to the system is also available.

Logging in

Once your user account has been created on the AI Cloud, you can access the platform using SSH.

Generally, the AI Cloud is only directly accessible from the AAU network. In that case, you can access the front-end node with:

ssh <aau-ID>@ai-pilot.srv.aau.dk

Replace <aau-ID> with your personal AAU ID. For some users, this is the same as your email address (e.g. yourname@department.aau.dk). For newer users, the AAU ID may be separate from your email address and have the form AB12CD@department.aau.dk; if so, use your AAU ID to log in.

If you wish to access while not being connected to the AAU network, you have two options: Use VPN or use AAU's SSH JumpHost.

If you are often outside AAU, you can set up the SSH JumpHost permanently in your personal ssh config file (on Linux/macOS typically located in $HOME/.ssh/config):

Host ai-pilot.srv.aau.dk
     User <aau-ID>
     ProxyJump %r@sshgw.aau.dk

Add the above configuration to your personal ssh config file. You can then connect to the platform regardless of network simply by running ssh ai-pilot.srv.aau.dk.
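
If you prefer not to edit your ssh config, a one-off jump through the gateway works as well (assuming a reasonably recent OpenSSH client with the -J option):

ssh -J <aau-ID>@sshgw.aau.dk <aau-ID>@ai-pilot.srv.aau.dk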

Slurm basics

To get a first impression, try:

$ scontrol show node
NodeName=nv-ai-01.srv.aau.dk Arch=x86_64 CoresPerSocket=24 
   CPUAlloc=52 CPUTot=96 CPULoad=19.52
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:16(S:0-1)
   NodeAddr=nv-ai-01.srv.aau.dk NodeHostName=nv-ai-01.srv.aau.dk 
   OS=Linux 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020 
   RealMemory=1469490 AllocMem=661304 FreeMem=1387641 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=batch 
   BootTime=2020-09-16T19:10:12 SlurmdStartTime=2020-09-17T11:24:52
   CfgTRES=cpu=96,mem=1469490M,billing=271,gres/gpu=16
   AllocTRES=cpu=52,mem=661304M,gres/gpu=16
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
   

NodeName=nv-ai-03.srv.aau.dk Arch=x86_64 CoresPerSocket=24 
   CPUAlloc=44 CPUTot=96 CPULoad=48.55
   AvailableFeatures=(null)
   ActiveFeatures=(null)
   Gres=gpu:16(S:0-1)
   NodeAddr=nv-ai-03.srv.aau.dk NodeHostName=nv-ai-03.srv.aau.dk 
   OS=Linux 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019 
   RealMemory=1469490 AllocMem=1404304 FreeMem=306301 Sockets=2 Boards=1
   State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
   Partitions=batch 
   BootTime=2020-09-16T19:02:40 SlurmdStartTime=2020-09-16T19:03:35
   CfgTRES=cpu=96,mem=1469490M,billing=271,gres/gpu=16
   AllocTRES=cpu=44,mem=1404304M,gres/gpu=15
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

The above command shows the names of the nodes, partitions (batch), etc. involved in the Slurm system. The key lines to determine the current resource utilization are:

CfgTRES=cpu=96,mem=1469490M,billing=271,gres/gpu=16
AllocTRES=cpu=44,mem=1404304M,gres/gpu=15

Here we can see that 44 out of 96 CPUs are allocated/reserved, about 1.40 TB out of 1.47 TB of memory is allocated/reserved, and 15 out of 16 GPUs are allocated/reserved.
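
For a more compact overview than the full scontrol output, a sinfo query along the following lines can also be used (a sketch; the exact output fields available depend on the installed Slurm version, and CPUsState is shown as allocated/idle/other/total):

sinfo -N -O nodelist:25,cpusstate:15,gresused:40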

Slurm allocate resources

We can then allocate resources for ourselves. Let us say we would like to allocate one GPU (NVIDIA V100):

salloc --gres=gpu:1
salloc: Pending job allocation 1612
salloc: job 1612 queued and waiting for resources
salloc: job 1612 has been allocated resources
salloc: Granted job allocation 1612
salloc: Waiting for resource configuration
salloc: Nodes nv-ai-03.srv.aau.dk are ready for job

We can then check the queue

squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               109     batch     bash tlj@its.  R       3:41      1 nv-ai-03.srv.aau.dk

Here the state (ST) is "running" (R). If there were not enough resources available, the state would be "pending" (PD).
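
squeue lists all jobs on the system. To show only your own jobs, you can filter by user:

squeue -u $USER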

You can then run commands within this allocation using srun, as in srun <bash command>:

$ srun df -h
Filesystem                 Size  Used Avail Use% Mounted on
udev                       756G     0  756G   0% /dev
tmpfs                      152G  4.8M  152G   1% /run
/dev/md0                   879G  275G  560G  33% /
tmpfs                      756G  124M  756G   1% /dev/shm
tmpfs                      5.0M     0  5.0M   0% /run/lock
tmpfs                      756G     0  756G   0% /sys/fs/cgroup
/dev/nvme1n1p1             511M  6.1M  505M   2% /boot/efi
/dev/md1                    28T   12T   15T  45% /raid
nv-ai-03.srv.aau.dk:/user   30T   29T  1.7T  95% /user

If you have a job on a node, you can also ssh to that node

$ ssh nv-ai-01
Welcome to NVIDIA DGX Server Version 4.7.0 (GNU/Linux 4.15.0-134-generic x86_64)

  System information as of Tue Aug 24 15:17:03 CEST 2021

  System load:  64.83               Processes:            1367
  Usage of /:   31.2% of 878.57GB   Users logged in:      0
  Memory usage: 5%                  IP address for bond0: 172.19.20.98
  Swap usage:   0%
Last login: Tue Aug 24 15:14:37 2021 from 172.19.8.14
tlj@its.aau.dk@nv-ai-01:~$ 

You can view information on your job using

scontrol show job <JOBID>

or additional details, such as the GPU index (IDX), using

scontrol -d show job <JOBID>

The allocation can be relinquished with

exit

or with a Slurm cancel statement

scancel <JOBID>
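
salloc accepts further options for tailoring the request. A sketch of a more specific allocation is shown below; the values are only placeholders and should be adjusted to your job and to the QoS limits described in the next section:

salloc --gres=gpu:1 --cpus-per-task=4 --mem=40G --time=02:00:00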

Slurm QoS

By default, jobs run with the 'normal' Quality-of-Service (QoS). If you need several/additional GPUs (multi-GPU jobs) or a longer run time, use the following query to find and choose the QoS suitable for your needs.

$ sacctmgr show qos format=name,maxtresperuser%20,maxwalldurationperjob
      Name            MaxTRESPU     MaxWall
---------- -------------------- -----------
    normal    cpu=20,gres/gpu=1  2-00:00:00
     short    cpu=32,gres/gpu=4    03:00:00
   allgpus    cpu=48,gres/gpu=8 21-00:00:00
  1gpulong    cpu=16,gres/gpu=1 14-00:00:00
 admintest   cpu=96,gres/gpu=16  1-00:00:00
      1cpu                cpu=1    06:00:00
  deadline    cpu=64,gres/gpu=8 14-00:00:00 

As an example, it is possible to allocate two GPUs with

salloc --qos=allgpus --gres=gpu:2

where 'allgpus' above can be one of the following:

  • normal: for one-GPU job (the default QoS).
  • short: for 1 or more small GPU jobs - possible for testing batch submission or interactive jobs.
  • allgpus: for 1 or more large GPU jobs.
  • 1gpulong: for 1 large one-GPU job.
  • admintest: special QoS only usable by administrators for full node testing.
  • 1cpu: assigned to inactive student users after each semester (no GPU).
  • deadline: a QoS that users with a hard publication deadline can apply for access to. To get access, please follow this guide.

Besides this, jobs in the more restrictive QoS groups generally have a higher priority, so they tend to be allocated before jobs with a lower priority. Jobs submitted in the 'deadline' QoS have the highest priority.
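
If you are unsure which QoS levels your user is allowed to submit to, you can query your Slurm associations (a sketch; the output format may vary with the Slurm version):

sacctmgr show association where user=$USER format=user,account,qos%40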

Getting your (Singularity) environment up

It is possible to run everything from a single line. First, pull a Docker image and convert it to a Singularity image:

srun singularity pull docker://nvcr.io/nvidia/cuda:latest

This may take some time. There is now an image called 'cuda_latest.sif' that we can make use of. The address 'docker://nvcr.io/nvidia/cuda:latest' identifies where to retrieve the image on NVIDIA GPU Cloud (NGC) - more about this further down.
Notice that we used both Slurm (srun) and Singularity (singularity) above to retrieve an image. srun is for executing the actual job in AI Cloud. singularity must be executed on one of the DGX-2 compute nodes, which srun takes care of. You cannot execute singularity directly on the front-end node. singularity here retrieves the specified (Docker) image from NGC and automatically converts it to a Singularity image.
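
By default the image file is named after the image name and tag ('cuda_latest.sif' above). If you prefer a different name or location, singularity pull also accepts an explicit target file name, for example:

srun singularity pull my_cuda.sif docker://nvcr.io/nvidia/cuda:latest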

To execute a command at the compute node with certain specified resources, we can do

$ srun --gres=gpu:1 singularity exec --nv docker://nvcr.io/nvidia/cuda:latest nvidia-smi
srun: job 264 queued and waiting for resources
srun: job 264 has been allocated resources
WARNING: group: unknown groupid 140195
Mon Apr 15 12:54:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104      Driver Version: 410.104      CUDA Version: 10.0     |
|-------------------------------|----------------------|----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM3...  Off  | 00000000:34:00.0 Off |                    0 |
| N/A   33C    P0    52W / 300W |      0MiB / 32480MiB |      0%      Default |
+-------------------------------|----------------------|----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Here --gres=gpu:1 states that we want to allocate one GPU. The --nv argument makes the NVIDIA GPU driver and libraries available inside the container.

A few words on container images

In the previous section, we saw how to retrieve an image from NGC and how to instantiate that image as a Singularity container and execute a command in it. The image was retrieved from NVIDIA GPU Cloud (NGC). NGC is NVIDIA's official repository with many different images of useful software environments for deep learning and other GPU-accelerated tasks. The images in NGC have been specially built for NVIDIA GPUs and can be very convenient to use out-of-the-box instead of having to configure your software environment from scratch yourself.

NGC's container images are Docker images, but Singularity can convert them on the fly to run as Singularity containers. Docker itself is not used in AI Cloud due to security concerns in multi-user environments. You can also use images from other Docker repositories such as Docker Hub. Remember to look for images that are built with support for NVIDIA GPUs.

You can read more here about how to Build images from scratch or modify images from NGC.

Once images are built, they are immutable, meaning that you cannot install additional software inside the images themselves at container runtime. Sometimes, especially when running Python software, it can be convenient to install additional packages into your runtime environment using pip. This can cause special challenges which you can read more about, and see how to solve, in Installing Python packages with pip in Singularity containers. See also Troubleshooting.
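
As a minimal illustration (the image and package names below are only placeholders), a runtime pip install from inside a container typically looks like this; note that with --user the package ends up under your home directory (e.g. in ~/.local), which is shared across containers and is exactly the kind of situation the pages linked above discuss:

srun singularity exec mycontainer.sif pip install --user somepackage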

Where to save your files

The front-end node and compute nodes have access to a distributed file system in '/user' based on NFS. To see this, first create a file on the front-end node in your home directory

echo "I have just created this" > example.txt

This is available on the front-end node. But you can also "see" this on a compute node

srun cat example.txt

Using the scratch space

Each compute node has a scratch space for storing temporary data. It is a RAID0 NVME (SSD) partition with high disk I/O capacity. Following are two ways you can use it for your jobs:

Via interactive bash session:

Since the RAID scratch space is local to each compute node, you need to specify the exact node you want to use with the --nodelist argument as below:

srun --pty --nodelist=nv-ai-01.srv.aau.dk bash -l
srun: job 575 has been allocated resources

vle@its.aau.dk@nv-ai-01:~$ ls /raid/
vle@its.aau.dk@nv-ai-01:~$ mkdir -p /raid/its.vle # create a folder to hold your data. It's a good idea to use this path pattern: /raid/<subdomain>.<username>.
vle@its.aau.dk@nv-ai-01:~$ cp -a /user/its.aau.dk/vle/testdata /raid/its.vle/
vle@its.aau.dk@nv-ai-01:~$ exit # quit interactive session.

After the data has been copied to the scratch folder, you can use it by referring to the data in your code. For example:

srun --pty --nodelist=nv-ai-01.srv.aau.dk ls /raid/its.vle/testdata

Via sbatch job script

You can script the whole chain of commands into an sbatch script

#!/usr/bin/env bash
#SBATCH --job-name MySlurmJob # CHANGE this to a name of your choice
#SBATCH --partition batch # equivalent to PBS batch
#SBATCH --time 24:00:00 # Run 24 hours
#SBATCH --qos=normal # possible values: short, normal, allgpus, 1gpulong
#SBATCH --gres=gpu:1 # CHANGE this if you need more or less GPUs
#SBATCH --nodelist=nv-ai-01.srv.aau.dk # CHANGE this to nodename of your choice. Currently only two possible nodes are available: nv-ai-01.srv.aau.dk, nv-ai-03.srv.aau.dk
##SBATCH --dependency=aftercorr:498 # More info slurm head node: `man --pager='less -p \--dependency' sbatch`

## Preparation
mkdir -p /raid/its.vle # create a folder to hold your data. It's a good idea to use this path pattern: /raid/<subdomain>.<username>.

if [ ! -d /raid/its.vle/testdata ]; then
     # Wrap this copy command inside the if condition so that we copy data only if the target folder doesn't exist
     cp -a /user/its.aau.dk/vle/testdata /raid/its.vle/

fi

## Run actual analysis
## The benefit of using multiple srun commands is that they create sub-jobs (job steps) for your sbatch script and can be used for advanced usage with Slurm (e.g. checkpoints, recovery, etc.)
srun python /path/to/my/python/script --arg1 --arg2
srun echo finish analysis

A script such as the above can be submitted to the Slurm queue using the sbatch command

sbatch <name-of-script>
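
After submission, squeue shows where the job sits in the queue, and once it has started (or finished) you can inspect it with sacct, for example:

sacct -j <JOBID> --format=JobID,JobName,State,Elapsed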

Transferring files

You can transfer files to/from AI Cloud using the command line utility scp from your local computer (Linux and OS X). To AI Cloud:

$ scp some-file <USER ID>@ai-pilot.srv.aau.dk:~

where '~' is your user folder on AI Cloud. <USER ID> could for example be ‘ab34ef@department.aau.dk’.
You can append folders below that to your destination:

$ scp some-file <USER ID>@ai-pilot.srv.aau.dk:~/some-folder/some-subfolder/

From AI Cloud:

$ scp <USER ID>@ai-pilot.srv.aau.dk:~/some-file some-local-folder

In general, file transfer tools that can use SSH as protocol should work. A common choice is FileZilla or the Windows option WinSCP.

If you wish to mount a folder in AI Cloud on your local computer for easier access, you can also do this using sshfs (Linux command line example executed on your local computer):

mkdir aicloud-home
sshfs <USER ID>@ai-pilot.srv.aau.dk:/user/<DOMAIN>/<ID> aicloud-home

where <DOMAIN> is 'department.aau.dk' and <ID> is 'ab34ef' for user 'ab34ef@department.aau.dk'.
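
When you are done, the folder can be unmounted again (Linux example; on macOS, umount aicloud-home typically works):

fusermount -u aicloud-home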

Ways of using Slurm

Just to summarise from the above examples, there are three typical ways of executing work through Slurm:

  • Allocate resources: use salloc (see Slurm allocate resources). This reserves the requested resources for you, which you can then use when they are ready. This means you can, for example, log into the specified compute node(s) to use it/them interactively (you cannot do this without having allocated resources on them).
  • Run command directly through Slurm: use srun (see Getting your (Singularity) environment up). This runs the specified command directly on the requested resources as soon as they are ready. The srun command will block until done.
  • Schedule one or more jobs to run whenever resources become ready: use sbatch (example in Via sbatch job script). This command lets you specify the details of your job in a script which you submit to the queue via sbatch.
    This is the most convenient way to run jobs once you know in advance what needs to be done. Even large and complicated amounts of work can be specified here and now, leaving it to Slurm to run things as soon as resources become available, so you do not have to sit around and wait. The sbatch command returns immediately, and you can then use squeue to inspect where your jobs are in the queue.

Too many open files

There is a limit on the number of open files you can have on the compute nodes. These limits on the compute nodes are drawn from the limits on the login node. If you need to work with many files, you might need to increase the default value. To see and alter the default limit, you can do:

nv-ai-fe01:~$ ulimit -n
4096
nv-ai-fe01:~$ srun --pty bash -c 'ulimit -n'
4096
nv-ai-fe01:~$ ulimit -n 16384
nv-ai-fe01:~$ srun --pty bash -c 'ulimit -n'
16384

Examples

Interactive TensorFlow

First we pull a TensorFlow image

srun singularity pull docker://nvcr.io/nvidia/tensorflow:19.03-py3

The pull address of the container can be found from the NGC catalog.

We can then do

srun --gres=gpu:1 --pty singularity shell --nv tensorflow_19.03-py3.sif

or by reference to the docker image 'docker://nvcr.io/nvidia/tensorflow:19.03-py3'

You now have shell access

Singularity tensorflow_19.03-py3.sif:~>

Documentation and examples are available in '/workspace/'. You can exit the interactive session with

exit
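
As a quick sanity check that TensorFlow can see the GPU, you can also run a one-liner non-interactively (a sketch based on the image pulled above; the TensorFlow 1.x API is assumed here):

srun --gres=gpu:1 singularity exec --nv tensorflow_19.03-py3.sif python -c 'import tensorflow as tf; print(tf.test.is_gpu_available())'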

Inspecting your utilization

It is recommended practice, after you have configured your environment/pipeline, that you inspect your GPU utilization [%], and possibly memory utilization, to see if you

  1. indeed are utilizing the GPU as expected.
  2. achieve a somewhat acceptable level of GPU utilization.

You can do this with the nvidia-smi command, by executing e.g. the following in your environment. First, get an interactive resource:

srun --gres=gpu:1 --pty singularity shell --nv myimage.sif

Start observing with nvidia-smi:

nvidia-smi --query-gpu=index,timestamp,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free --format=csv -l 5 > util.csv &

then start your (small) code, e.g.

python .....

Afterwards have a look at the reported utilization

Singularity> cat util.csv
index, timestamp, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, 2020/11/12 13:48:11.412, 99 %, 52 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:16.415, 99 %, 90 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:21.417, 99 %, 51 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:26.419, 98 %, 61 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:31.424, 100 %, 88 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:36.427, 32 %, 33 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:41.428, 98 %, 70 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:46.435, 100 %, 91 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:51.437, 99 %, 48 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:56.440, 100 %, 91 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:01.441, 99 %, 38 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:06.443, 97 %, 63 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:11.444, 35 %, 36 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:16.446, 98 %, 72 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:21.447, 100 %, 88 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:26.449, 98 %, 68 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:31.451, 98 %, 68 %, 32480 MiB, 30994 MiB, 1486 MiB

Now we are certain that the code and software are set up to utilize the GPU.

You can also do this with the script getUtilByJobId.sh after job submission:

nv-ai-fe01:~$ getUtilByJobId.sh 83549
To end, do a CRTL-C

utilization.gpu
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
The sample period may be between 1 second and 1/6 second depending on the product.

utilization.memory
Percent of time over the past sample period during which global device memory was being read or written.
The sample period may be between 1 second and 1/6 second depending on the product.

memory.total
Total installed GPU memory.

memory.used
Total memory allocated by active contexts.

memory.free
Total free memory.
salloc: Granted job allocation 84172
salloc: Waiting for resource configuration
salloc: Nodes nv-ai-03.srv.aau.dk are ready for job
tlj@its.aau.dk@nv-ai-03.srv.aau.dk's password: 
index, timestamp, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, 2020/11/12 14:02:14.854, 98 %, 66 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:19.858, 0 %, 0 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:24.861, 48 %, 42 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:29.863, 99 %, 72 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:34.864, 72 %, 57 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:39.868, 100 %, 68 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:44.871, 99 %, 66 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:49.872, 99 %, 66 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:54.875, 99 %, 68 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:59.884, 11 %, 7 %, 32480 MiB, 30994 MiB, 1486 MiB
^Csalloc: Relinquishing job allocation 84172

where 83549 is the Slurm JobId. If your job does not behave as intended, then please analyse the problem and try to solve the issue. It is not good practice to have GPUs allocated but not utilized on the system. If you cannot find the issue, please email support@its.aau.dk with as many details as possible, preferably with a minimal working example.

TensorFlow and Keras example with own custom setup

In this example we will set everything up on the compute node by checking out a Git repository. Notice that some of the steps need not be executed for every run, e.g. when the code, software or data is already available.

Clone the repository locally

git clone https://git.its.aau.dk/CLAAUDIA/docs_aicloud.git

Change directory

cd docs_aicloud/aicloud_slurm/tensorflow_keras_example/

and build our image (see the file Singularity: it contains some Python + TensorFlow + Keras).

srun singularity build --fakeroot tensorflow_keras.sif Singularity

This may take some time. Notice that when we build based on our own recipe file we need to add --fakeroot; see more here.

We will also create an 'output_data' directory

mkdir output_data

We can now run. Here we map the local directory as '/code' inside the Singularity container and then execute the Python program example.py with ten epochs

srun --gres=gpu:1 singularity exec --nv -B .:/code -B output_data:/output_data tensorflow_keras.sif python /code/example.py 10

Results are now available in 'output_data'.

TensorFlow and Jupyter notebook using TensorFlow container

We have provided a script for this. You can do the following and type in a password

$ /user/share/scripts/jupyter.sh
Using tensorflow.sif
srun: job 45060 queued and waiting for resources
srun: job 45060 has been allocated resources
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
Enter password: 
Verify password: 
[NotebookPasswordApp] Wrote hashed password to /user/its.aau.dk/tlj/.jupyter/jupyter_notebook_config.json
LIST
Point your browser to http://nv-ai-01.srv.aau.dk:8888
Press any key to close your jupyter server

The first time, this will download a new TensorFlow image. Then follow the guide printed in your terminal window on how to open the Jupyter notebook in a browser and type in your password. You have to be on the AAU network (on campus or VPN). Try e.g. new->Python 3 and execute the following cell.

!nvidia-smi

or

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

You should see that you have one V100 GPU available. You can close the server again by pressing any key in the terminal, or by cancelling the Slurm allocation.

PyTorch and Anaconda

Some images, like the PyTorch images from NGC, come with Anaconda, which is a widely used Python distribution. In this example we will build a PyTorch image and install additional Anaconda packages in the image.

First build our Singularity image from a Docker PyTorch image and install additional conda packages. The Singularity file is

BootStrap: docker
From: nvcr.io/nvidia/pytorch:20.03-py3

%post
/opt/conda/bin/conda install -c anaconda beautifulsoup4 

Go into the folder 'docs_aicloud/aicloud_slurm/pytorch_anaconda_example' and build using

srun singularity build --fakeroot pytorch.sif Singularity

Again this may take some time. Notice that we pull the PyTorch Docker image from NGC

Next we can, e.g., run our container in interactive mode

srun --pty --gres=gpu:1 singularity shell --nv pytorch.sif

and we can then use the Anaconda Python distribution to for example run IPython

$ ipython
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31) 
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import torch

In [2]: import bs4

PyTorch and multi-precision training

The NVIDIA Tesla V100 comes with specialized hardware for tensor operations called tensor cores. For the V100, the tensor cores operate on integers or half-precision floats, while the default in many DNN frameworks is single precision. Additional changes to the code are therefore necessary to activate half precision in cuDNN and utilize the tensor core hardware for:

  1. Faster execution
  2. Lower memory footprint that allows for an increased batch size.

An example of how to adapt your PyTorch code is provided here. The example uses APEX automatic mixed precision (AMP) and native Torch AMP, available in NGC images from version 20.06.

Multi-GPU data parallelism training with Horovod and Keras

The NVIDIA DGX-2 comes with specialized hardware for moving data between GPUs: NVLinks and NVSwitches. One approach to utilizing these links is the NVIDIA Collective Communication Library (NCCL). NCCL is compatible with the Message Passing Interface (MPI) used in many HPC applications and facilities. This is in turn built into the Horovod framework for data-parallel training, which supports many deep learning frameworks and requires only minor changes in the source code. In this example we show how to run Horovod on our system, including Slurm settings. You can then adapt this example to your preferred framework as described in the Horovod documentation.

TensorFlow 2 with Spyder GUI

It is possible to start a GUI in the Singularity container and show graphical elements. In this example we will start the IDE Spyder using X11 forwarding. First connect to the AI Cloud with X11 forwarding enabled

ssh <aau ID>@ai-pilot.srv.aau.dk -X

From the Git repository hosting this documentation, go to the folder 'aicloud_slurm/spyder_ide/' with the file 'tensorflow_spyder.def' containing:

BootStrap: docker
From: nvcr.io/nvidia/tensorflow:21.06-tf2-py3

%post
    apt-get update
    DEBIAN_FRONTEND=noninteractive apt-get -yq install xorg x11-apps libxkbcommon-x11-0 alsa
    pip install pandas
    pip install spyder

Build the image by running:

srun --cpus-per-task=6 singularity build --fakeroot tensorflow_spyder.sif tensorflow_spyder.def

Execute with (notice the --x11 flag to enable X11 forwarding via the Slurm scheduler)

srun --gres=gpu:1 --x11 singularity exec --nv tensorflow_spyder.sif spyder

You should now see the Spyder GUI (it may take a few seconds to appear depending on your connection bandwidth).

PyTorch with Spyder GUI

Similar to the TensorFlow example above, we can also install and run Spyder in a container based on PyTorch. From the same starting point as the TensorFlow example above, use the file 'pytorch_spyder.def' containing:

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:21.06-py3

%post
    export DEBIAN_FRONTEND=noninteractive
    apt-get update
    apt-get install -y libgl1 python3-pyqt5.qtwebkit
    /opt/conda/bin/conda install -y spyder
    unset DEBIAN_FRONTEND

Build the image by running:

srun --cpus-per-task=6 singularity build --fakeroot pytorch_spyder.sif pytorch_spyder.def

Execute with (notice the --x11 flag to enable X11 forwarding via the Slurm scheduler)

srun --gres=gpu:1 --x11 singularity exec --nv pytorch_spyder.sif spyder

If you need other versions of the PyTorch base image (such as 20.11 in the example below), this version of the recipe may be more robust, but it provides an older version of Spyder:

Bootstrap: docker
From: nvcr.io/nvidia/pytorch:20.11-py3

%post
    export DEBIAN_FRONTEND=noninteractive
    apt-get update
    apt-get install -y libgl1 spyder3
    unset DEBIAN_FRONTEND

For this alternative version, build the image as above but execute as follows:

srun --gres=gpu:1 --x11 singularity exec --nv -B /run/user/`id -u` pytorch_spyder.sif spyder3

Matlab

It is possible to run Matlab both with and without GUI

First, build your Matlab image, e.g. like

srun --cpus-per-task=6 singularity build matlab.sif docker://nvcr.io/partners/matlab:r2019b

Then we need to set an environment variable such that Matlab knows your license. In this case it is convenient to point to the AAU license server

export MLM_LICENSE_FILE=27000@matlab.srv.aau.dk

Now you can start Matlab from the command line (without the GUI) as

srun --pty --gres=gpu:1 singularity exec --nv matlab.sif matlab -nodesktop

or with GUI (if your SSH connection has X11 forwarding enabled)

srun --pty --x11 --gres=gpu:1 singularity exec --nv matlab.sif matlab
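
For non-interactive use, e.g. inside an sbatch script, Matlab can also run a script and exit (a sketch; 'myscript' is a placeholder for a Matlab script on your path, and the -batch option requires R2019a or newer):

srun --gres=gpu:1 singularity exec --nv matlab.sif matlab -batch "myscript"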

Priority

In August 2020 the queueing algorithm was changed. In the following we describe this algorithm so that it is clear in which order jobs are allocated resources.

The method is a multifactor method: a higher priority means the job will be executed before jobs with a lower priority. You can read an introduction here. The main motivation was to give people with an upcoming deadline a higher priority, and secondarily to encourage smaller jobs; both are difficult with the previously used scheduling method, first-in-first-out (FIFO) with backfill. The key factors we use in this setup are:

  1. Age (priority is increased as the job sits in the queue)
  2. Fairshare (priority is increased for users, or parts of the organization with a low resource utilization)
  3. Jobsize (priority is increased for users requesting fewer CPUs)
  4. QoS (priority is increased for users requesting resources in more resource-strict QoS, except 1CPU)

The total priority is computed as a weighted sum: each factor is normalized to [0,1] and multiplied by its weight. For our setup, this is

Job_priority =
	(PriorityWeightAge) * (age_factor) +
	(PriorityWeightFairshare) * (fair-share_factor) +
	(PriorityWeightPartition) * (partition_factor) +
	(PriorityWeightQOS) * (QOS_factor) +
	1000

The term 1000 is a constant partition factor, so for now it can be ignored.

In the following we will show the weights and calculations for each step.

You can see the weights by:

$ sprio -w
          JOBID PARTITION   PRIORITY       SITE        AGE  FAIRSHARE    JOBSIZE  PARTITION        QOS
        Weights                               1       2000       4000       2000       1000       5000

The Age is normalized to max out after 14 days

$ scontrol show config | grep PriorityMaxAge
PriorityMaxAge          = 14-00:00:00

after which it will have a factor of 1. The factor only increases if the job is eligible (within constraints and not waiting for dependencies).

The job size factor decreases with the number of requested CPUs: requesting "0" CPUs gives a factor of 0.75 (priority contribution 1500), a single CPU gives a contribution of about 1490, all 96 CPUs on a single node give 0.5 (1000), and 2*96=192 CPUs give 0.25 (500).

The fair-share factor is based on "classic fair share". The computations are a little complicated, but you can see your current fair-share factor by running

$ sshare

The QoS factor is obtained from

$ sacctmgr show qos format=name,maxtresperuser%20,maxwalldurationperjob,priority
      Name            MaxTRESPU     MaxWall   Priority 
---------- -------------------- ----------- ---------- 
    normal    cpu=20,gres/gpu=1  2-00:00:00         20 
     short    cpu=32,gres/gpu=4    03:00:00        500 
   allgpus    cpu=48,gres/gpu=8 21-00:00:00          0 
  1gpulong    cpu=16,gres/gpu=1 14-00:00:00         10 
 admintest   cpu=96,gres/gpu=16  1-00:00:00         10 
      1cpu                cpu=2    06:00:00        500 
  deadline    cpu=64,gres/gpu=8 14-00:00:00       5000 

and then normalized by the highest QoS priority (5000). To calculate the contribution to the total priority, take the QoS priority, divide by 5000 and multiply by the QoS weight of 5000, i.e. multiply by 5000/5000 = 1. For example, a job in the 'short' QoS contributes 500 * (5000/5000) = 500. To read more and obtain access to the high-priority deadline QoS, see the section Do you have an upcoming deadline?

You can see the priorities of the queue by running

$ sprio

or sorted by priority

$ sprio -S -y

including the contributions from different factors and their weights.

Fair usage

The following guidelines are in place to ensure fair usage of the system for all users. The text may be updated from time to time so that we can better serve all users.

ITS/CLAAUDIA work from the following principles for fair usage:

  • Good research is the success criterion and ITS/CLAAUDIA should lower the barrier for allowing this.
  • Researchers should enter on a level playing field.
  • ITS has an administrative and technical role and should in general not determine which research should have a higher priority. Students are vetted via a recommendation from a supervisor/staff member confirming that the usage is for research purposes.
  • Aim at the most open and unrestricted access model.

Based on these principles we kindly ask that all users consider the following guidelines:

  • Please be mindful of your allocations and refrain from allocating many resources without knowing/testing/verifying that you can indeed make good use of the allocated resources.
  • Please be mindful and de-allocate resources if you do not use them, so that other users can make good use of them.

If in doubt, you can do

squeue -u $USER

and inspect your own allocations. If you have allocations you are not using, then please cancel these resource allocations.

A few key points to remember:

  1. Please refrain from doing pre-emptive allocations. From the current load, we still conclude that there are enough resources if the resources are used wisely.
  2. There are resources available in the evenings/nights and at weekends. If possible, start your job as a batch script (sbatch) and let it queue while the computer does the work. Maybe even better, put the job in the queue late in the afternoon or use the -b/--begin option with your batch script, e.g. add the line
#SBATCH --begin=18:00:00

ITS/CLAAUDIA will keep analysing and observing the usage of the system to make the best use of the available resources based on the above principles and guidelines. If ITS/CLAAUDIA is in doubt, we will contact users and ask if the resource allocations are in line with the above principles and guidelines. We have previously contacted users in this regard, and will be more active in periods of high utilization.

Do you have an upcoming deadline?

If you are working towards an upcoming deadline, and find it difficult to have the resources you need, then please send an email to support@its.aau.dk with a URL (call for papers etc.) stating the deadline. We can provide some hints, help and possibly additional resources to help you meet your deadline.

Training

Twice a year (around April and November) we hold a 2-hour training session covering Slurm and Singularity. You can obtain the material here. We also have a recording of such a training session, which you can obtain upon request by sending an email to support@its.aau.dk.

Data deletion

From time to time we check whether users are no longer listed in the central database; users can be removed there when their studies or employment end. As a last resort, we will try to reach you by email. If this fails, we reserve the right to delete data in your home directory and on the compute nodes, e.g. under /raid.

Additional resources

Video presentations

Training

Docker and singularity on an HPC system

NVIDIA note on singularity

SLURM and singularity presentation

Trouble Shooting