Table of Contents
- AI Cloud at AAU user information
- Introduction
- Getting started
- Examples
- Interactive TensorFlow
- Inspecting your utilization
- TensorFlow and Keras example with own custom setup
- TensorFlow and Jupyter notebook using TensorFlow container
- PyTorch and Anaconda
- PyTorch and multi-precision training
- Multi-GPU data parallelism training with Horovod and Keras
- TensorFlow 2 with Spyder GUI
- PyTorch with Spyder GUI
- Matlab
- Priority
- Fair usage
- Do you have an upcoming deadline?
- Training
- Data deletion
- Additional resources
AI Cloud at AAU user information
This README contains information on getting started with the "AI Cloud" DGX-2 system at AAU, links to additional resources, and examples of daily usage. Many users share this system, so please consult the section on Fair usage and follow the guidelines. We also have a community site on AAU Yammer where users can share experiences and where we announce workshops, changes, and service on the system.
Introduction
Working on the AI Cloud is based on a combination of two different mechanisms, Singularity and Slurm, which are used for the following purposes:
- Singularity is a container framework that provides you with the necessary software environment to run your computational workloads. Different researchers may need widely different software stacks, or different versions of the same software stack. In order to provide maximum flexibility to you as users and to minimise potential compatibility problems between different software installed on the compute nodes, each user's software environment(s) is defined and provisioned as Singularity containers. You can download pre-defined container images or configure and modify them yourself according to your needs. See details on container images from NGC further down.
- Slurm is a queueing system that manages resource sharing in the AI Cloud. Slurm makes sure that all users get a fair share of the resources and get served in turn. Computational work in the AI Cloud can only be carried out through Slurm. This means you can only run your jobs on the compute nodes by submitting them to the Slurm queueing system. It is also through Slurm that you request the amount of resources your job requires, such as RAM, number of CPU cores, number of GPUs etc. See how to get started with Slurm further down; a minimal example combining Slurm and Singularity is shown right after this list.
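As a preview of how the two mechanisms work together, a typical job asks Slurm for resources and lets Singularity provide the software environment. A minimal sketch (the image and script names below are placeholders, not files that already exist on the system):
srun --gres=gpu:1 singularity exec --nv myimage.sif python myscript.py
The individual parts of this command are explained step by step in the sections below.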
Getting started
An alternative introduction to the system, in workshop form, is also available.
Logging in
Once your user account has been created on the AI Cloud, you can access the platform using SSH.
In general, the AI Cloud is only directly accessible from the AAU network. In that case, you access the front-end node with:
ssh <aau-ID>@ai-pilot.srv.aau.dk
Replace <aau-ID> with your personal AAU ID. For some users, this is your email address (e.g. yourname@department.aau.dk). For newer users, your AAU ID may be separate from your email address and have the form AB12CD@department.aau.dk; if so, use your AAU ID to log in.
If you wish to access the platform while not connected to the AAU network, you have two options: use VPN or use AAU's SSH JumpHost.
If you are often outside AAU, you can set up the SSH JumpHost in your personal SSH config file (on Linux/macOS typically located at $HOME/.ssh/config):
Host claaudia-ai-cloud
    HostName ai-pilot.srv.aau.dk
    User <aau-ID>
    ProxyJump %r@sshgw.aau.dk
With this configuration added, you can connect to the platform regardless of network simply by running ssh claaudia-ai-cloud.
Slurm basics
To get a first impression, try:
$ scontrol show node
NodeName=nv-ai-01.srv.aau.dk Arch=x86_64 CoresPerSocket=24
CPUAlloc=52 CPUTot=96 CPULoad=19.52
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:16(S:0-1)
NodeAddr=nv-ai-01.srv.aau.dk NodeHostName=nv-ai-01.srv.aau.dk
OS=Linux 4.15.0-101-generic #102-Ubuntu SMP Mon May 11 10:07:26 UTC 2020
RealMemory=1469490 AllocMem=661304 FreeMem=1387641 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=batch
BootTime=2020-09-16T19:10:12 SlurmdStartTime=2020-09-17T11:24:52
CfgTRES=cpu=96,mem=1469490M,billing=271,gres/gpu=16
AllocTRES=cpu=52,mem=661304M,gres/gpu=16
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
NodeName=nv-ai-03.srv.aau.dk Arch=x86_64 CoresPerSocket=24
CPUAlloc=44 CPUTot=96 CPULoad=48.55
AvailableFeatures=(null)
ActiveFeatures=(null)
Gres=gpu:16(S:0-1)
NodeAddr=nv-ai-03.srv.aau.dk NodeHostName=nv-ai-03.srv.aau.dk
OS=Linux 4.15.0-72-generic #81-Ubuntu SMP Tue Nov 26 12:20:02 UTC 2019
RealMemory=1469490 AllocMem=1404304 FreeMem=306301 Sockets=2 Boards=1
State=MIXED ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A
Partitions=batch
BootTime=2020-09-16T19:02:40 SlurmdStartTime=2020-09-16T19:03:35
CfgTRES=cpu=96,mem=1469490M,billing=271,gres/gpu=16
AllocTRES=cpu=44,mem=1404304M,gres/gpu=15
CapWatts=n/a
CurrentWatts=0 AveWatts=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
The above command shows the names of the nodes, partitions (batch), etc. involved in the Slurm system. The key lines to determine the current resource utilization are:
CfgTRES=cpu=96,mem=1469490M,billing=271,gres/gpu=16
AllocTRES=cpu=44,mem=1404304M,gres/gpu=15
Here we can see that 44 out of 96 CPUs are allocated/reserved, 1.404 TB out of 1.469 TB of memory is allocated/reserved, and 15 out of 16 GPUs are allocated/reserved.
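If you just want a quick overview of the nodes and the queue, the standard Slurm commands sinfo and squeue give a more compact view (shown here as a sketch; the output depends on the current load):
sinfo     # summary of partitions and node states
squeue    # jobs currently queued or running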
Slurm allocate resources
We can then allocate resources for ourselves. Let us say we would like to allocate one GPU (NVIDIA V100):
salloc --gres=gpu:1
salloc: Pending job allocation 1612
salloc: job 1612 queued and waiting for resources
salloc: job 1612 has been allocated resources
salloc: Granted job allocation 1612
salloc: Waiting for resource configuration
salloc: Nodes nv-ai-03.srv.aau.dk are ready for job
We can then check the queue
squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
109 batch bash tlj@its. R 3:41 1 nv-ai-03.srv.aau.dk
Here the state (ST) is "running" (R). If there were not enough resources, the state would be "pending" (PD).
You can then run jobs within this allocation using srun, e.g. srun <bash command>:
$ srun df -h
Filesystem Size Used Avail Use% Mounted on
udev 756G 0 756G 0% /dev
tmpfs 152G 4.8M 152G 1% /run
/dev/md0 879G 275G 560G 33% /
tmpfs 756G 124M 756G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 756G 0 756G 0% /sys/fs/cgroup
/dev/nvme1n1p1 511M 6.1M 505M 2% /boot/efi
/dev/md1 28T 12T 15T 45% /raid
nv-ai-03.srv.aau.dk:/user 30T 29T 1.7T 95% /user
If you have a job on a node, you can also ssh to that node
$ ssh nv-ai-01
Welcome to NVIDIA DGX Server Version 4.7.0 (GNU/Linux 4.15.0-134-generic x86_64)
System information as of Tue Aug 24 15:17:03 CEST 2021
System load: 64.83 Processes: 1367
Usage of /: 31.2% of 878.57GB Users logged in: 0
Memory usage: 5% IP address for bond0: 172.19.20.98
Swap usage: 0%
Last login: Tue Aug 24 15:14:37 2021 from 172.19.8.14
tlj@its.aau.dk@nv-ai-01:~$
You can view information on your job using
scontrol show job <JOBID>
or see additional details, such as the GPU index (IDX), using
scontrol -d show job <JOBID>
The allocation can be relinquished by typing
exit
or by cancelling the job with
scancel <JOBID>
Slurm QoS
By default, jobs run with the 'normal' Quality of Service (QoS). If you need several GPUs (multi-GPU) or a longer run time, use the following query to see the available QoS levels and choose the one that suits your needs.
$ sacctmgr show qos format=name,maxtresperuser%20,maxwalldurationperjob
Name MaxTRESPU MaxWall
---------- -------------------- -----------
normal cpu=20,gres/gpu=1 2-00:00:00
short cpu=32,gres/gpu=4 03:00:00
allgpus cpu=48,gres/gpu=8 21-00:00:00
1gpulong cpu=16,gres/gpu=1 14-00:00:00
admintest cpu=96,gres/gpu=16 1-00:00:00
1cpu cpu=1 06:00:00
deadline cpu=64,gres/gpu=8 14-00:00:00
As an example, it is possible to allocate two GPUs with
salloc --qos=allgpus --gres=gpu:2
where 'allgpus' above can be one of the following:
- normal: for one-GPU jobs (the default QoS).
- short: for one or more small GPU jobs, e.g. for testing batch submission or short interactive jobs.
- allgpus: for one or more large GPU jobs.
- 1gpulong: for one long-running one-GPU job.
- admintest: special QoS only usable by administrators for full-node testing.
- 1cpu: assigned to inactive student users after each semester (no GPU).
- deadline: a QoS that users with a hard publication deadline can apply for access to. To get access, please follow this guide.
In addition, jobs in the more restrictive QoS groups generally have a higher priority and will therefore tend to be allocated before jobs in less restrictive groups. Jobs submitted in the 'deadline' QoS have the highest priority.
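As another sketch combining QoS and run time, a long single-GPU job could be requested in the '1gpulong' QoS with an explicit time limit; the --time value below is only an example and must stay within the MaxWall of the chosen QoS:
salloc --qos=1gpulong --gres=gpu:1 --time=7-00:00:00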
Getting your (Singularity) environment up
It is possible to run everything from a single line. First, pull a Docker image and convert it to a Singularity image:
srun singularity pull docker://nvcr.io/nvidia/cuda:latest
This may take some time. There is now an image called 'cuda_latest.sif' that we can make use of. The address 'docker://nvcr.io/nvidia/cuda:latest' identifies where to retrieve the image on NVIDIA GPU Cloud (NGC) - more about this further down.
Notice that we used both Slurm (srun) and Singularity (singularity) above to retrieve an image. srun executes the actual job in the AI Cloud. singularity must be executed on one of the DGX-2 compute nodes, which srun takes care of; you cannot execute singularity directly on the front-end node. Here, singularity retrieves the specified (Docker) image from NGC and automatically converts it to a Singularity image.
To execute a command at the compute node with certain specified resources, we can do
$ srun --gres=gpu:1 singularity exec --nv docker://nvcr.io/nvidia/cuda:latest nvidia-smi
srun: job 264 queued and waiting for resources
srun: job 264 has been allocated resources
WARNING: group: unknown groupid 140195
Mon Apr 15 12:54:43 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.104 Driver Version: 410.104 CUDA Version: 10.0 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM3... Off | 00000000:34:00.0 Off | 0 |
| N/A 33C P0 52W / 300W | 0MiB / 32480MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Here, --gres=gpu:1 states that we want to allocate one GPU. The --nv argument enables the NVIDIA runtime environment so the GPU is available inside the container.
A few words on container images
In the previous section, we saw how to retrieve an image from NGC and how to instantiate that image as a Singularity container and execute a command in it. The image was retrieved from NVIDIA GPU Cloud (NGC). NGC is NVIDIA's official repository with many different images of useful software environments for deep learning and other GPU-accelerated tasks. The images in NGC have been specially built for NVIDIA GPUs and can be very convenient to use out of the box instead of having to configure your software environment from scratch yourself.
NGC's container images are Docker images, but Singularity can convert them on the fly to run as Singularity containers. Docker itself is not used in the AI Cloud due to security concerns in multi-user environments. You can also use images from other Docker repositories such as Docker Hub. Remember to look for images that are built with support for NVIDIA GPUs.
You can read more here about how to Build images from scratch or modify images from NGC.
Once images are built, they are immutable, meaning that you cannot install additional software inside the images themselves at container runtime. Sometimes, especially when running Python software, it can be convenient to install additional packages into your runtime environment using pip. This can cause special challenges, which you can read more about, and see how to solve, in Installing Python packages with pip in Singularity containers. See also Troubleshooting.
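For illustration only, a quick way to add a Python package from inside a running container is pip's --user flag, which installs into your home directory (~/.local) rather than into the read-only image; the package name below is a placeholder, and the guide linked above describes the pitfalls of this approach in more detail:
Singularity> pip install --user some-package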
Where to save your files
The front-end node and compute nodes have access to a distributed file system in '/user' based on NFS. To see this, first create a file on the front-end node in your home directory
echo "I have just created this" > example.txt
This file is now available on the front-end node, but you can also "see" it on a compute node:
srun cat example.txt
Using the scratch space
Each compute node has a scratch space for storing temporary data. It is a RAID0 NVME (SSD) partition with high disk I/O capacity. Following are two ways you can use it for your jobs:
Via interactive bash session:
Since the RAID scratch space is local to each compute node, you need to specify the exact node you want to use with the --nodelist
argument as below:
srun --pty --nodelist=nv-ai-01.srv.aau.dk bash -l
srun: job 575 has been allocated resources
vle@its.aau.dk@nv-ai-01:~$ ls /raid/
vle@its.aau.dk@nv-ai-01:~$ mkdir -p /raid/its.vle # create a folder to hold your data. It's a good idea to use this path pattern: /raid/<subdomain>.<username>.
vle@its.aau.dk@nv-ai-01:~$ cp -a /user/its.aau.dk/vle/testdata /raid/its.vle/
vle@its.aau.dk@nv-ai-01:~$ exit # quit interactive session.
After the data has been copied to the scratch folder, you can use it by referring to the data in your code. For example:
srun --pty --nodelist=nv-ai-01.srv.aau.dk ls /raid/its.vle/testdata
Via sbatch job script
You can script the whole chain of commands into an sbatch
script
#!/usr/bin/env bash
#SBATCH --job-name MySlurmJob # CHANGE this to a name of your choice
#SBATCH --partition batch # equivalent to PBS batch
#SBATCH --time 24:00:00 # Run 24 hours
#SBATCH --qos=normal # possible values: short, normal, allgpus, 1gpulong
#SBATCH --gres=gpu:1 # CHANGE this if you need more or less GPUs
#SBATCH --nodelist=nv-ai-01.srv.aau.dk # CHANGE this to nodename of your choice. Currently only two possible nodes are available: nv-ai-01.srv.aau.dk, nv-ai-03.srv.aau.dk
##SBATCH --dependency=aftercorr:498 # More info slurm head node: `man --pager='less -p \--dependency' sbatch`
## Preparation
mkdir -p /raid/its.vle # create a folder to hold your data. It's a good idea to use this path pattern: /raid/<subdomain>.<username>.
if [ ! -d /raid/its.vle/testdata ]; then
# Wrap this copy command inside the if condition so that we copy data only if the target folder doesn't exist
cp -a /user/its.aau.dk/vle/testdata /raid/its.vle/
fi
## Run actual analysis
## The benefit of using multiple srun commands is that they create sub-jobs (job steps) within your sbatch job, which can be used for advanced usage with Slurm (e.g. checkpoints, recovery, etc.)
srun python /path/to/my/python/script --arg1 --arg2
srun echo finish analysis
A script such as the above can be submitted to the Slurm queue using the sbatch
command
sbatch <name-of-script>
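After submitting, you can follow the job in the queue; by default, Slurm writes the job's output to a file named slurm-<JOBID>.out in the directory you submitted from (the script name below is a placeholder):
sbatch my_job_script.sh
squeue -u $USER          # check that the job is queued or running
cat slurm-<JOBID>.out    # inspect the job's output so far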
Transferring files
You can transfer files to/from AI Cloud using the command line utility scp
from your local computer (Linux and OS X). To AI Cloud:
$ scp some-file <USER ID>@ai-pilot.srv.aau.dk:~
where '~' is your user folder on AI Cloud. <USER ID>
could for example be ‘ab34ef@department.aau.dk’.
You can append folders below that to your destination:
$ scp some-file <USER ID>@ai-pilot.srv.aau.dk:~/some-folder/some-subfolder/
From AI Cloud:
$ scp <USER ID>@ai-pilot.srv.aau.dk:~/some-file some-local-folder
In general, file transfer tools that can use SSH as protocol should work. A common choice is FileZilla or the Windows option WinSCP.
If you wish to mount a folder in AI Cloud on your local computer for easier access, you can also do this using sshfs
(Linux command line example executed on your local computer):
mkdir aicloud-home
sshfs <USER ID>@ai-pilot.srv.aau.dk:/user/<DOMAIN>/<ID> aicloud-home
where <DOMAIN>
is 'department.aau.dk' and <ID>
is 'ab34ef' for user 'ab34ef@department.aau.dk'.
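When you are done, the folder can be unmounted again from your local computer (Linux example; on macOS, umount aicloud-home typically works instead):
fusermount -u aicloud-home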
Ways of using Slurm
Just to summarise the above examples, there are three typical ways of executing work through Slurm:
- Allocate resources: use salloc (see Slurm allocate resources). This reserves the requested resources for you, which you can then use when they are ready. This means you can, for example, log into the specified compute node(s) to use them interactively (you cannot do this without having allocated resources on them).
- Run a command directly through Slurm: use srun (see Getting your (Singularity) environment up). This runs the specified command directly on the requested resources as soon as they are ready. The srun command will block until done.
- Schedule one or more jobs to run whenever resources become ready: use sbatch (example in Via sbatch job script). This command lets you specify the details of your job in a script which you submit to the queue via sbatch. This is convenient if, for example, you have a large job consisting of many steps, or many jobs, that you want to specify at once and just leave to Slurm to get done when resources are ready.
This is the most convenient way to run jobs once you know in advance what you need done. It allows you to specify even very large and complicated amounts of work up front and then leave it to Slurm to run things as soon as resources are available. This way, you will not have to sit around and wait. The sbatch command returns immediately, and you can then use squeue to inspect where your jobs are in the queue.
Too many open files
There is a limit on the number of open files you can have on the compute nodes. These limits are inherited from the limits on the login node. If you need to work with many files, you might need to increase the default value. To see and alter the default limit, you can do:
nv-ai-fe01:~$ ulimit -n
4096
nv-ai-fe01:~$ srun --pty bash -c 'ulimit -n'
4096
nv-ai-fe01:~$ ulimit -n 16384
nv-ai-fe01:~$ srun --pty bash -c 'ulimit -n'
16384
Examples
Interactive TensorFlow
First we pull a TensorFlow image
srun singularity pull docker://nvcr.io/nvidia/tensorflow:19.03-py3
The pull address of the container can be found from the NGC catalog.
We can then do
srun --gres=gpu:1 --pty singularity shell --nv tensorflow_19.03-py3.sif
or do the same by referring directly to the Docker image 'docker://nvcr.io/nvidia/tensorflow:19.03-py3'.
You now have shell access
Singularity tensorflow_19.03-py3.sif:~>
Documentation and examples are available in '/workspace/'. You can exit the interactive session with
exit
Inspecting your utilization
It is recommended practice, after you have configured your environment/pipeline, to inspect your GPU utilization (in %), and possibly memory utilization, to check that you
- indeed are utilizing the GPU as expected.
- achieve a somewhat acceptable level of GPU utilization.
You can do this with the nvidia-smi
command, by executing e.g. the following in your environment. First, get an interactive resource:
srun --gres=gpu:1 --pty singularity shell --nv myimage.sif
Start observing with nvidia-smi:
nvidia-smi --query-gpu=index,timestamp,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free --format=csv -l 5 > util.csv &
then, while nvidia-smi logs utilization in the background, start your (small) code, e.g.
python .....
Afterwards have a look at the reported utilization
Singularity> cat util.csv
index, timestamp, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, 2020/11/12 13:48:11.412, 99 %, 52 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:16.415, 99 %, 90 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:21.417, 99 %, 51 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:26.419, 98 %, 61 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:31.424, 100 %, 88 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:36.427, 32 %, 33 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:41.428, 98 %, 70 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:46.435, 100 %, 91 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:51.437, 99 %, 48 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:48:56.440, 100 %, 91 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:01.441, 99 %, 38 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:06.443, 97 %, 63 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:11.444, 35 %, 36 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:16.446, 98 %, 72 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:21.447, 100 %, 88 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:26.449, 98 %, 68 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 13:49:31.451, 98 %, 68 %, 32480 MiB, 30994 MiB, 1486 MiB
Now we are certain that the code and software are set up to utilize the GPU.
You can also do this with the script getUtilByJobId after submitting a job:
nv-ai-fe01:~$ getUtilByJobId.sh 83549
To end, press CTRL-C.
utilization.gpu
Percent of time over the past sample period during which one or more kernels was executing on the GPU.
The sample period may be between 1 second and 1/6 second depending on the product.
utilization.memory
Percent of time over the past sample period during which global device memory was being read or written.
The sample period may be between 1 second and 1/6 second depending on the product.
memory.total
Total installed GPU memory.
memory.used
Total memory allocated by active contexts.
memory.free
Total free memory.
salloc: Granted job allocation 84172
salloc: Waiting for resource configuration
salloc: Nodes nv-ai-03.srv.aau.dk are ready for job
tlj@its.aau.dk@nv-ai-03.srv.aau.dk's password:
index, timestamp, utilization.gpu [%], utilization.memory [%], memory.total [MiB], memory.used [MiB], memory.free [MiB]
0, 2020/11/12 14:02:14.854, 98 %, 66 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:19.858, 0 %, 0 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:24.861, 48 %, 42 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:29.863, 99 %, 72 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:34.864, 72 %, 57 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:39.868, 100 %, 68 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:44.871, 99 %, 66 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:49.872, 99 %, 66 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:54.875, 99 %, 68 %, 32480 MiB, 30994 MiB, 1486 MiB
0, 2020/11/12 14:02:59.884, 11 %, 7 %, 32480 MiB, 30994 MiB, 1486 MiB
^Csalloc: Relinquishing job allocation 84172
where 83549 is the Slurm JobId. If your job does not behave as intended, please analyse the problem and try to solve the issue. It is not good practice to have GPUs allocated but not utilized on the system. If you cannot find the issue, please email support@its.aau.dk with as many details as possible, preferably with a minimal working example.
TensorFlow and Keras example with own custom setup
In this example we will set everything up on the compute node by checking out a Git repository. Notice that some of the steps do not need to be executed for every run, e.g. when the code, software or data is already available.
Clone the repository locally
git clone https://git.its.aau.dk/CLAAUDIA/docs_aicloud.git
Change directory
cd docs_aicloud/aicloud_slurm/tensorflow_keras_example/
and build our image (see the file 'Singularity': it contains some Python + TensorFlow + Keras):
srun singularity build --fakeroot tensorflow_keras.sif Singularity
This may take some time. Notice that when we build from our own recipe file, we need to add --fakeroot; see more here.
We will also create an 'output_data' directory
mkdir output_data
We can now run. Here we map the local directory as '/code' inside the Singularity container and then execute the Python program example.py with ten epochs:
srun --gres=gpu:1 singularity exec --nv -B .:/code -B output_data:/output_data tensorflow_keras.sif python /code/example.py 10
Results are now available in 'output_data'.
TensorFlow and Jupyter notebook using TensorFlow container
We have provided a script for this. You can do the following and type in a password
$ /user/share/scripts/jupyter.sh
Using tensorflow.sif
srun: job 45060 queued and waiting for resources
srun: job 45060 has been allocated resources
slurmstepd: task_p_pre_launch: Using sched_affinity for tasks
Enter password:
Verify password:
[NotebookPasswordApp] Wrote hashed password to /user/its.aau.dk/tlj/.jupyter/jupyter_notebook_config.json
LIST
Point your browser to http://nv-ai-01.srv.aau.dk:8888
Press any key to close your jupyter server
The first time, this will download a new TensorFlow image. Then follow the guide printed in your terminal window on how to open the Jupyter notebook in a browser and type in your password. You have to be on the AAU network (on campus or VPN). Try e.g. new->Python 3 and execute the following cell.
!nvidia-smi
or
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
You should see that you have one V100 GPU available. You can close the server again by pressing any key in the terminal, or by cancelling the Slurm allocation.
PyTorch and Anaconda
Some images, like the PyTorch images from NGC, come with Anaconda, which is a widely used Python distribution. In this example we will build a PyTorch image and install additional Anaconda packages in the image.
First, we build our Singularity image from a Docker PyTorch image and install additional conda packages. The Singularity file is:
BootStrap: docker
From: nvcr.io/nvidia/pytorch:20.03-py3
%post
/opt/conda/bin/conda install -c anaconda beautifulsoup4
Go into the folder 'docs_aicloud/aicloud_slurm/pytorch_anaconda_example' and build using
srun singularity build --fakeroot pytorch.sif Singularity
Again, this may take some time. Notice that we pull the PyTorch Docker image from NGC.
Next we can, e.g., run our container in interactive mode
srun --pty --gres=gpu:1 singularity shell --nv pytorch.sif
and we can then use the Anaconda Python distribution to, for example, run IPython:
$ ipython
Python 3.6.9 |Anaconda, Inc.| (default, Jul 30 2019, 19:07:31)
Type 'copyright', 'credits' or 'license' for more information
IPython 7.12.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: import torch
In [2]: import bs4
PyTorch and multi-precision training
The NVIDIA Tesla V100 comes with specialized hardware for tensor operations called tensor cores. For the V100, the tensor cores work on integers or half-precision floats, while the default in many DNN frameworks is single precision. Additional changes to the code are therefore necessary to activate half-precision modes in cuDNN and utilize the tensor core hardware for:
- Faster execution
- Lower memory footprint that allows for an increased batch size.
An example of how to adapt your PyTorch code is provided here. The example uses APEX automatic mixed precision (AMP) as well as native Torch AMP, which is available in NGC containers from version 20.06.
Multi-GPU data parallelism training with Horovod and Keras
The NVIDIA DGX-2 comes with specialized hardware for moving data between GPUs: NVLinks and NVSwitches. One approach to utilizing these links is the NVIDIA Collective Communication Library (NCCL). NCCL is compatible with the Message Passing Interface (MPI) used in many HPC applications and facilities. This, in turn, is built into the Horovod framework for data-parallel training, which supports many deep learning frameworks and requires only minor changes to the source code. In this example we show how to run Horovod on our system, including Slurm settings. You can then adapt this example to your preferred framework as described in the Horovod documentation.
TensorFlow 2 with Spyder GUI
It is possible to start a GUI in the Singularity container and show graphical elements. In this example we will start the IDE Spyder using X11 forwarding. First connect to the AI Cloud with X11 forwarding enabled
ssh <aau ID>@ai-pilot.srv.aau.dk -X
From the Git repository hosting this documentation, go to the folder 'aicloud_slurm/spyder_ide/' with the file 'tensorflow_spyder.def' containing:
BootStrap: docker
From: nvcr.io/nvidia/tensorflow:21.06-tf2-py3
%post
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get -yq install xorg x11-apps libxkbcommon-x11-0 alsa
pip install pandas
pip install spyder
Build the image by running:
srun --cpus-per-task=6 singularity build --fakeroot tensorflow_spyder.sif tensorflow_spyder.def
Execute with (notice the --x11
flag to enable X11 forwarding via the Slurm scheduler)
srun --gres=gpu:1 --x11 singularity exec --nv tensorflow_spyder.sif spyder
You should now see the Spyder GUI (it may take a few seconds to appear depending on your connection bandwidth).
PyTorch with Spyder GUI
Similar to the TensorFlow example above, we can also install and run Spyder in a container based on PyTorch. From the same starting point as the TensorFlow example above, use the file 'pytorch_spyder.def' containing:
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:21.06-py3
%post
export DEBIAN_FRONTEND=noninteractive
apt-get update
apt-get install -y libgl1 python3-pyqt5.qtwebkit
/opt/conda/bin/conda install -y spyder
unset DEBIAN_FRONTEND
Build the image by running:
srun --cpus-per-task=6 singularity build --fakeroot pytorch_spyder.sif pytorch_spyder.def
Execute with (notice the --x11
flag to enable X11 forwarding via the Slurm scheduler)
srun --gres=gpu:1 --x11 singularity exec --nv pytorch_spyder.sif spyder
If you need other versions of the PyTorch base image (such as 20.11 in the example below), this version of the recipe may be more robust, but it provides an older version of Spyder:
Bootstrap: docker
From: nvcr.io/nvidia/pytorch:20.11-py3
%post
export DEBIAN_FRONTEND=noninteractive
apt-get update
apt-get install -y libgl1 spyder3
unset DEBIAN_FRONTEND
For this alternative version, build the image as above but execute as follows:
srun --gres=gpu:1 --x11 singularity exec --nv -B /run/user/`id -u` pytorch_spyder.sif spyder3
Matlab
It is possible to run Matlab both with and without the GUI.
First, build your Matlab image, e.g.:
srun --cpus-per-task=6 singularity build matlab.sif docker://nvcr.io/partners/matlab:r2019b
Then we need to set an environment variable so that Matlab knows your license. In this case, it is convenient to point to the AAU license server:
export MLM_LICENSE_FILE=27000@matlab.srv.aau.dk
Now you can start Matlab in pure command-line mode with
srun --pty --gres=gpu:1 singularity exec --nv matlab.sif matlab -nodesktop
or with GUI (if your SSH connection has X11 forwarding enabled)
srun --pty --x11 --gres=gpu:1 singularity exec --nv matlab.sif matlab
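If you instead want to run a Matlab script non-interactively, for example from an sbatch script, a sketch along these lines should work with MLM_LICENSE_FILE set as above; 'myscript' is a placeholder for a file myscript.m in your current directory, and the -batch option requires Matlab R2019a or newer:
srun --gres=gpu:1 singularity exec --nv matlab.sif matlab -batch "myscript"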
Priority
In August 2020 the queueing algorithm was changed. In the following, we describe this algorithm so that it is clear in which order jobs are allocated resources.
The method is a multifactor method: a job with a higher priority will be executed before jobs with a lower priority. You can read an introduction here. The main motivation was to give users with an upcoming deadline a higher priority and, secondarily, to encourage smaller jobs; both are difficult with the previously used scheduling method, first-in-first-out (FIFO) with backfill. The key factors we use in this setup are:
- Age (priority is increased as the job sits in the queue)
- Fairshare (priority is increased for users, or parts of the organization with a low resource utilization)
- Jobsize (priority is increased for users requesting fewer CPUs)
- QoS (priority is increased for users requesting resources in more resource-strict QoS, except 1CPU)
The total priority is computed as a weighted sum: each factor (normalized to [0,1]) is multiplied by its weight. For our setup, this is:
Job_priority =
(PriorityWeightAge) * (age_factor) +
(PriorityWeightFairshare) * (fair-share_factor) +
(PriorityWeightJobSize) * (job_size_factor) +
(PriorityWeightQOS) * (QOS_factor) +
1000
The constant term 1000 is the partition contribution (partition weight 1000 times a partition factor of 1, as there is only one partition), which is the same for all jobs and can therefore be ignored.
In the following we will show the weights and calculations for each step.
You can see the weights by:
$ sprio -w
JOBID PARTITION PRIORITY SITE AGE FAIRSHARE JOBSIZE PARTITION QOS
Weights 1 2000 4000 2000 1000 5000
The age factor is normalized to max out after 14 days
$ scontrol show config | grep PriorityMaxAge
PriorityMaxAge = 14-00:00:00
after which it has a factor of 1. The factor only increases while the job is eligible (within constraints and not waiting for dependencies).
The job size factor ranges from 0.5 (a priority contribution of 1000) for a user requesting all CPUs on a single node (96), or 0.25 (500) for 2*96=192 CPUs, up to 0.75 (1500) for "0" CPUs; a single CPU gives a priority contribution of about 1490.
The fair-share factor is based on "classic fair share". The computations are a little complicated, but you can see your current fair-share factor by running
$ sshare
The QoS factor is obtained from
$ sacctmgr show qos format=name,maxtresperuser%20,maxwalldurationperjob,priority
Name MaxTRESPU MaxWall Priority
---------- -------------------- ----------- ----------
normal cpu=20,gres/gpu=1 2-00:00:00 20
short cpu=32,gres/gpu=4 03:00:00 500
allgpus cpu=48,gres/gpu=8 21-00:00:00 0
1gpulong cpu=16,gres/gpu=1 14-00:00:00 10
admintest cpu=96,gres/gpu=16 1-00:00:00 10
1cpu cpu=2 06:00:00 500
deadline cpu=64,gres/gpu=8 14-00:00:00 5000
and then normalized by the highest QoS priority (5000). To calculate the contribution to the job priority, you take the QoS priority, divide by 5000 and multiply by the QoS weight (5000), i.e. multiply by 5000/5000 = 1. To read more and obtain access to the high-priority 'deadline' QoS, see the section Do you have an upcoming deadline?
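As a worked example using the weights above: a job that has waited 7 days (age factor 0.5), has a fair-share factor of 0.5, requests a single CPU (a job size contribution of roughly 1490) and runs in the 'short' QoS (QoS priority 500, i.e. a QoS factor of 500/5000 = 0.1) gets approximately
Job_priority = 2000*0.5 + 4000*0.5 + 1490 + 5000*0.1 + 1000
             = 1000 + 2000 + 1490 + 500 + 1000
             = 5990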
You can see the priorities of the queue by running
$ sprio
or sorted by priority
$ sprio -S -y
including the contributions from different factors and their weights.
Fair usage
The following guidelines are in place to ensure fair usage of the system for all users. This text may be updated from time to time so that we can better serve all users.
ITS/CLAAUDIA work from the following principles for fair usage:
- Good research is the success criterion and ITS/CLAAUDIA should lower the barrier for allowing this.
- Researchers should enter on a level playing field.
- ITS has an administrative and technical role and should, in general, not determine which research should have a higher priority. Students are vetted via a recommendation from a supervisor/staff member confirming that access is for research purposes.
- Aim at the most open and unrestricted access model.
Based on these principles we kindly ask that all users consider the following guidelines:
- Please be mindful of your allocations and refrain from allocating many resources without knowing/testing/verifying that you can indeed make good use of the allocated resources.
- Please be mindful and de-allocate resources if you do not use them, so that other users can make good use of them.
If in doubt, you can do
squeue -u $USER
and inspect your own allocations. If you have allocations you are not using, please cancel these resource allocations.
A few key points to remember:
- Please refrain from making pre-emptive allocations. Based on the current load, we still conclude that there are enough resources if they are used wisely.
- There are resources available in the evenings/nights and at weekends. If possible, submit your job as a batch script (sbatch), let it queue, and rest while the computer does the work. Maybe even better, queue the job late in the afternoon or use the -b, --begin option with your batch script, e.g. add the line
#SBATCH --begin=18:00:00
ITS/CLAAUDIA will keep analysing and observing the usage of the system to make the best use of the available resources based on the above principles and guidelines. If ITS/CLAAUDIA is in doubt, we will contact users and ask if the resource allocations are in line with the above principles and guidelines. We have previously contacted users in this regard, and will be more active in periods of high utilization.
Do you have an upcoming deadline?
If you are working towards an upcoming deadline, and find it difficult to have the resources you need, then please send an email to support@its.aau.dk with a URL (call for papers etc.) stating the deadline. We can provide some hints, help and possibly additional resources to help you meet your deadline.
Training
Twice a year (around April and November) we run a 2-hour training session covering Slurm and Singularity. You can obtain the material here. We also have a recording of such a training session, which you can obtain upon request by sending an email to support@its.aau.dk.
Data deletion
From time to time we check whether users are no longer listed in the central database. Users can be removed when their studies or employment end. As a last resort, we will try to reach you by email. If this fails, we reserve the right to delete data in your home directory and on the compute nodes, e.g. in /raid.
Additional resources
Docker and Singularity on an HPC system