Ready for Tuesday. Hurray

Tobias Lindstrøm Jensen 3 years ago
parent b26a4b719c
commit f9acfc81cc

# Background
- For research purposes at AAU.
- Students are allowed based on a recommendation from staff/supervisor/researcher.
- Free, but we do observe accounting.
- Currently two DGX-2 (2 petaflops per node)
![AI Cloud Design](../images/dgx2.jpg){width=50%}
## Background II
- Two DGX-2 in the AI Cloud cluster
- Shared. We try to protect data, but we have no certificate on data protection.
- One DGX-2 set aside for research with sensitive data.
- Sliced. There is one project, and others are coming with requirements on data protection.
- A lot is happening at both the DK and EU level; the HPC landscape is being reshaped. If you need something, email us for more information.
# System design
## High level design
![AI Cloud Design](../images/AICloudDesign.png){width=95%}
## Resource management
- What to manage: walltime, number of GPUs, number of CPUs, memory, ...
- Levels of management in Slurm terms:
  - Account and organization: cs, es, es.shj
  - Quality of service (QoS): normal, 1gpulong, ...
- Queuing algorithm: FIFO with backfill
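A toy sketch of the backfill rule (an illustration only, not Slurm's actual scheduler code): a job behind the queue head may start immediately only if it fits in the currently free GPUs and would finish before the head job's reserved start time.

```shell
# Toy model of FIFO with backfill.
# Arguments: gpus_needed walltime free_gpus now head_start
can_backfill() {
    local gpus=$1 wall=$2 free=$3 now=$4 head=$5
    # fits in the GPUs free right now AND done before the head job starts
    [ "$gpus" -le "$free" ] && [ $((now + wall)) -le "$head" ]
}

# Head-of-queue job needs more GPUs than the 4 free now; reserved start t=10.
can_backfill 2 5  4 0 10 && echo "small short job: backfilled"
can_backfill 2 20 4 0 10 || echo "small long job: would delay the head job"
can_backfill 8 5  4 0 10 || echo "big job: does not fit now"
```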
## Why?
- Why SLURM?
- Resource management.
- Transparency, fairness
- Widely used. Used before at AAU
- General introduction about [slurm](
- Why Singularity?
- Users can decide which software they want in their environment.
- Draw on the Docker images NVIDIA supplies in their [NGC](
- Can convert Docker to Singularity image
- Some issues with Docker on such a system
# Getting started
## Essential skill set and tool set
- Basic Linux and shell environment, preferably the bash scripting language
- Terminal:
  - Windows: MobaXterm (
  - macOS: default terminal or iTerm2 (
  - Linux: Gnome terminal, KDE Konsole
- Shell: bash (default), zsh (more feature-rich)
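A minimal sketch of the bash basics the course assumes (variables, a loop, command substitution); the values are made up for illustration:

```shell
#!/usr/bin/env bash
# A few bash basics: variables, loops, command substitution.

name="world"
echo "hello, ${name}"

# loop over a list
for i in 1 2 3; do
    echo "iteration ${i}"
done

# capture a command's output in a variable
today=$(date +%Y-%m-%d)
echo "date: ${today}"
```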
## Log on to the server
- Inside AAU network (on campus or inside VPN):
- Outside AAU network (external users or outside VPN)
```
# Two-step log on
ssh -l <username>
ssh -l <username>

# Tunneling
ssh -L \
scp -P 2022 ~/Download/testfile
```
<span style="color:blue;latex-color:blue">Demo</span>: Login
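The two-step log on can be wrapped in a `~/.ssh/config` entry using `ProxyJump`, so a single command performs both hops. The host and user names below are placeholders; take the real ones from the user documentation:

```
# ~/.ssh/config -- sketch with placeholder names
Host aicloud
    HostName <frontend>              # the AI Cloud front-end node
    User <username>
    ProxyJump <username>@<gateway>   # the AAU SSH gateway
```

With this entry, `ssh aicloud` logs on in one step, and `scp`/`rsync` can use the same alias.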
## Optional: Byobu (Tmux's User-friendly Wrapper for Ubuntu)
- Benefits:
- Disconnect from the server while your programs are running
# Slurm basics
## Why Slurm?
General introduction about [slurm](
### Resource management: how many and which GPUs?
### Queue system: who or which job gets run first?
## Essential commands
- `sbatch` -- Submit job script (batch mode)
- `salloc` -- Create job allocation and start a shell (one option for interactive mode)
- `srun` -- Run a command within a batch allocation that was created by sbatch or salloc (or create the allocation if sbatch or salloc was not used)
- `scancel` -- Cancel submitted/running jobs
- `squeue` -- View the status of the queue
- `sinfo` -- View information on nodes and queues
- `scontrol` -- View (or modify) state, e.g. of jobs
## Interactive jobs
- `srun --pty --time=<hh:mm:ss> --gres=gpu:1 bash -l`
- `salloc --time=<hh:mm:ss> --gres=gpu:1`
- `ssh`
<span style="color:blue;latex-color:blue">Demo</span>: Interactive job
## Slurm batch job script

```
#!/usr/bin/env bash
#SBATCH --job-name MySlurmJob
#SBATCH --partition batch           # equivalent to PBS batch
#SBATCH --time 24:00:00             # run 24 hours
#SBATCH --gres=gpu:1
#SBATCH --qos=normal                # examples: short, normal, 1gpulong, allgpus
##SBATCH --gres=gpu:1               # commented out
#SBATCH --dependency=aftercorr:498  # more info on the slurm head node: `man --pager='less -p \--dependency' sbatch`

srun echo hello world from sbatch
```

Submit job:
<span style="color:blue;latex-color:blue">Demo</span>: Submit batch script and check job status
## Control job status
`scancel` -- signal/cancel jobs or job steps
`scancel --user="" --state=pending`
<span style="color:blue;latex-color:blue">Demo</span>: Cancel jobs
## Looking up things

```
scontrol show job <JOBID>
scontrol show partition <partitionName>
sinfo -p batch
sinfo --Node
sinfo -o "%D %e %E %r %T %z" -p batch
squeue -u $USER -i60    # query every 60 seconds
sacctmgr show qos format=name,priority,maxtresperuser,MaxWall
```

<span style="color:blue;latex-color:blue">Demo</span>: slurm query commands

Cancelling a job or job step: `scancel <jobid>`
## Accounting commands
- `sacct` - report accounting information by individual job and job step
- `sstat` - report accounting information about currently running jobs and job steps
- `sreport` - report resource usage by cluster, partition, user, account, etc.
- `sprio` - view factors comprising a job's priority

```
sacct -A claaudia -u
sreport cluster AccountUtilizationByUser cluster=ngc \
    account=cs start=5/21/20 end=5/28/20
```
<span style="color:blue;latex-color:blue">Demo</span>: slurm accounting commands
## Slurm: Hints for more advanced uses
Some additional readings:
- [Multifactor Priority Plugin](
- [Trackable Resource](
- [Accounting](
- [Resource Limit](
- [Dependencies](
- [Job array](
- run `man <commandname>` for builtin documentation. For example: `man scontrol`
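As a sketch of the dependency and job-array mechanics listed above (the script names and array bounds are made-up placeholders, and this only runs on a Slurm cluster):

```
# submit a first job and capture its job id (--parsable prints just the id)
jid=$(sbatch --parsable preprocess.sh)

# start the second job only after the first completes successfully
sbatch --dependency=afterok:${jid} train.sh

# a job array of 10 tasks, at most 2 running at once;
# each task can read its index from $SLURM_ARRAY_TASK_ID
sbatch --array=0-9%2 sweep.sh
```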
## Slurm job management commands
```
scontrol write batch_script <job_id> <optional_filename>
# write batch_script doesn't work on the CLAAUDIA AI Cloud yet
scontrol update qos=short jobid=525
```
## Slurm query commands
- `sacctmgr` - database management tool
- `sshare` - view current hierarchical fair-share information
- `sdiag` - view statistics about scheduling module operations

## sacctmgr

```
sacctmgr show assoc format=account,user,qos%30
sudo sacctmgr modify QOS \
    normal set MaxTRESPerUser=gres/gpu=2
```
## strigger - event trigger management tool

```
# execute the program "/home/joe/clean_up"
# when job 1234 is within 10 minutes of reaching its time limit
strigger --set --jobid=1234 --time --offset=-600 \
```
# Singularity basics
## Why Singularity?

To overcome Docker's drawbacks while still working well with Docker:

1. You get your own environment.
   - Flexibility in software
   - Flexibility in versions
   - User requests/changes do not affect others
2. Security
   - root access
   - resource exposure
3. Compatibility with `slurm`
   - resource policy
4. Simplicity
5. HPC-geared
## Check built-in documentation
see `singularity help <command>`
## Singularity build from Docker and exec command

![Build IO](./images/build_input_output.png "Build Input Output"){width=50%}

Example: pull a Docker image and convert it to a Singularity image

`srun singularity pull docker://godlovedc/lolcow`

and then run

`srun singularity run lolcow.sif`

# Common use cases

## Run stock Docker images (not recommended)

```
srun --gres=gpu:2 singularity run --nv -B $HOME/data:/data \
    docker:// nvidia-smi
```

## Build Singularity image from NGC Docker image

```srun singularity pull docker://```

Common use case for interactive work:

```
srun --pty --gres=gpu:1 \
    singularity shell --nv tensorflow_20.03-tf2-py3.sif
```

```
import tensorflow
```

With the last exit you release the resources. Keep multiple connections, or use tmux, screen, or byobu to avoid releasing them.
## Combining all the steps from today

```
srun --gres=gpu:1 singularity exec --nv \
    -B .:/code -B mnist-data/:/data -B output_data:/output_data \
    tensorflow_20.03-tf2-py3.sif python /code/ 10
```
## Build a customized Singularity image

Singularity [ definition file ](

<span style="color:blue;latex-color:blue">Demo</span>: Write and build a singularity image from a definition file

Example `Singularity` definition file:

```
BootStrap: docker

pip install keras
```

You can then build with

```
srun singularity build --fakeroot \
    tensorflow_keras.sif Singularity
```
## Running your customized Singularity images

You can then run with

```
srun --gres=gpu:1 singularity exec --nv \
    -B .:/code -B output_data:/output_data \
    tensorflow_keras.sif python /code/ 10
```

or enter an interactive session with

```
srun --pty --gres=gpu:1 singularity shell --nv \
    -B .:/code -B output_data:/output_data \
    tensorflow_keras.sif
```

<span style="color:blue;latex-color:blue">Demo</span>: run, exec, inspect, shell
# Where to go from here
- [The user documentation](
- More workflows
- Copying data to the local drive for higher I/O performance
- Inspecting your utilization
- Matlab, pytorch, ...
- Fair usage/upcoming deadline
- Links and references to additional material
- Support:
- Advisory:
- Use the resource and give feedback. Share with us your success stories (including benchmarks, solved challenges, new possibilities, etc.)
- Share with other users at the [Yammer channel](

## Outline
# Slide file, and contact info
AI cloud usage:
Regarding this presentation and related documents:
Singularity specific:!forum/singularity