Minor updates to training slides

master
Thomas Arildsen 1 year ago
parent fbc74efa2b
commit 7179d026d5
  1. 47
      aicloud_slurm/training/SlurmAndSingularityTraining.md
  2. BIN
      aicloud_slurm/training/SlurmAndSingularityTraining.pdf
  3. 6
      aicloud_slurm/training/singularity-doc-preamble.md

@ -3,17 +3,29 @@
## Background I
- For research purposes at AAU.
- Allow students based on recommendation from staff/supervisor/researcher.
- Free---but we do observe accounting.
- Admit students based on recommendation from staff/supervisor/researcher.
- Free---but the system does attempt to balance load evenly among departments.
## Background II
- Two DGX-2 in the AI Cloud cluster
- Shared. We try to protect data but some things are not put in place roughly [levels 0 and 1](https://www.security.aau.dk/dataclassification/)
- One DGX-2 set a side for research with confidential/sensitive (levels 2 and 3) data.
- Sliced (virtual machines). There are projects, and more are coming with requirements on data protection.
- GPU system. CPU primary computations should be done somewhere else. [Cloud: Strato](https://strato.claaudia.aau.dk) or [uCloud](https://cloud.sdu.dk), possibly [VMWare](https://www.en.its.aau.dk/instructions/VMware).
- A lot of things are happening both in [DK](https://www.nyheder.aau.dk/2020/nyhed/ny-dansk-supercomputer-skaber-langt-mere-samfundsvaerdi.cid489812) and at EU level. HPC landscape is being reshaped. If you need something, then email CLAAUDIA@aau.dk for more information.
- Two NVIDIA DGX-2 in the AI Cloud cluster
- Shared. Users' data separated by ordinary file system access
restrictions. Not suitable for sensitive/secret data. Usable for
[levels 0 and 1](https://www.security.aau.dk/dataclassification/)
- One DGX-2 set aside for research with confidential/sensitive (levels
2 and 3) data.
- Sliced (virtual machines). There are projects, and more are coming
with requirements on data protection.
- GPU system. CPU-primary computations should be done somewhere
else. [Cloud: Strato](https://strato-new.claaudia.aau.dk) or
[uCloud](https://cloud.sdu.dk), possibly
[VMWare](https://www.en.its.aau.dk/instructions/VMware).
- A lot of things are happening both in
[DK](https://www.deic.dk/da/Supercomputere/Nationale-HPC-anlog) and
at [EU
level](https://www.lumi-supercomputer.eu/the-first-phase-of-lumi-has-been-installed/). The
HPC landscape is being reshaped. If you need something, then email
CLAAUDIA@aau.dk for more information.
# System design
@ -25,10 +37,10 @@
## Essential skill set and tool set
- Basic Linux, and Shell environment, preferably bash scripting language
- Basic Linux, and shell environment, preferably bash scripting language
- Terminal:
- Windows: MobaXterm (https://mobaxterm.mobatek.net/)
- MacOS default terminal or iTerm2 (https://www.iterm2.com/)
- Windows: MobaXterm (<https://mobaxterm.mobatek.net/>)
- MacOS default terminal or iTerm2 (<https://www.iterm2.com/>)
- Linux: Gnome terminal, KDE konsole
- Shell: bash (default), zsh (more feature-rich)
@ -71,7 +83,7 @@ scp -P 2022 ~/Download/testfile <aau ID>@localhost:~/
# Slurm basics
## Why?
## Slurm queue manager
- Why Slurm?
- Resource management.
- Transparency, fairness
@ -161,7 +173,6 @@ sreport -tminper cluster utilization --tres="gres/gpu" \
sacctmgr show qos \
format=name,priority,maxtresperuser%20,MaxWall
sacctmgr show assoc format=account,user%30,qos%40
sudo sacctmgr modify user <user> set QOS+=deadline
```
Follow the guidelines on the documentation page and submit an email to support@its.aau.dk if you have a paper deadline.
@ -196,7 +207,7 @@ Some additional readings:
4. HPC-oriented
5. Users familiar with Docker might experience a slow build process.
Refs Docker vs. Singularity discussion: [ref](https://pythonspeed.com/articles/containers-filesystem-data-processing/) and [ref2](https://www.reddit.com/r/docker/comments/7y2yp2/why_is_singularity_used_as_opposed_to_docker_in/)
Refs Docker vs. Singularity discussion: [[1]](https://pythonspeed.com/articles/containers-filesystem-data-processing/) and [[2]](https://www.reddit.com/r/docker/comments/7y2yp2/why_is_singularity_used_as_opposed_to_docker_in/)
## Check built-in documentation
@ -206,7 +217,7 @@ Refs Docker vs. Singularity discussion: [ref](https://pythonspeed.com/articles/c
## Singularity build from Docker and exec command
Example: Pull a Docker image and convert to singularity image
Example: Pull a Docker image and convert to Singularity image
`srun singularity pull docker://godlovedc/lolcow`
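A minimal sketch of the pull-then-exec workflow named in the heading above, assuming Singularity 3.x. The output file name `lolcow.sif` is an illustrative choice, not taken from the slides, and the commands are only printed here so they can be checked before being run on the cluster.

```shell
# Illustrative output file name (an assumption, not from the slides).
IMAGE="lolcow.sif"

# Pull the Docker image and convert it to a Singularity image on a compute node.
PULL_CMD="srun singularity pull ${IMAGE} docker://godlovedc/lolcow"

# Run a single command inside the converted image.
EXEC_CMD="srun singularity exec ${IMAGE} cowsay moo"

# Print the commands for inspection (dry run).
echo "${PULL_CMD}"
echo "${EXEC_CMD}"
```

To actually run them on the front-end node, replace the `echo` lines with `eval "${PULL_CMD}"` and `eval "${EXEC_CMD}"`.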
@ -294,7 +305,7 @@ srun --pty --gres=gpu:1 singularity shell --nv \
On the node:
- View resource utilization on compute node (shh in):
- View resource utilization on compute node (ssh in):
* ```$ top -u <user>```
* ```$ smem -u -k```
* ```$ nvidia-smi -l 1 -i <IDX>``` # see scontrol -d show job <jobId>
@ -303,7 +314,7 @@ On the node:
- Data in e.g. /user/student.aau.dk/ are on a distributed file system
* Consider using /raid (SSD NVMe) on the compute node (see doc)
- If you have allocated a GPU, your job information contains ```mem=10000M```, and the job just stays pending (state=PD, possible reason=resources) even though resources should be available:
* Issue: cancel and add e.g. --mem=64G to you allocation
* Issue: cancel and add e.g. `--mem=64G` to your allocation
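The memory tip above can be sketched as follows. This is a hedged example: the job ID `12345` and the image name `my_image.sif` are hypothetical placeholders, and the commands are printed rather than executed so they can be reviewed first.

```shell
# Hypothetical ID of the pending job; look up your own with `squeue -u $USER`.
JOBID=12345

# Cancel the pending job that was submitted with the default memory request.
CANCEL_CMD="scancel ${JOBID}"

# Request the allocation again with an explicit memory size.
# my_image.sif is a placeholder for your own Singularity image.
RUN_CMD="srun --pty --gres=gpu:1 --mem=64G singularity shell --nv my_image.sif"

# Print the commands for inspection (dry run).
echo "${CANCEL_CMD}"
echo "${RUN_CMD}"
```

On the cluster, run the two printed commands in order. The value `64G` is the example figure from the slide, not a general recommendation.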
## Tools, tips and tricks II
@ -338,7 +349,7 @@ We see challenges towards the end of semesters (cyclic):
- More workflows
- Copying data to the local drive for higher I/O performance
- Inspecting your utilization
- Matlab, pytorch, ...
- Matlab, PyTorch, ...
- Fair usage/upcoming deadline
- Links and references to additional material
- Support (fastest response): support@its.aau.dk

@ -1,7 +1,7 @@
---
title: Slurm and Singularity Training
subtitle: for AI cloud (Pilot Phase 2)
date: April 2021
title: Introduction to AI Cloud
subtitle: Slurm and Singularity Training
date: October 2021
theme: AAUsimple
# aspectratio: 169
header-includes: |
