Ready for Tuesday. Hurray

pull/4/head
Tobias Lindstrøm Jensen 3 years ago
parent b26a4b719c
commit f9acfc81cc
  1. aicloud_slurm/images/AICloudDesign.png (BIN)
  2. aicloud_slurm/training/SlurmAndSingularityTraining.md (310 lines changed)
  3. aicloud_slurm/training/SlurmAndSingularityTraining.pdf (BIN)
  4. aicloud_slurm/training/singularity-doc-preamble.md (12 lines changed)



# Background
- For research purposes at AAU.
- Students are allowed based on a recommendation from staff/supervisor/researcher.
- Free---but we do observe accounting.
- Currently two DGX-2 (2 petaflops per node)
![AI Cloud Design](../images/dgx2.jpg){width=50%}
## Background II
- Two DGX-2 in the AI Cloud cluster
- Shared. We try to protect data, but we have no certification for data protection.
- One DGX-2 is set aside for research with sensitive data.
- Sliced. There is one project, and others are coming, with requirements on data protection.
- A lot is happening at both the DK and EU level; the HPC landscape is being reshaped. If you need something, email CLAAUDIA@aau.dk for more information.
# System design
## High level design
![AI Cloud Design](../images/AICloudDesign.png){width=95%}
## Resource management
- What to manage: walltime, number of GPUs, number of CPUs, memory, ...
- Levels of management in slurm terms:
- Account and organization: cs, es, es.shj
- Quality of service (QoS): normal, 1gpulong, ...
- Queuing algorithm: FIFO with backfill
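To make the managed limits above concrete, a minimal request sketch (the specific values are illustrative, not site defaults):

```console
# Illustrative request: 1 GPU, 4 CPU cores, 16 GB memory, 2 hours walltime
srun --time=02:00:00 --cpus-per-task=4 --mem=16G --gres=gpu:1 --pty bash -l
```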
## Why?
- Why SLURM?
    - Resource management.
    - Transparency, fairness
    - Widely used; used at AAU before
    - General introduction to [slurm](https://www.youtube.com/watch?v=5nxMLqF6Eu8)
- Why Singularity?
    - Users can decide which software they want in their environment.
    - Draws on the Docker images NVIDIA supplies in their [NGC](https://ngc.nvidia.com/)
    - Can convert Docker images to Singularity images
    - Some issues with Docker on such a system
# Getting started
## Essential skill set and tool set
- Basic Linux and shell environment, preferably the bash scripting language
- Terminal:
    - Windows: MobaXterm (https://mobaxterm.mobatek.net/)
    - macOS: default terminal or iTerm2 (https://www.iterm2.com/)
    - Linux: GNOME Terminal, KDE Konsole
- Shell: bash (default), zsh (more feature-rich)
## Log on to the server
- Inside AAU network (on campus or inside VPN):
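A minimal sketch, assuming the same login node as in the two-step example below:

```console
ssh ai-pilot.srv.aau.dk -l <username>
```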
- Outside AAU network (external users or outside VPN)
```console
# Two-step log on
ssh sshgw.aau.dk -l <username>
ssh ai-pilot.srv.aau.dk -l <username>
# Tunneling
ssh -L 2022:ai-pilot.srv.aau.dk:22 \
-l vle@its.aau.dk sshgw.aau.dk
scp -P 2022 ~/Download/testfile vle@its.aau.dk@localhost:~/
```
<span style="color:blue;latex-color:blue">Demo</span>: Login
## Optional: Byobu (Tmux's User-friendly Wrapper for Ubuntu)
- Benefits:
- Disconnect from the server while your programs are running
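A small sketch of the detach/reattach workflow (default byobu key bindings assumed):

```console
byobu      # start or attach to a session on the login node
# ... start your long-running work, then press F6 to detach ...
byobu      # later: reattach and find your programs still running
```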
# Slurm basics
## Why slurm?
General introduction to [slurm](https://www.youtube.com/watch?v=5nxMLqF6Eu8)
### Resource management: how many and which GPUs?
### Queue system: who or which job gets run first?
## Essential commands
- `sbatch` -- Submit job script (batch mode)
- `salloc` -- Create job allocation and start a shell (one option for interactive mode)
- `srun` -- Run a command within a batch allocation that was created by sbatch or salloc (or allocates if sbatch or salloc was not used)
- `scancel` -- Cancel submitted/running jobs
- `squeue` -- View the status of the queue
- `sinfo` -- View information on nodes and queues
- `scontrol` -- View (or modify) state, e.g. of jobs (see the example below)
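A minimal round trip with these commands (the script name and job ID are illustrative):

```console
sbatch jobscript.sh        # submit; prints e.g. "Submitted batch job 525"
squeue -u $USER            # check your jobs in the queue
scontrol show job 525      # inspect state and allocated resources
scancel 525                # cancel the job if no longer needed
```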
## Interactive jobs
- `srun --pty --time=<hh:mm:ss> --gres=gpu:1 bash -l`
or
- `salloc --time=<hh:mm:ss> --gres=gpu:1`, then either
    - `srun --time=<hh:mm:ss> --gres=gpu:1 bash -l`, or
    - `ssh nv-ai-03.srv.aau.dk`
<span style="color:blue;latex-color:blue">Demo</span>: Interactive job
## Slurm Batch job script
```bash
#!/usr/bin/env bash
#SBATCH --job-name MySlurmJob
#SBATCH --partition batch # equivalent to PBS batch
#SBATCH --mail-type=ALL # NONE, BEGIN, END, FAIL, REQUEUE, ALL, TIME_LIMIT, TIME_LIMIT_90, etc.
#SBATCH --mail-user=vle@its.aau.dk
#SBATCH --dependency=aftercorr:498 # More info slurm head node: `man --pager='less -p \--dependency' sbatch`
#SBATCH --time 24:00:00 # Run 24 hours
#SBATCH --gres=gpu:1
```
## Slurm Batch job script (cont'd)
```bash
#SBATCH --qos=normal # examples: short, normal, 1gpulong, allgpus
##SBATCH --gres=gpu:1 # the extra leading '#' comments out the directive
srun echo hello world from sbatch
```
Submit job:
```console
sbatch jobscript.sh
```
<span style="color:blue;latex-color:blue">Demo</span>: Submit batch script and check job status
## Control job status
`scancel` -- signal/cancel jobs or job steps
`scancel --user="vle@its.aau.dk" --state=pending`
<span style="color:blue;latex-color:blue">Demo</span>: Cancel jobs
## Looking up things
Basics:
```console
sinfo
squeue
scontrol -d show job <JOBID>
sacctmgr show qos \
format=name,priority,maxtresperuser%20,MaxWall
```
More examples:
```console
sinfo -p batch
sinfo --Node
sinfo -o "%D %e %E %r %T %z" -p batch
squeue -u $USER -i60   # query every 60 seconds
scontrol show job <JOBID>
scontrol show partition <partitionName>
sacctmgr show qos format=name,priority,maxtresperuser,MaxWall
```
<span style="color:blue;latex-color:blue">Demo</span>: slurm query commands
## Accounting commands
- `sacct` - report accounting information by individual job and job step
- `sstat` - report accounting information about currently running jobs and job steps
- `sreport` - report resource usage by cluster, partition, user, account, etc.
- `sprio` - view factors comprising a job priority
```
sacct -A claaudia -u vle@its.aau.dk
sreport cluster AccountUtilizationByUser cluster=ngc \
    account=cs start=5/21/20 end=5/28/20 \
    format=Accounts,Cluster,TresCount,Login,Proper,Used
```
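For `sstat` and `sprio`, which have no example above, two small illustrations (the format fields are standard Slurm fields; output varies by site):

```console
sstat -j <JOBID> --format=JobID,MaxRSS,AveCPU   # live stats for a running job
sprio -l                                        # per-factor priorities of pending jobs
```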
<span style="color:blue;latex-color:blue">Demo</span>: slurm accounting commands
# Slurm: Hints for more advanced uses
## Slurm: Hints for more advanced uses
Some additional readings:
- [Multifactor Priority Plugin](https://slurm.schedmd.com/priority_multifactor.html)
- [Trackable Resource](https://slurm.schedmd.com/tres.html)
- [Accounting](https://slurm.schedmd.com/accounting.html)
- [Resource Limit](https://slurm.schedmd.com/resource_limits.html)
- [Dependencies](https://hpc.nih.gov/docs/job_dependencies.html)
- [Job array](https://slurm.schedmd.com/job_array.html) (see the sketch after this list)
- Run `man <commandname>` for built-in documentation, e.g. `man scontrol`
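A small, illustrative sketch of dependencies and job arrays (the script names are hypothetical; 525 is just an example job ID):

```console
# 5-task job array; each task reads its own $SLURM_ARRAY_TASK_ID
sbatch --array=0-4 --gres=gpu:1 jobscript.sh
# start a follow-up job only after job 525 has completed successfully
sbatch --dependency=afterok:525 postprocess.sh
```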
## Slurm job management commands
```
scontrol write batch_script job_id optional_filename
# write batch_script doesn't work on CLAAUDIA AI cloud yet
scontrol update qos=short jobid=525
```
## Slurm query commands
- `sacctmgr` - database management tool
- `sshare` - view current hierarchical fair-share information
- `sdiag` - view statistics about scheduling module operations (short examples below)
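Quick illustrations of the two query commands above:

```console
sshare -a     # fair-share usage for all users
sdiag         # scheduler statistics
```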
## sacctmgr
`sacctmgr` - database management tool
```
sacctmgr show assoc \
    format=account,user,qos,tres,maxtresperuser,grptres
sacctmgr show assoc format=account,user,qos%30
sudo sacctmgr modify QOS \
    normal set MaxTRESPerUser=gres/gpu=2
```
## strigger - event trigger management tool
```
# execute the program "/home/joe/clean_up"
# when job 1234 is within 10 minutes of reaching its time limit.
strigger --set --jobid=1234 --time --offset=-600 \
--program=/home/joe/clean_up
```
# Singularity basics
## Why singularity?
To overcome Docker's drawbacks while still working well with Docker:
1. You get your own environment.
    - Flexibility in software
    - Flexibility in version
    - User requests/changes do not affect others
2. Security
    - root access
    - resource exposure
3. Compatibility with `slurm`
    - resource policy
4. Simplicity
5. HPC-geared
## Check built-in documentation
see `singularity help <command>`
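For example, for the subcommands used in the following slides:

```console
singularity help pull
singularity help exec
```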
## Singularity build from Docker and exec command
![Build IO](./images/build_input_output.png "Build Input Output"){width=50%}
Example: pull a Docker image and convert it to a Singularity image
`srun singularity pull docker://godlovedc/lolcow`
and then run
`srun singularity run lolcow.sif`
## Build a Singularity image from Docker
`srun singularity pull docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3`
Common use case for interactive work:
```console
srun --pty --gres=gpu:1 \
singularity shell --nv tensorflow_20.03-tf2-py3.sif
nvidia-smi
ipython
import tensorflow
exit
exit
```
With the last `exit` you release the resources. Keep multiple connections, or use tmux, screen, or byobu, to avoid releasing them.
## Combining all the steps from today
Example:
```console
srun --gres=gpu:1 singularity exec --nv \
-B .:/code -B mnist-data/:/data -B output_data:/output_data \
tensorflow_20.03-tf2-py3.sif python /code/example.py 10
```
## Build a customized Singularity image
Singularity [ definition file ](https://www.sylabs.io/guides/3.0/user-guide/definition_files.html)
<span style="color:blue;latex-color:blue">Demo</span>: Write and build singularity image from a definition file
Example `Singularity` definition file:
```console
BootStrap: docker
From: nvcr.io/nvidia/tensorflow:20.03-tf2-py3
%post
pip install keras
```
You can then build with
```console
srun singularity build --fakeroot \
    tensorflow_keras.sif Singularity
```
## Running your customized Singularity images
You can then run with
```console
srun --gres=gpu:1 singularity exec --nv \
    -B .:/code -B output_data:/output_data \
    tensorflow_keras.sif python /code/example.py 10
```
or enter an interactive session with
```console
srun --pty --gres=gpu:1 singularity shell --nv \
    -B .:/code -B output_data:/output_data \
    tensorflow_keras.sif
```
# Where to go from here
- [The user documentation](https://git.its.aau.dk/CLAAUDIA/docs_aicloud/src/branch/master/aicloud_slurm)
    - More workflows
        - Copying data to the local drive for higher I/O performance
        - Inspecting your utilization
        - MATLAB, PyTorch, ...
    - Fair usage/upcoming deadlines
    - Links and references to additional material
- Support: support@its.aau.dk
- Advisory: claaudia@aau.dk
- Use the resources and give feedback. Share your success stories with us (benchmarks, solved challenges, new possibilities, etc.)
- Share with other users at the [Yammer channel](https://web.yammer.com/main/groups/eyJfdHlwZSI6Ikdyb3VwIiwiaWQiOiI4NzM1OTg5NzYwIn0/all).

aicloud_slurm/training/singularity-doc-preamble.md:
## Outline
\tableofcontents
# Slide file, and contact info
AI cloud usage: support@its.aau.dk
Regarding this presentation and related documents:
http://bit.do/aauaipilot2
Singularity specific:
https://groups.google.com/a/lbl.gov/forum/#!forum/singularity
