Multi-GPU with TensorFlow-Keras and Horovod

There are several methods to perform multi-GPU training. In this example we consider Keras as bundled with TensorFlow, combined with Horovod.

https://horovod.readthedocs.io/en/stable/

Horovod is a distributed deep learning framework for data-parallel training that supports Keras, PyTorch, MXNet and TensorFlow. In this example we will look at training on a single node using Keras, with OpenMPI, NCCL and NVLink doing the work behind the scenes.

Newer images from NGC come with Horovod built to leverage a number of features of the system, so in this example we first build a standard TensorFlow image (including Keras):

$ make build
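The image is built from a Singularity definition file. As a rough, hypothetical sketch (the base image and tag below are assumptions; the actual Singularity file in this directory is authoritative), such a file can be as small as bootstrapping from an NGC TensorFlow container, which already ships Keras and an NCCL/MPI-enabled Horovod:

# Illustrative Singularity definition; the tag is an assumption, check the
# Singularity file in this directory for the image actually used.
Bootstrap: docker
From: nvcr.io/nvidia/tensorflow:21.01-tf2-py3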

You can see how it is built in the files Makefile and Singularity. Next we update a few lines in our code; have a look at the Horovod-Keras documentation or at example.py.
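For reference, the typical Horovod-Keras changes look roughly like the sketch below. This is only illustrative and not the exact contents of example.py; the model, the data and the number of epochs are assumed to be defined as in an ordinary single-GPU Keras script.

import tensorflow as tf
import horovod.tensorflow.keras as hvd

# Initialize Horovod and pin each process to a single GPU
hvd.init()
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# model, x_train, y_train and epochs are assumed to be defined as usual (omitted)

# Scale the learning rate by the number of workers and wrap the optimizer
opt = tf.keras.optimizers.Adam(0.001 * hvd.size())
opt = hvd.DistributedOptimizer(opt)
model.compile(loss='sparse_categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])

callbacks = [
    # Make sure all workers start from the same initial weights
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
]
# Only rank 0 writes checkpoints (here to the bind-mounted /output_data)
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('/output_data/checkpoint-{epoch}.h5'))

model.fit(x_train, y_train, batch_size=128, epochs=epochs,
          callbacks=callbacks, verbose=1 if hvd.rank() == 0 else 0)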

Next we set up the job as a Slurm batch script (jobscript.sh):

#!/bin/bash
#SBATCH --job-name MGPU
#SBATCH --time=1:00:00
#SBATCH --qos=allgpus
#SBATCH --gres=gpu:4
#SBATCH --mem=60G
#SBATCH --cpus-per-gpu=4
#SBATCH --ntasks=4

echo "Date              = $(date)"
echo "Hostname          = $(hostname -s)"
echo "Working Directory = $(pwd)"
echo "JOB ID            = $SLURM_JOB_ID"
echo ""
echo "Hostname                       = $SLURM_NODELIST"
echo "Number of Tasks Allocated      = $SLURM_NTASKS"
echo "Number of CPUs on host         = $SLURM_CPUS_ON_NODE"
echo "GPUs                           = $GPU_DEVICE_ORDINAL"

nvidia-smi nvlink -gt d > nvlink_start-$SLURM_JOB_ID.out
nvidia-smi --query-gpu=index,timestamp,utilization.gpu,utilization.memory,memory.total,memory.used,memory.free --format=csv -l 5 > util-$SLURM_JOB_ID.csv &
singularity exec --nv -B .:/code -B output_data:/output_data tensorflow_keras.sif horovodrun -np $SLURM_NTASKS --mpi-args="-x NCCL_DEBUG=INFO" python /code/example.py
nvidia-smi nvlink -gt d > nvlink_end-$SLURM_JOB_ID.out

In this case we allocate 4 GPUs. Note that we also request 4 tasks. We then record a few pieces of information in the slurm-<jobid>.out output that may come in handy. Next, we collect information on the amount of data transferred over the NVLinks; we do this both at the start and at the end of the job so that we can compute the difference. We also log the GPU utilization to a file to see how well the GPUs are utilized.

The line

singularity exec --nv -B .:/code -B output_data:/output_data tensorflow_keras.sif horovodrun -np $SLURM_NTASKS --mpi-args="-x NCCL_DEBUG=INFO" python /code/example.py

is the key line here. We use our TensorFlow/Keras image and run the MPI wrapper horovodrun with SLURM_NTASKS=4 processes. The -B .:/code and -B output_data:/output_data options bind the current directory and a local output_data directory into the container as /code and /output_data. We here use an inside-out, or self-contained, approach where horovodrun/mpirun is invoked from inside the container.

We can also check the available features of Horovod from inside the container using:

horovodrun --check-build


Available Frameworks:
    [X] TensorFlow
    [ ] PyTorch
    [ ] MXNet

Available Controllers:
    [X] MPI
    [X] Gloo

Available Tensor Operations:
    [X] NCCL
    [ ] DDL
    [ ] CCL
    [X] MPI
    [X] Gloo    
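With the image built above, this check can for example be run through Singularity as:

$ singularity exec --nv tensorflow_keras.sif horovodrun --check-build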

We can then have a look at the different solutions, with say 1 and 4 GPUs, by running the batch scripts through the Makefile:

$ make run
$ make runsingle

and investigate the runtime ('slurm-<jobid>.out'), the GPU utilization ('util-<jobid>.csv'), and the amount of data moved over NVLink during execution ('nvlink_start-<jobid>.out' and 'nvlink_end-<jobid>.out').
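The utilization log is a plain CSV file, so it can be skimmed with standard command-line tools, for example (using the job id of the 4-GPU run shown further down):

$ column -s, -t util-95659.csv | less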

Looking first at 'slurm-<jobid>.out', we observe lines with P2P, indicating that NVLink is used:

[1,0]<stdout>:nv-ai-03:62491:62509 [0] NCCL INFO Channel 11 : 0[39000] -> 1[57000] via P2P/IPC
....
[1,0]<stdout>:nv-ai-03:62491:62509 [0] NCCL INFO 12 coll channels, 16 p2p channels, 16 p2p channels per peer

We can also compare 'nvlink_start-<jobid>.out' and 'nvlink_end-<jobid>.out' to see the delta between the start and the end of the job.

start:

...
GPU 3: Tesla V100-SXM3-32GB (UUID: GPU-295726b4-8888-d1eb-5965-a70cbb91d136)
	 Link 0: Data Tx: 19241022 KiB
	 Link 0: Data Rx: 19255516 KiB
...

end:

...
GPU 3: Tesla V100-SXM3-32GB (UUID: GPU-295726b4-8888-d1eb-5965-a70cbb91d136)
	 Link 0: Data Tx: 20886285 KiB
	 Link 0: Data Rx: 20904969 KiB
...
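On Link 0 of GPU 3, for example, the 4-GPU job transmitted 20886285 - 19241022 = 1645263 KiB (roughly 1.6 GiB) and received 20904969 - 19255516 = 1649453 KiB over this link.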

For the single GPU job, the delta is zero.

We can also observe the beginning and the final output of the 4-GPU training

$ cat slurm-95659.out | (head -n 10; tail -n 3)
Date              = Wed  3 Feb 10:15:18 CET 2021
Hostname          = nv-ai-03
Working Directory = /user/its.aau.dk/tlj/docs_aicloud/aicloud_slurm/multi_gpu_keras
JOB ID            = 95659

Hostname                       = nv-ai-03.srv.aau.dk
Number of Tasks Allocated      = 4
Number of CPUs on host         = 8
GPUs                           = 0,1,2,3
2021-02-03 10:15:23.012238: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:Test loss: 0.01960514299571514
[1,0]<stdout>:Test accuracy: 0.9936000108718872
[1,0]<stdout>:Train time: 124.74971508979797

and the single-GPU training

$ cat slurm-95660.out | (head -n 10; tail -n 3)
Date              = Wed  3 Feb 10:15:21 CET 2021
Hostname          = nv-ai-03
Working Directory = /user/its.aau.dk/tlj/docs_aicloud/aicloud_slurm/multi_gpu_keras
JOB ID            = 95660

Hostname                       = nv-ai-03.srv.aau.dk
Number of Tasks Allocated      = 1
Number of CPUs on host         = 2
GPUs                           = 0
2021-02-03 10:15:24.461425: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
[1,0]<stdout>:Test loss: 0.022991640493273735
[1,0]<stdout>:Test accuracy: 0.9926000237464905
[1,0]<stdout>:Train time: 472.6957468986511

Due to the random initialization and the change in effective batch size, we do not end up at exactly the same solution, but we do observe comparable results, and the 4-GPU solution is approximately 4x faster (472.7 s vs. 124.7 s, a speed-up of about 3.8).

This job uses a small model and dataset, so it is difficult to achieve high GPU utilization; it is simply an example to get started :-)