Added link to multi-GPU Horovod training from the main doc

pull/6/head
parent 409d5155c8
commit 828cc0f33d
Changed files:
  1. aicloud_slurm/README.md (6 changed lines)
  2. aicloud_slurm/multi_gpu_keras/README.md (2 changed lines)

@@ -42,6 +42,7 @@ The problem had disappeared for some time, but now emerged again. We will invest
- [TensorFlow and Jupyter notebook using TensorFlow container](#tensorflow-and-jupyter-notebook-using-tensorflow-container)
- [PyTorch and Anaconda](#pytorch-and-anaconda)
- [PyTorch and multi-precision training](#pytorch-and-multi-precision-training)
- [Multi-GPU data parallelism training with Horovod and Keras](#multi-gpu-data-parallelism-training-with-horovod-and-keras)
- [TensorFlow with Spyder 3 GUI](#tensorflow-with-spyder-3-gui)
- [Matlab](#matlab)
- [Priority](#priority)
@@ -602,6 +603,11 @@ The NVIDIA Tesla V100 comes with specialized hardware for tensor operations call
An example of how to adapt your PyTorch code is provided [here](https://git.its.aau.dk/CLAAUDIA/docs_aicloud/src/branch/master/aicloud_slurm/torch_amp_example). The example uses [APEX](https://nvidia.github.io/apex/) automatic mixed precision ([AMP](https://nvidia.github.io/apex/amp.html)) and native [Torch AMP](https://pytorch.org/docs/stable/amp.html), available in NGC from version 20.06.
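The native Torch AMP pattern mentioned above can be sketched as follows. This is a minimal illustration of `torch.cuda.amp` on a toy model, not the code from the linked example; the model, data, and hyperparameters are placeholders, and the scaler/autocast steps are disabled automatically on a CPU-only machine.

```python
# Minimal sketch of native PyTorch AMP; model and data are placeholders.
import torch

model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
use_cuda = torch.cuda.is_available()
scaler = torch.cuda.amp.GradScaler(enabled=use_cuda)  # no-op when disabled

inputs = torch.randn(4, 10)
targets = torch.randint(0, 2, (4,))

optimizer.zero_grad()
with torch.cuda.amp.autocast(enabled=use_cuda):  # runs eligible ops in float16
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
scaler.scale(loss).backward()  # backward pass on the scaled loss
scaler.step(optimizer)         # unscales gradients, then optimizer.step()
scaler.update()                # adjusts the loss scale for the next iteration
```

On the V100 the `autocast` region is where the Tensor Cores are exercised; the `GradScaler` guards against gradient underflow in float16.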
## Multi-GPU data parallelism training with Horovod and Keras
The NVIDIA DGX-2 comes with specialized hardware for moving data between GPUs: NVLinks and NVSwitches. One approach to utilizing these links is the NVIDIA Collective Communications Library ([NCCL](https://developer.nvidia.com/NCCL)). NCCL is compatible with the Message Passing Interface (MPI) used in many HPC applications and facilities. This in turn is built into the Horovod framework for data-parallel training, which supports many deep learning frameworks and requires only minor changes to the source code. In [this example](https://git.its.aau.dk/CLAAUDIA/docs_aicloud/src/branch/master/aicloud_slurm/multi_gpu_keras) we show how to run Horovod on our system, including the Slurm settings. You can then adapt this example to your preferred framework as described in the [Horovod documentation](https://horovod.readthedocs.io/en/stable/).
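The "minor changes" Horovod needs in a Keras script follow a standard pattern, sketched below. This is a hedged illustration of that pattern, not the code from the linked example: the model, data, and hyperparameters are placeholder assumptions, and running it requires Horovod with MPI support installed.

```python
# Sketch of the standard Horovod changes to a Keras training script;
# model, data, and hyperparameters are placeholders.
def train():
    import numpy as np
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one MPI process per GPU

    # Pin each process to a single GPU based on its local rank
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.Sequential([tf.keras.layers.Dense(10, activation='softmax')])

    # Scale the learning rate by the worker count and wrap the optimizer so
    # gradients are averaged across workers (over NCCL/NVLink where available)
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.01 * hvd.size()))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)

    # Broadcast initial weights from rank 0 so all workers start identically
    callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

    x, y = np.random.rand(64, 4), np.random.randint(0, 10, size=64)
    model.fit(x, y, epochs=1, callbacks=callbacks,
              verbose=1 if hvd.rank() == 0 else 0)  # log only on rank 0
```

A script like this would typically be launched with one process per GPU, e.g. `horovodrun -np 4 python train.py`; on our system the process count comes from the Slurm settings shown in the linked example.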
## TensorFlow with Spyder 3 GUI
It is possible to start a GUI in the Singularity container and show graphical elements. In this example we will start the Spyder 3 IDE using X11 forwarding. First, connect to the AI Cloud with X11 forwarding enabled.

@@ -4,7 +4,7 @@ There are several methods to perform multi-GPU training. In this example we consider
https://horovod.readthedocs.io/en/stable/
-Horovod is a distributed deep learning framework that supports Keras, PyTorch, MXNet and TensorFlow. In this example we will look at training on a single node using Keras with OpenMPI, NCCL and NVLink behind the scenes.
+Horovod is a distributed deep learning data parallelism framework that supports Keras, PyTorch, MXNet and TensorFlow. In this example we will look at training on a single node using Keras with OpenMPI, NCCL and NVLink behind the scenes.
Newer images from NGC [come with Horovod](https://on-demand.gputechconf.com/gtc-cn/2018/pdf/CH8209.pdf) leveraging a number of features on the system, so in this example we first build a standard TensorFlow image (including Keras).
