The NVIDIA V100 comes with specialized hardware for tensor operations called tensor cores. Using the tensor cores through mixed precision computations gives two main benefits:
1. Faster execution
2. Lower memory footprint that allows for an increased batch size.
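As a small illustration of the second point (a generic sketch, not code from the linked example), a half-precision tensor occupies half the memory of a single-precision one:

```python
import torch

x_fp32 = torch.zeros(1024, 1024)   # float32: 4 bytes per element
x_fp16 = x_fp32.half()             # float16: 2 bytes per element

print(x_fp32.element_size() * x_fp32.nelement())   # 4194304 bytes (4 MiB)
print(x_fp16.element_size() * x_fp16.nelement())   # 2097152 bytes (2 MiB)
```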
An example of how to adapt your PyTorch code is provided [here](https://git.its.aau.dk/CLAAUDIA/docs_aicloud/src/branch/master/aicloud_slurm/torch_amp_example). The example uses [APEX](https://nvidia.github.io/apex/) automatic mixed precision ([AMP](https://nvidia.github.io/apex/amp.html)) and native [Torch AMP](https://pytorch.org/docs/stable/amp.html), available in NGC containers from version 20.06.
## PyTorch and automatic mixed precision
The following is an example of using automatic mixed precision ([AMP](https://nvidia.github.io/apex/amp.html)) for PyTorch with [APEX](https://nvidia.github.io/apex/) and with native [Torch AMP](https://pytorch.org/docs/stable/amp.html), available in NGC from version 20.06. The benefits in general are:
1. Faster computations due to the introduction of half-precision floats and tensor core operations with e.g. V100 GPUs.
2. Larger batch size as the loss, cache and gradients can be saved at a lower precision.
For more information, see the slides on [training neural networks with tensor cores](https://nvlabs.github.io/eccv2020-mixed-precision-tutorial/files/dusan_stosic-training-neural-networks-with-tensor-cores.pdf), which present the two methods for doing AMP that we use below, as well as these [videos on mixed precision training](https://developer.nvidia.com/blog/video-mixed-precision-techniques-tensor-cores-deep-learning/).
The following example should be seen as an illustration of how to approach AMP. The underlying problem can be solved more easily with linear least squares, which we use to validate the results. The example is adapted from the PyTorch [documentation](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html).
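As a rough, self-contained sketch of the native Torch AMP pattern on a similar cubic-fit problem (this is not the exact code from the linked example; the data, model and hyperparameters below are illustrative assumptions), the forward pass runs under `autocast` and the backward pass goes through a `GradScaler`:

```python
import torch

# Illustrative data for a cubic fit y ≈ a + b*x + c*x^2 + d*x^3 to sin(x).
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
x = torch.linspace(-3.1416, 3.1416, 2000, device=device)
y = torch.sin(x)
X = torch.stack([x, x ** 2, x ** 3], dim=1)    # features: x, x^2, x^3

model = torch.nn.Linear(3, 1).to(device)       # the bias plays the role of 'a'
optimizer = torch.optim.SGD(model.parameters(), lr=1e-6)
loss_fn = torch.nn.MSELoss(reduction="sum")

use_amp = device.type == "cuda"
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for step in range(2000):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):   # forward pass in mixed precision
        loss = loss_fn(model(X).squeeze(-1), y)
    scaler.scale(loss).backward()   # scale the loss to avoid FP16 gradient underflow
    scaler.step(optimizer)          # unscales gradients, skips the step on overflow
    scaler.update()                 # adapts the loss scale for the next iteration

print(f"MSE loss = {loss.item():.2e}")
```

The APEX variant of the same step instead wraps the model and optimizer once with `model, optimizer = amp.initialize(model, optimizer, opt_level="O1")` and replaces the scaler calls with `with amp.scale_loss(loss, optimizer) as scaled_loss: scaled_loss.backward()` followed by a plain `optimizer.step()`.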
Notice the changes at particular parts of the code due to the use of the different AMP approaches (and no AMP). The [PyTorch containers from NGC](https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel_20-11.html#rel_20-11) come with APEX. If you run this, using e.g. the Slurm batch script job.sh, you should obtain the following output:
```console
Using device: cuda:0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 8.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 4.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 2.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 1.0
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.5
Gradient overflow. Skipping step, loss scaler 0 reducing loss scale to 0.25
Torch with amp apex : MSE loss = 8.86e+00
Torch with amp apex : y = 4.94e-04 + 8.57e-01 x + 4.99e-04 x^2 + -9.37e-02 x^3
Torch with amp native: MSE loss = 8.85e+00
Torch with amp native: y = 4.97e-04 + 8.57e-01 x + 4.98e-04 x^2 + -9.35e-02 x^3
Torch without amp : MSE loss = 8.92e+00
Torch without amp : y = 5.00e-04 + 8.57e-01 x + 5.00e-04 x^2 + -9.28e-02 x^3
LS model : MSE loss = 8.82e+00
LS model : y = -5.91e-18 + 8.57e-01 x + 0.00e+00 x^2 + -9.33e-02 x^3
```
Notice that the final accuracies of Torch with and without the AMP methods are comparable, but slightly lower than that of the exact linear least-squares solution used here for validation.
It is unclear whether we are actually using tensor cores in this small example, but the code is now structured such that more advanced neural network models can make use of tensor cores via the above recipe.
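If in doubt, one simple sanity check (a suggestion, not part of the example above) is to verify that the GPU has tensor cores at all; they require compute capability 7.0 or higher, which the V100 satisfies:

```python
import torch

# Tensor cores were introduced with the Volta architecture (compute capability 7.0).
if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability()
    print(f"{torch.cuda.get_device_name()}: compute capability {major}.{minor}")
    print("Tensor cores available:", (major, minor) >= (7, 0))
else:
    print("No CUDA device visible")
```

Whether the tensor cores are actually exercised by a given kernel is best confirmed by profiling, e.g. with NVIDIA Nsight Compute.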