Improving fastai’s mixed precision support with NVIDIA’s Automatic Mixed Precision.

TL;DR: For best results with mixed precision training, use NVIDIA’s Automatic Mixed Precision together with fastai, and remember to set any epsilons, for example in the optimizer, correctly.


Newer NVIDIA GPUs such as the consumer RTX range, the Tesla V100 and others have hardware support for half-precision / fp16 tensors.

This is interesting because many deep neural networks still function perfectly well if you store most of their parameters in the far more compact 16-bit floating point format. The newer hardware (sometimes called Tensor Cores) is able to further accelerate these half-precision operations.

In other words, with one of the newer cards, you’ll be able to fit a significantly larger neural network into the usually quite limited GPU memory (with CNNs, I can work with networks that are 80% larger), and you’ll be able to train that network substantially faster.

fastai has built-in support for mixed-precision training, but NVIDIA’s AMP handles it better because it uses dynamic, instead of static, loss scaling.

In the rest of this blog post, I briefly explain the two steps you need to take to get all of this working.

Step 1: Set epsilon so it doesn’t disappear under fp16

I’m mentioning this first so you don’t miss it.

Even after adding AMP to your configuration, you might still see NaNs during network training.

If you’re lucky, you will run into this post on the PyTorch forums.

In short, the torch.optim.Adam optimizer, and probably a number of other optimizers in PyTorch, take an epsilon argument which is added to possibly small denominators to avoid dividing by zero.

The default value of epsilon is 1e-8. Whoops!

Under fp16 encoding, 1e-8 becomes 0, and so it won’t really help to fix your possibly zero denominators.
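
To see the rounding for yourself, here is a minimal sketch using only the Python standard library. The helper name to_fp16 is my own, not part of PyTorch or fastai; it simply round-trips a value through the IEEE 754 half-precision format via the struct module.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format code packs/unpacks a 16-bit float (Python 3.6+).
    return struct.unpack("e", struct.pack("e", x))[0]


# Compare the Adam default with two larger candidates.
for eps in (1e-8, 1e-7, 1e-4):
    print(f"eps={eps:g} stored in fp16 as {to_fp16(eps):g}")
```

The default 1e-8 comes back as exactly zero, while 1e-7 and 1e-4 both survive as nonzero values.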

The fix is simple: supply a larger epsilon.

Because I’m using fastai’s Learner directly, and this takes a callable for the optimization function, I created a partial:

# create fp16-safe AdamW
# see the PyTorch forums post mentioned above
# the default eps=1e-8 is rounded to 0 in fp16
# values down to about 1e-7 can still be represented
# eps is added to the denominator to prevent divide-by-zero errors
from functools import partial
import torch
AdamW16 = partial(torch.optim.Adam, betas=(0.9,0.99), eps=1e-4)

# then stick model + databunch into new Learner
learner = fai.basic_train.Learner(data, model, loss_func=ml_sm_loss, metrics=metrics, opt_func=AdamW16)

Step 2: Set up NVIDIA’s Automatic Mixed Precision

fastai’s built-in support for mixed precision training certainly works in many cases. However, it uses a configurable static loss scaling parameter (default 512.0), which in some cases won’t get you as far as dynamic loss scaling.

With dynamic loss scaling, the scaling factor is continuously adapted to squeeze the most out of the available precision.
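
To make the idea concrete, here is a toy sketch of the usual dynamic loss scaling rule: back off when the gradients overflow, and grow again after a run of clean steps. This is not Apex's actual implementation. The class name DynamicLossScaler, the initial scale, the factor of 2 and the growth interval are all illustrative assumptions.

```python
class DynamicLossScaler:
    """Toy dynamic loss scaler: back off on overflow, grow after a clean streak."""

    def __init__(self, init_scale=2.0 ** 16, factor=2.0, growth_interval=2000):
        # All three defaults are assumptions chosen for illustration.
        self.scale = init_scale
        self.factor = factor
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, grads_are_finite):
        """Adjust the scale; return True if the optimizer step should be applied."""
        if not grads_are_finite:
            # Overflow (inf/NaN gradients): shrink the scale and skip this step.
            self.scale /= self.factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps >= self.growth_interval:
            # A long enough run without overflow: try a larger scale again.
            self.scale *= self.factor
            self.clean_steps = 0
        return True


scaler = DynamicLossScaler(init_scale=1024.0, growth_interval=3)
for finite in (True, False, True, True, True):
    applied = scaler.update(finite)
    print(f"finite={finite} applied={applied} scale={scaler.scale}")
```

A static scale, by contrast, stays at whatever value you picked up front, whether or not it suits the gradients your network actually produces.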

(You could read sgugger’s excellent summary of mixed precision training on the fastai forums.)

I was trying to fit a squeeze and excitation ResNeXt-50 32×4 with image size 400×400 and batch size 24 into the 8GB RAM of the humble but hard-working RTX2070 in my desktop, so I needed all of the dynamic scaling help I could get.

After applying the epsilon fix mentioned above, install NVIDIA Apex, and then make three changes to your own and fastai’s code.

Install NVIDIA Apex

Download and install NVIDIA Apex into the Python environment you’re using for your fastai experiment.

conda activate your_fastai_env
cd ~
git clone
cd apex
python setup.py install --cuda_ext --cpp_ext

If apex does not build, you can also try without --cuda_ext --cpp_ext, although it’s best if you can get the extensions built.

Modify your training script

At the top of your training script, before any other imports (especially anything to do with PyTorch), add the following:

from apex import amp
amp_handle = amp.init(enabled=True)

This will initialise apex, enabling it to hook into a number of PyTorch calls.

Modify fastai’s training loop

You will have to modify fastai’s training loop, i.e. the source file that defines the loss_batch function, which you should be able to find in your_env_dir/lib/python3.7/site-packages/fastai/. Check and double-check that you have the right file.

At the top of this file, before any other imports, add the following:

from apex.amp import amp
# retrieve initialised AMP handle
amp_handle = amp._DECORATOR_HANDLE

Then, edit the loss_batch function according to the following instructions and code-snippet. You will only add two new code lines which will replace the loss.backward() that you will be commenting out.

if opt is not None:
    loss = cb_handler.on_backward_begin(loss)

    # The following lines REPLACE the commented-out "loss.backward()"
    # opt is an OptimWrapper -- unwrap to get real optimizer
    with amp_handle.scale_loss(loss, opt.opt) as scaled_loss:
        scaled_loss.backward()

    # loss.backward()

All of this is merely following NVIDIA AMP’s usage instructions, which I most recently tested on fastai v1.0.42, latest at the time of this writing.


If everything goes according to plan, you should be able to obtain the following well-known graph with a much larger network than you otherwise would have been able to.

The example learning-rate finder plot below was made with the se-resnext50-32x4d, image size 400×400, batch size 24 on my RTX 2070 as mentioned above. The procedure documented in this post works equally well on high-end units such as the V100.


PyTorch 1.0 preview (Dec 6, 2018) packages with full CUDA 10 support for your Ubuntu 18.04 x86_64 systems.

(The wheel has now been updated to the latest PyTorch 1.0 preview as of December 6, 2018.)

You’ve just received a shiny new NVIDIA Turing (RTX 2070, 2080 or 2080 Ti), or maybe even a beautiful Tesla V100, and now you would like to try out mixed precision (well mostly fp16) training on those lovely tensor cores, using PyTorch on an Ubuntu 18.04 LTS x86_64 system.


The idea is that these tensor cores chew through fp16 much faster than they do through fp32. In practice, neural networks tolerate having large parts of themselves living in fp16, although one does have to be careful with this. Furthermore, fp16 promises to save a substantial amount of graphics memory, enabling one to train bigger models.
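
As a back-of-the-envelope illustration of the memory argument, the sketch below compares how much space the same number of values takes in fp32 and in fp16. The helper tensor_mib and the parameter count are hypothetical, chosen only to make the arithmetic visible.

```python
# Storage size per value for the two floating point formats.
BYTES_PER_VALUE = {"fp32": 4, "fp16": 2}


def tensor_mib(n_values, dtype):
    """Memory needed to store n_values numbers of the given dtype, in MiB."""
    return n_values * BYTES_PER_VALUE[dtype] / 2 ** 20


n_params = 25_000_000  # hypothetical parameter count, not a measured model
for dtype in ("fp32", "fp16"):
    print(f"{dtype}: {tensor_mib(n_params, dtype):.1f} MiB")
```

The same halving applies to activations, which usually dominate memory use during training. Note that mixed precision setups typically also keep an fp32 copy of the weights for the optimizer, so the overall saving is less than a clean factor of two.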

For full fp16 support on the Turing architecture, CUDA 10 is currently the best option. Also, a number of CUDA 10 specific improvements were made to PyTorch after the 0.4.1 release.

However, PyTorch 1.0 (the first release after 0.4.1) is not quite ready yet, and CUDA 10 builds of the current PyTorch 1.0 preview / PyTorch nightly are not easy to find either.

Oh noes…

Well, fret no more!

Here you’ll be able to find a fully CUDA 10 based build (pip wheel format) of PyTorch master as of November 10 (updated!), 2018, up to and including commit b5db6ac. I’ve linked it with a fully CUDA 10 based build of MAGMA 2.4.0 as well, which I built as a conda package.

Installing and using these packages.

Ensure that you have an Ubuntu 18.04 LTS system with CUDA 10 and CUDNN installed and configured. See this great CUDA 10 howto by Puget Systems.

After this, you will also need to download the CUDNN 7 packages for your system from the NVIDIA Developer site. An NVIDIA developer account (free signup) is required for this. I downloaded and installed libcudnn7_7.4.1.5-1+cuda10.0_amd64.deb and libcudnn7-dev_7.4.1.5-1+cuda10.0_amd64.deb but you’ll probably only need the former.

Set up a suitable conda environment with Python 3.7. Create and activate it with something like the following:

conda create -n pt python=3.7 numpy mkl mkl-include setuptools cmake cffi typing
conda activate pt
conda install -c mingfeima mkldnn

You can now download the PyTorch nightly wheel of 2018-12-06 (347MB) and install with:

pip install torch-1.0.0a0+b5db6ac+20181206-cp37-cp37m-linux_x86_64.whl

The libraries in the wheel don’t have the conda-style relative RUNPATH correctly set, so you have to set LD_LIBRARY_PATH every time you start jupyter or any other Python code. This should work:
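
One hedged way to do this from Python itself is to build a modified environment and start your actual work in a child process, since the dynamic loader reads LD_LIBRARY_PATH when a process starts. The helper with_library_dirs and both directories below are assumptions: I am guessing at the active environment's lib directory and a conventional CUDA install location, so adjust them to your system.

```python
import os
import sys


def with_library_dirs(lib_dirs, base_env=None):
    """Return a copy of the environment with lib_dirs prepended to LD_LIBRARY_PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("LD_LIBRARY_PATH", "")
    parts = list(lib_dirs) + ([current] if current else [])
    env["LD_LIBRARY_PATH"] = ":".join(parts)
    return env


# Assumed locations, adjust to your system: the lib directory of the
# active Python environment and a conventional CUDA install path.
candidate_dirs = [os.path.join(sys.prefix, "lib"), "/usr/local/cuda/lib64"]
child_env = with_library_dirs(candidate_dirs)
print(child_env["LD_LIBRARY_PATH"])

# A child process would then be started with this environment, for example:
# subprocess.run([sys.executable, "train.py"], env=child_env)  # train.py is hypothetical
```

Setting the variable in your shell before launching jupyter or Python achieves the same thing and is probably the simpler route.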


You’re now good to go!

First tests of mixed precision training on Tesla V100.

I fired up a Google Compute Engine node with a Tesla V100 in Amsterdam to check that everything works.

I used the latest version of the fastai library, and specifically the callbacks.fp16 notebook which forms part of the brilliant new fastai documentation generation system. See for example the generated page on the fp16 callbacks.

Below I show the MNIST example code where I tried to compare fp32 with fp16 (well, mixed precision to be precise) training.

The simple CNN trains up to 97% accuracy in 8 seconds, which is pretty quick already, but I could not see any training speed difference between fp16 and fp32. This could very well be because the network is so tiny.

However, I could confirm that the model parameters (at the very least) were all stored in fp16 floats when using the to_fp16() Learner method.

Train CNN with fp16

from fastai import *
from fastai.vision import *
path = untar_data(URLs.MNIST_SAMPLE)
data = ImageDataBunch.from_folder(path)
model = simple_cnn((3,16,16,2))
learn = Learner(data, model, metrics=[accuracy]).to_fp16()
Total time: 00:08
epoch  train_loss  valid_loss  accuracy
1      0.202592    0.139505    0.948970  (00:01)
2      0.112530    0.103523    0.967125  (00:01)
3      0.079813    0.063746    0.973994  (00:01)
4      0.066733    0.056465    0.976938  (00:01)
5      0.069775    0.055017    0.977429  (00:01)

Check that type of parameters is half:

for p in model.parameters():
    print(p.dtype)  # dtype of each parameter tensor

Train CNN with fp32

model32 = simple_cnn((3,16,16,2))
learn32 = Learner(data, model32, metrics=[accuracy])
Total time: 00:08
epoch  train_loss  valid_loss  accuracy
1      0.213889    0.151780    0.942100  (00:01)
2      0.106975    0.092190    0.966634  (00:01)
3      0.084529    0.083353    0.973013  (00:01)
4      0.069017    0.066023    0.976938  (00:01)
5      0.060235    0.056738    0.980373  (00:01)

Check that type of model parameters is full float:

for p in model32.parameters():
    print(p.dtype)  # dtype of each parameter tensor