High-Performance Large-Scale Image Recognition Without Normalization

Authors: Andy Brock, Soham De, Samuel L. Smith, Karen Simonyan

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this work, we develop an adaptive gradient clipping technique which overcomes these instabilities, and design a significantly improved class of Normalizer-Free ResNets. Our smaller models match the test accuracy of an EfficientNet-B7 on ImageNet while being up to 8.7x faster to train, and our largest models attain a new state-of-the-art top-1 accuracy of 86.5%.
Researcher Affiliation | Industry | DeepMind, London, United Kingdom. Correspondence to: Andrew Brock <ajbrock@deepmind.com>.
Pseudocode | No | The paper describes algorithms using mathematical formulas but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code and pretrained models are available at https://github.com/deepmind/deepmind-research/tree/master/nfnets
Open Datasets | Yes | We now turn our attention to evaluating our NFNet models on ImageNet... (Russakovsky et al., 2015) ...object detection on COCO (Lin et al., 2014).
Dataset Splits | Yes | Figure 1. ImageNet Validation Accuracy vs Training Latency.
Hardware Specification | Yes | Latencies are given as the time in milliseconds required to perform a single full training step on TPU or GPU (V100).
Software Dependencies | No | The paper mentions software components like JAX, Haiku, and NumPy, but it does not provide specific version numbers for these dependencies, which are necessary for full reproducibility.
Experiment Setup | Yes | We performed experiments on pre-activation NF-ResNet-50 and NF-ResNet-200 on ImageNet, trained using SGD with Nesterov's Momentum for 90 epochs at a range of batch sizes between 256 and 4096. As in Goyal et al. (2017), we use a base learning rate of 0.1 for batch size 256, which is scaled linearly with the batch size. We now turn our attention to evaluating our NFNet models on ImageNet, beginning with an ablation of our architectural modifications when training for 360 epochs at batch size 4096. We use Nesterov's Momentum with a momentum coefficient of 0.9, AGC as described in Section 4 with a clipping threshold of 0.01, and a learning rate which linearly increases from 0 to 1.6 over 5 epochs, before decaying to zero with cosine annealing (Loshchilov & Hutter, 2017). (Sketches of the AGC step and this learning-rate schedule follow the table.)
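
For reference, a minimal sketch of the adaptive gradient clipping (AGC) step quoted above, written against jax.numpy: a gradient unit is rescaled whenever its norm exceeds the clipping threshold times the norm of the matching parameter unit. The unit-wise norm convention (output dimension first) and the helper names are assumptions for illustration, not the released implementation.

    import jax.numpy as jnp

    def _unitwise_norm(x):
        # Frobenius norm per output unit: reduce over every axis except the
        # first for matrices/conv kernels; biases and gains use their full norm.
        if x.ndim <= 1:
            return jnp.sqrt(jnp.sum(x * x))
        axes = tuple(range(1, x.ndim))
        return jnp.sqrt(jnp.sum(x * x, axis=axes, keepdims=True))

    def adaptive_grad_clip(grad, param, clipping=0.01, eps=1e-3):
        # Rescale any unit whose gradient norm exceeds clipping * parameter norm.
        max_norm = clipping * jnp.maximum(_unitwise_norm(param), eps)
        grad_norm = _unitwise_norm(grad)
        scale = max_norm / jnp.maximum(grad_norm, 1e-6)  # guard divide-by-zero
        return jnp.where(grad_norm > max_norm, grad * scale, grad)

In a training loop this would be applied parameter-by-parameter (for example via jax.tree_util.tree_map over the gradient and parameter pytrees) before the Nesterov momentum update, with clipping = 0.01 as quoted in the experiment setup.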
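
Likewise, the quoted learning-rate recipe (a base rate of 0.1 at batch size 256 scaled linearly with batch size, a 5-epoch linear warmup from 0, then cosine annealing to zero) can be sketched as a plain step-to-rate function; the function name, signature, and the 360-epoch default are illustrative.

    import math

    def lr_at_step(step, steps_per_epoch, batch_size, base_lr=0.1,
                   base_batch=256, warmup_epochs=5, total_epochs=360):
        # Peak rate follows the linear scaling rule (0.1 at batch size 256).
        peak_lr = base_lr * batch_size / base_batch
        warmup_steps = warmup_epochs * steps_per_epoch
        total_steps = total_epochs * steps_per_epoch
        if step < warmup_steps:
            # Linear warmup from 0 to the peak rate over the first 5 epochs.
            return peak_lr * step / warmup_steps
        # Cosine annealing from the peak rate down to zero.
        progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
        return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

At batch size 4096 the scaling rule gives a peak rate of 0.1 * 4096 / 256 = 1.6, matching the schedule quoted in the experiment setup.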