Revisiting ResNets: Improved Training and Scaling Strategies
Authors: Irwan Bello, William Fedus, Xianzhi Du, Ekin Dogus Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, Barret Zoph
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work revisits the canonical ResNet [13] and studies these three aspects in an effort to disentangle them. Perhaps surprisingly, we find that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models. We show that the best performing scaling strategy depends on the training regime and offer two new scaling strategies: (1) scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise); (2) increase image resolution more slowly than previously recommended [55]. Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7x faster than EfficientNet-NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. We recommend practitioners use these simple revised ResNets as baselines for future research. |
| Researcher Affiliation | Collaboration | Irwan Bello, Google Brain; William Fedus, Google Brain; Xianzhi Du, Google Brain; Ekin D. Cubuk, Google Brain; Aravind Srinivas, UC Berkeley; Tsung-Yi Lin, Google Brain; Jonathon Shlens, Google Brain; Barret Zoph, Google Brain |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and checkpoints available in TensorFlow: https://github.com/tensorflow/models/tree/master/official/vision/beta and https://github.com/tensorflow/tpu/tree/master/models/official/resnet/resnet_rs |
| Open Datasets | Yes | Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7x faster than EfficientNet-NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. |
| Dataset Splits | Yes | To select the hyperparameters for the various regularization and training methods, we use a held-out validation set comprising 2% of the ImageNet training set (20 shards out of 1024). This is referred to as the minival-set and the original ImageNet validation set (the one reported in most prior works) is referred to as validation-set. A minimal sketch of this shard-level split follows the table. |
| Hardware Specification | Yes | Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7x faster than EfficientNet-NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. ... Figure 4 and Table 2 compare EfficientNets against ResNet-RS on a speed-accuracy Pareto curve. We find that ResNet-RS match EfficientNets' performance while being 1.7x - 2.7x faster on TPUs (2.1x - 3.3x faster on GPUs). We point out that these speed-ups are superior to those obtained by TResNet and ResNeSt, suggesting that ResNet-RS also outperform TResNet and ResNeSt. ... This large speed-up over EfficientNet may be non-intuitive since EfficientNets have significantly reduced parameters and FLOPs compared to ResNets. We next discuss why a model with fewer parameters and fewer FLOPs (EfficientNet) is slower and more memory-intensive during training. FLOPs vs Latency. While FLOPs provide a hardware-agnostic metric for assessing computational demand, they may not be indicative of actual latency times for training and inference [19, 18, 39]. In custom hardware architectures (e.g. TPUs and GPUs), FLOPs are an especially poor proxy because operations are often bounded by memory access costs and have different levels of optimization on modern matrix multiplication units [24]. The inverted bottlenecks [46] used in EfficientNets employ depthwise convolutions with large activations and have a small compute-to-memory ratio (operational intensity) compared to the ResNet's bottleneck blocks, which employ dense convolutions on smaller activations. This makes EfficientNets less efficient on modern accelerators compared to ResNets. Figure 4 (table on the right) illustrates this point: a ResNet-RS model with 1.8x more FLOPs than EfficientNet-B6 is 2.7x faster on a TPUv3 hardware accelerator. A back-of-the-envelope operational-intensity sketch follows the table. |
| Software Dependencies | No | The paper mentions 'TensorFlow' but does not specify a version number or list other software dependencies with specific versions. |
| Experiment Setup | Yes | Regularization and Data Augmentation. We apply weight decay, label smoothing, dropout and stochastic depth for regularization. Dropout [50] is a common technique used in computer vision and we apply it to the output after the global average pooling occurs in the final layer. Stochastic depth [22] drops out each layer in the network (that has residual connections around it) with a specified probability that is a function of the layer depth. We use RandAugment [7] data augmentation as an additional regularizer. RandAugment applies a sequence of random image transformations (e.g. translate, shear, color distortions) to each image independently during training. Our training method closely matches that of EfficientNet, where we train for 350 epochs, but with a few small differences (e.g. we use Momentum with cosine learning rate schedule as opposed to RMSProp with exponential decay). See Appendix D for details. Hyperparameter Tuning. To select the hyperparameters for the various regularization and training methods, we use a held-out validation set comprising 2% of the ImageNet training set (20 shards out of 1024). This is referred to as the minival-set and the original ImageNet validation set (the one reported in most prior works) is referred to as validation-set. Unless specified otherwise, results are reported on the validation-set. The hyperparameters of all ResNet-RS models are in Table 8 in Appendix C. |
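
For readers reproducing the Experiment Setup row, here is a minimal TensorFlow/Keras sketch of how the listed ingredients (Momentum with a cosine learning-rate schedule over 350 epochs, label smoothing, dropout after global average pooling, and depth-scaled stochastic depth) could be wired together. All numeric values are illustrative placeholders, not the paper's Table 8 hyperparameters; weight decay and RandAugment are only noted in comments.

```python
import tensorflow as tf

# Illustrative placeholders -- NOT the values from Table 8 of the paper.
NUM_EPOCHS = 350                # training length reported in the paper
STEPS_PER_EPOCH = 1000          # placeholder: depends on batch size
BASE_LR = 0.1                   # placeholder initial learning rate
LABEL_SMOOTHING = 0.1           # placeholder
DROPOUT_RATE = 0.25             # placeholder, applied after global average pooling
STOCHASTIC_DEPTH_RATE = 0.1     # placeholder drop rate for the deepest block

# Momentum optimizer with a cosine learning-rate schedule
# (in place of EfficientNet's RMSProp with exponential decay).
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=BASE_LR,
    decay_steps=NUM_EPOCHS * STEPS_PER_EPOCH,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

# Label smoothing is folded into the loss. Weight decay and RandAugment are omitted
# here; they would be added as an L2 penalty and an input-pipeline transform.
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, label_smoothing=LABEL_SMOOTHING
)

def stochastic_depth_rates(num_blocks, final_rate=STOCHASTIC_DEPTH_RATE):
    """Per-block drop probability that grows linearly with depth, so earlier
    residual blocks are dropped less often than later ones."""
    return [final_rate * (i + 1) / num_blocks for i in range(num_blocks)]

print(stochastic_depth_rates(16))  # hypothetical 16-block network

# Classification-head sketch: dropout after global average pooling.
#   x = tf.keras.layers.GlobalAveragePooling2D()(features)
#   x = tf.keras.layers.Dropout(DROPOUT_RATE)(x)
#   logits = tf.keras.layers.Dense(num_classes)(x)
```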
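
The Dataset Splits row reports that 20 of the 1024 ImageNet training shards (about 2%) are held out as a minival-set for hyperparameter tuning. A trivial sketch of that shard-level split is below; the shard filename pattern is hypothetical.

```python
# Hold out 20 of 1024 training shards (~2%) as a minival-set for tuning.
# The filename pattern is hypothetical; adapt it to your own TFRecord layout.
NUM_SHARDS = 1024
NUM_MINIVAL_SHARDS = 20

all_shards = [f"imagenet/train-{i:05d}-of-{NUM_SHARDS:05d}" for i in range(NUM_SHARDS)]
minival_shards = all_shards[:NUM_MINIVAL_SHARDS]
train_shards = all_shards[NUM_MINIVAL_SHARDS:]

print(len(minival_shards) / NUM_SHARDS)  # 0.01953125, i.e. roughly 2%
```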
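
The Hardware Specification row explains the speed gap via operational intensity: depthwise convolutions on large activations (EfficientNet's inverted bottlenecks) perform few FLOPs per byte of memory traffic, while ResNet's dense convolutions on smaller activations perform many. The sketch below is a back-of-the-envelope estimate under simplifying assumptions (a single layer, no caching, padding, stride, or operator fusion); the feature-map shapes are hypothetical.

```python
def conv_intensity(h, w, c_in, c_out, k, depthwise=False, bytes_per_elem=2):
    """Rough FLOPs per byte of memory traffic for one conv on an h x w feature map.
    Ignores caches, padding, stride and operator fusion -- illustration only."""
    if depthwise:
        flops = 2 * h * w * c_in * k * k          # one k x k filter per channel
        weights = c_in * k * k
        c_out = c_in                              # depthwise keeps the channel count
    else:
        flops = 2 * h * w * c_in * c_out * k * k  # dense convolution
        weights = c_in * c_out * k * k
    activations = h * w * (c_in + c_out)          # read input, write output
    bytes_moved = (activations + weights) * bytes_per_elem
    return flops / bytes_moved

# Hypothetical shapes: a wide depthwise 3x3 on a large activation (inverted-bottleneck
# style) vs. a dense 3x3 on a smaller activation (ResNet bottleneck style).
print("depthwise 3x3:", round(conv_intensity(56, 56, 384, 384, 3, depthwise=True), 1))
print("dense 3x3:    ", round(conv_intensity(28, 28, 128, 128, 3, depthwise=False), 1))
```

With these (hypothetical) shapes the dense convolution performs on the order of 100x more FLOPs per byte moved, which is the qualitative point the paper makes about why fewer FLOPs do not translate into lower latency on TPUs/GPUs.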