Revisiting ResNets: Improved Training and Scaling Strategies
Authors: Irwan Bello, William Fedus, Xianzhi Du, Ekin Dogus Cubuk, Aravind Srinivas, Tsung-Yi Lin, Jonathon Shlens, Barret Zoph
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our work revisits the canonical ResNet [13] and studies these three aspects in an effort to disentangle them. Perhaps surprisingly, we find that training and scaling strategies may matter more than architectural changes, and further, that the resulting ResNets match recent state-of-the-art models. We show that the best performing scaling strategy depends on the training regime and offer two new scaling strategies: (1) scale model depth in regimes where overfitting can occur (width scaling is preferable otherwise); (2) increase image resolution more slowly than previously recommended [55]. Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7x faster than EfficientNet-NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. We recommend practitioners use these simple revised ResNets as baselines for future research. |
| Researcher Affiliation | Collaboration | Irwan Bello, Google Brain; William Fedus, Google Brain; Xianzhi Du, Google Brain; Ekin D. Cubuk, Google Brain; Aravind Srinivas, UC Berkeley; Tsung-Yi Lin, Google Brain; Jonathon Shlens, Google Brain; Barret Zoph, Google Brain |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and checkpoints available in TensorFlow: https://github.com/tensorflow/models/tree/master/official/vision/beta and https://github.com/tensorflow/tpu/tree/master/models/official/resnet/resnet_rs |
| Open Datasets | Yes | Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7x faster than EfficientNet-NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. |
| Dataset Splits | Yes | To select the hyperparameters for the various regularization and training methods, we use a held-out validation set comprising 2% of the ImageNet training set (20 shards out of 1024). This is referred to as the minival-set and the original ImageNet validation set (the one reported in most prior works) is referred to as validation-set. A minimal sketch of this shard-level split follows the table. |
| Hardware Specification | Yes | Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x - 2.7x faster than EfficientNets on TPUs, while achieving similar accuracies on ImageNet. In a large-scale semi-supervised learning setup, ResNet-RS achieves 86.2% top-1 ImageNet accuracy, while being 4.7x faster than EfficientNet-NoisyStudent. The training techniques improve transfer performance on a suite of downstream tasks (rivaling state-of-the-art self-supervised algorithms) and extend to video classification on Kinetics-400. ... Figure 4 and Table 2 compare EfficientNets against ResNet-RS on a speed-accuracy Pareto curve. We find that ResNet-RS match EfficientNets' performance while being 1.7x - 2.7x faster on TPUs (2.1x - 3.3x faster on GPUs). We point out that these speed-ups are superior to those obtained by TResNet and ResNeSt, suggesting that ResNet-RS also outperform TResNet and ResNeSt. ... This large speed-up over EfficientNet may be non-intuitive since EfficientNets have significantly reduced parameters and FLOPs compared to ResNets. We next discuss why a model with fewer parameters and fewer FLOPs (EfficientNet) is slower and more memory-intensive during training. FLOPs vs Latency. While FLOPs provide a hardware-agnostic metric for assessing computational demand, they may not be indicative of actual latency times for training and inference [19, 18, 39]. In custom hardware architectures (e.g. TPUs and GPUs), FLOPs are an especially poor proxy because operations are often bounded by memory access costs and have different levels of optimization on modern matrix multiplication units [24]. The inverted bottlenecks [46] used in EfficientNets employ depthwise convolutions with large activations and have a small compute-to-memory ratio (operational intensity) compared to the ResNet's bottleneck blocks, which employ dense convolutions on smaller activations. This makes EfficientNets less efficient on modern accelerators compared to ResNets. Figure 4 (table on the right) illustrates this point: a ResNet-RS model with 1.8x more FLOPs than EfficientNet-B6 is 2.7x faster on a TPUv3 hardware accelerator. A back-of-the-envelope operational-intensity sketch follows the table. |
| Software Dependencies | No | The paper mentions 'TensorFlow' but does not specify a version number or list other software dependencies with specific versions. |
| Experiment Setup | Yes | Regularization and Data Augmentation. We apply weight decay, label smoothing, dropout and stochastic depth for regularization. Dropout [50] is a common technique used in computer vision and we apply it to the output after the global average pooling occurs in the final layer. Stochastic depth [22] drops out each layer in the network (that has residual connections around it) with a specified probability that is a function of the layer depth. We use RandAugment [7] data augmentation as an additional regularizer. RandAugment applies a sequence of random image transformations (e.g. translate, shear, color distortions) to each image independently during training. Our training method closely matches that of EfficientNet, where we train for 350 epochs, but with a few small differences (e.g. we use Momentum with cosine learning rate schedule as opposed to RMSProp with exponential decay). See Appendix D for details. Hyperparameter Tuning. To select the hyperparameters for the various regularization and training methods, we use a held-out validation set comprising 2% of the ImageNet training set (20 shards out of 1024). This is referred to as the minival-set and the original ImageNet validation set (the one reported in most prior works) is referred to as validation-set. Unless specified otherwise, results are reported on the validation-set. The hyperparameters of all ResNet-RS models are in Table 8 in Appendix C. |
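
For readers reproducing the Experiment Setup row, here is a minimal TensorFlow/Keras sketch of how the listed ingredients (Momentum with a cosine learning-rate schedule over 350 epochs, label smoothing, dropout after global average pooling, and depth-scaled stochastic depth) could be wired together. All numeric values are illustrative placeholders, not the paper's Table 8 hyperparameters; weight decay and RandAugment are only noted in comments.

```python
import tensorflow as tf

# Illustrative placeholders -- NOT the values from Table 8 of the paper.
NUM_EPOCHS = 350                # training length reported in the paper
STEPS_PER_EPOCH = 1000          # placeholder: depends on batch size
BASE_LR = 0.1                   # placeholder initial learning rate
LABEL_SMOOTHING = 0.1           # placeholder
DROPOUT_RATE = 0.25             # placeholder, applied after global average pooling
STOCHASTIC_DEPTH_RATE = 0.1     # placeholder drop rate for the deepest block

# Momentum optimizer with a cosine learning-rate schedule
# (in place of EfficientNet's RMSProp with exponential decay).
lr_schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=BASE_LR,
    decay_steps=NUM_EPOCHS * STEPS_PER_EPOCH,
)
optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)

# Label smoothing is folded into the loss. Weight decay and RandAugment are omitted
# here; they would be added as an L2 penalty and an input-pipeline transform.
loss_fn = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, label_smoothing=LABEL_SMOOTHING
)

def stochastic_depth_rates(num_blocks, final_rate=STOCHASTIC_DEPTH_RATE):
    """Per-block drop probability that grows linearly with depth, so earlier
    residual blocks are dropped less often than later ones."""
    return [final_rate * (i + 1) / num_blocks for i in range(num_blocks)]

print(stochastic_depth_rates(16))  # hypothetical 16-block network

# Classification-head sketch: dropout after global average pooling.
#   x = tf.keras.layers.GlobalAveragePooling2D()(features)
#   x = tf.keras.layers.Dropout(DROPOUT_RATE)(x)
#   logits = tf.keras.layers.Dense(num_classes)(x)
```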
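
The Dataset Splits row reports that 20 of the 1024 ImageNet training shards (about 2%) are held out as a minival-set for hyperparameter tuning. A trivial sketch of that shard-level split is below; the shard filename pattern is hypothetical.

```python
# Hold out 20 of 1024 training shards (~2%) as a minival-set for tuning.
# The filename pattern is hypothetical; adapt it to your own TFRecord layout.
NUM_SHARDS = 1024
NUM_MINIVAL_SHARDS = 20

all_shards = [f"imagenet/train-{i:05d}-of-{NUM_SHARDS:05d}" for i in range(NUM_SHARDS)]
minival_shards = all_shards[:NUM_MINIVAL_SHARDS]
train_shards = all_shards[NUM_MINIVAL_SHARDS:]

print(len(minival_shards) / NUM_SHARDS)  # 0.01953125, i.e. roughly 2%
```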
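
The Hardware Specification row explains the speed gap via operational intensity: depthwise convolutions on large activations (EfficientNet's inverted bottlenecks) perform few FLOPs per byte of memory traffic, while ResNet's dense convolutions on smaller activations perform many. The sketch below is a back-of-the-envelope estimate under simplifying assumptions (a single layer, no caching, padding, stride, or operator fusion); the feature-map shapes are hypothetical.

```python
def conv_intensity(h, w, c_in, c_out, k, depthwise=False, bytes_per_elem=2):
    """Rough FLOPs per byte of memory traffic for one conv on an h x w feature map.
    Ignores caches, padding, stride and operator fusion -- illustration only."""
    if depthwise:
        flops = 2 * h * w * c_in * k * k          # one k x k filter per channel
        weights = c_in * k * k
        c_out = c_in                              # depthwise keeps the channel count
    else:
        flops = 2 * h * w * c_in * c_out * k * k  # dense convolution
        weights = c_in * c_out * k * k
    activations = h * w * (c_in + c_out)          # read input, write output
    bytes_moved = (activations + weights) * bytes_per_elem
    return flops / bytes_moved

# Hypothetical shapes: a wide depthwise 3x3 on a large activation (inverted-bottleneck
# style) vs. a dense 3x3 on a smaller activation (ResNet bottleneck style).
print("depthwise 3x3:", round(conv_intensity(56, 56, 384, 384, 3, depthwise=True), 1))
print("dense 3x3:    ", round(conv_intensity(28, 28, 128, 128, 3, depthwise=False), 1))
```

With these (hypothetical) shapes the dense convolution performs on the order of 100x more FLOPs per byte moved, which is the qualitative point the paper makes about why fewer FLOPs do not translate into lower latency on TPUs/GPUs.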