EfficientNetV2: Smaller Models and Faster Training

Authors: Mingxing Tan, Quoc Le

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets.
Researcher Affiliation | Industry | Google Research, Brain Team. Correspondence to: Mingxing Tan <tanmingxing@google.com>.
Pseudocode | Yes | Algorithm 1: Progressive learning with adaptive regularization. (A hedged code sketch of this algorithm appears after the table.)
Open Source Code | Yes | Code is available at https://github.com/google/automl/tree/master/efficientnetv2.
Open Datasets | Yes | ImageNet ILSVRC2012 (Russakovsky et al., 2015) contains about 1.28M training images and 50,000 validation images with 1000 classes. ImageNet21k (Russakovsky et al., 2015) contains about 13M training images with 21,841 classes. We evaluate our models on four transfer learning datasets: CIFAR-10, CIFAR-100, Flowers and Cars. Table 9 includes the statistics of these datasets.
Dataset Splits | Yes | ImageNet ILSVRC2012 (Russakovsky et al., 2015) contains about 1.28M training images and 50,000 validation images with 1000 classes. During architecture search or hyperparameter tuning, we reserve 25,000 images (about 2%) from the training set as minival for accuracy evaluation.
Hardware Specification | Yes | Training time is measured with 32 TPU cores. All EfficientNetV2 models are trained with progressive learning. Our EfficientNetV2 and progressive learning also make it easier to train models on larger datasets. For example, ImageNet21k (Russakovsky et al., 2015) is about 10x larger than ImageNet ILSVRC2012, but our EfficientNetV2 can finish the training within two days using moderate computing resources of 32 TPUv3 cores. Inference time is measured on a V100 GPU in FP16 with batch size 16 using the same codebase (Wightman, 2021); training time is the total training time normalized for 32 TPU cores.
Software Dependencies | No | The paper mentions using 'RMSProp optimizer', 'batch norm', 'RandAugment', 'Mixup', 'Dropout', 'stochastic depth', and the 'PyTorch Image Models codebase (Wightman, 2021)'. However, it does not provide specific version numbers for PyTorch or other software libraries.
Experiment Setup | Yes | Our ImageNet training settings largely follow EfficientNets (Tan & Le, 2019a): RMSProp optimizer with decay 0.9 and momentum 0.9; batch norm momentum 0.99; weight decay 1e-5. Each model is trained for 350 epochs with total batch size 4096. Learning rate is first warmed up from 0 to 0.256, and then decayed by 0.97 every 2.4 epochs. We use exponential moving average with 0.9999 decay rate, RandAugment (Cubuk et al., 2020), Mixup (Zhang et al., 2018), Dropout (Srivastava et al., 2014), and stochastic depth (Huang et al., 2016) with 0.8 survival probability. For progressive learning, we divide the training process into four stages with about 87 epochs per stage: the early stage uses a small image size with weak regularization, while the later stages use larger image sizes with stronger regularization, as described in Algorithm 1. Table 6 shows the minimum (for the first stage) and maximum (for the last stage) values of image size and regularization. (A sketch of the reported learning-rate schedule also appears after the table.)
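
The Pseudocode row references Algorithm 1, progressive learning with adaptive regularization: training is split into stages, and both the image size and the regularization strengths are linearly interpolated from their minimum (first stage) to maximum (last stage) values. The sketch below illustrates that idea only; the stage count matches the quoted setup, but the specific min/max values are illustrative placeholders rather than the paper's Table 6 numbers, and `train_one_epoch` is a hypothetical stand-in for a real training loop.

```python
# Minimal sketch of progressive learning with adaptive regularization
# (Algorithm 1). Min/max values are placeholders, not Table 6 settings.

def interp(lo, hi, stage, num_stages):
    """Linearly interpolate from `lo` (first stage) to `hi` (last stage)."""
    frac = stage / (num_stages - 1) if num_stages > 1 else 1.0
    return lo + (hi - lo) * frac

def train_one_epoch(image_size, dropout, randaug_magnitude, mixup_alpha):
    # Placeholder: a real implementation would rebuild the input pipeline at
    # `image_size` and apply the given regularization strengths.
    pass

def progressive_training(total_epochs=350, num_stages=4,
                         image_size=(128, 300),   # (min, max) input resolution
                         dropout=(0.1, 0.3),      # (min, max) dropout rate
                         randaug=(5, 15),         # (min, max) RandAugment magnitude
                         mixup=(0.0, 0.2)):       # (min, max) Mixup alpha
    epochs_per_stage = total_epochs // num_stages  # about 87 epochs per stage
    for stage in range(num_stages):
        size = int(interp(*image_size, stage, num_stages))
        for _ in range(epochs_per_stage):
            train_one_epoch(image_size=size,
                            dropout=interp(*dropout, stage, num_stages),
                            randaug_magnitude=interp(*randaug, stage, num_stages),
                            mixup_alpha=interp(*mixup, stage, num_stages))
```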
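
As a worked example of the learning-rate schedule quoted in the Experiment Setup row (warm up from 0 to 0.256, then decay by 0.97 every 2.4 epochs), the sketch below computes the rate as a function of the epoch. The warmup length (assumed 5 epochs here) and the choice of continuous rather than staircase decay are assumptions, since neither is specified in the quoted text.

```python
# Hedged sketch of the reported schedule: linear warmup from 0 to 0.256,
# then exponential decay by 0.97 every 2.4 epochs. The 5-epoch warmup and
# continuous (non-staircase) decay are assumptions.

def learning_rate(epoch, base_lr=0.256, warmup_epochs=5.0,
                  decay_rate=0.97, decay_epochs=2.4):
    if epoch < warmup_epochs:
        return base_lr * epoch / warmup_epochs  # linear warmup from 0
    return base_lr * decay_rate ** ((epoch - warmup_epochs) / decay_epochs)

# Rough check over a 350-epoch run:
for e in (0, 5, 100, 350):
    print(f"epoch {e:3d}: lr = {learning_rate(e):.5f}")
```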