Non-deep Networks

Authors: Ankit Goyal, Alexey Bochkovskiy, Jia Deng, Vladlen Koltun

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We show that it is. To do so, we use parallel subnetworks instead of stacking one layer after another. This helps effectively reduce depth while maintaining high performance. By utilizing parallel substructures, we show, for the first time, that a network with a depth of just 12 can achieve top-1 accuracy over 80% on ImageNet, 96% on CIFAR10, and 81% on CIFAR100. We also show that a network with a low-depth (12) backbone can achieve an AP of 48% on MS-COCO. (The parallel-subnetwork idea is illustrated in the first code sketch below the table.)
Researcher Affiliation | Collaboration | Ankit Goyal (1,2), Alexey Bochkovskiy (2,3), Jia Deng (1), Vladlen Koltun (2,3); 1: Princeton University, 2: Intel Labs, 3: Apple
Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure A1) but no formal pseudocode or algorithm blocks.
Open Source Code | No | 3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No]
Open Datasets | Yes | Experiments on ImageNet. ImageNet (Deng et al., 2009) is a large-scale image classification dataset... We evaluate on the ILSVRC2012 (Russakovsky et al., 2015) dataset... MS-COCO (Lin et al., 2014) is an object detection dataset... The CIFAR datasets consist of colored natural images with 32×32 pixels. CIFAR-10 consists of images drawn from 10 classes and CIFAR-100 from 100 classes.
Dataset Splits | Yes | ILSVRC2012 (Russakovsky et al., 2015) dataset, which consists of 1.28M training images and 50K validation images with 1000 classes.
Hardware Specification | Yes | Speed was measured on a GeForce RTX 3090 with PyTorch 1.8.1 and CUDA 11.1. (A generic GPU latency-measurement sketch follows the table.)
Software Dependencies | Yes | Speed was measured on a GeForce RTX 3090 with PyTorch 1.8.1 and CUDA 11.1.
Experiment Setup | Yes | We train our models for 120 epochs using the SGD optimizer, a step scheduler with a warmup for first 5 epochs, a learning rate decay of 0.1 at every 30th epoch, an initial learning rate of 0.8, and a batch size of 2048 (256 per GPU)... We train for 400 epochs with a batch size of 128. The initial learning rate is 0.1 and is decreased by a factor of 5 at 30%, 60%, and 80% of the epochs as in (Zagoruyko & Komodakis, 2016). Similar to prior works (Zagoruyko & Komodakis, 2016; Huang et al., 2016), we use a weight decay of 0.0003 and set dropout in the convolution layer at 0.2 and dropout in the final fully-connected layer at 0.2 for all our networks on both datasets. (This ImageNet schedule is approximated in the last code sketch below.)
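
To make the "parallel subnetworks instead of stacking" idea quoted in the Research Type row concrete, here is a minimal PyTorch sketch in which a few shallow streams process the input side by side and are fused before the classifier. This is an illustration only, not the paper's ParNet block: the number of streams, the channel widths, and the single-resolution fusion here are assumptions (the paper's streams differ, e.g. they operate on features at different resolutions with specialized blocks).

```python
import torch
import torch.nn as nn

class ParallelStreams(nn.Module):
    """Illustrative sketch: depth stays small because the streams run
    side by side instead of being stacked sequentially."""

    def __init__(self, in_ch=3, stream_ch=64, num_streams=3, num_classes=1000):
        super().__init__()
        # Each stream is deliberately shallow (two conv layers here).
        self.streams = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, stream_ch, 3, padding=1),
                nn.BatchNorm2d(stream_ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(stream_ch, stream_ch, 3, padding=1),
                nn.BatchNorm2d(stream_ch),
                nn.ReLU(inplace=True),
            )
            for _ in range(num_streams)
        ])
        # Fuse the parallel streams and classify.
        self.head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(stream_ch * num_streams, num_classes),
        )

    def forward(self, x):
        # All streams see the same input in this sketch; the paper's
        # streams work at different resolutions (simplified away here).
        outs = [stream(x) for stream in self.streams]
        return self.head(torch.cat(outs, dim=1))

model = ParallelStreams()
print(model(torch.randn(2, 3, 224, 224)).shape)  # torch.Size([2, 1000])
```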
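
The speed figures referenced in the Hardware Specification and Software Dependencies rows depend on GPU synchronization, so the timing harness matters when reproducing them. Below is a generic latency-measurement sketch for a setup like the reported GeForce RTX 3090 with PyTorch and CUDA; it is not the authors' benchmarking code, and the warmup count, iteration count, and input shape are assumed values.

```python
import time
import torch

def measure_latency(model, input_shape=(1, 3, 224, 224), warmup=20, iters=100):
    """Generic GPU latency measurement; not the authors' benchmarking script."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    with torch.no_grad():
        for _ in range(warmup):       # warm up kernels and caches
            model(x)
        torch.cuda.synchronize()      # wait for all queued GPU work
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0  # ms per forward pass
```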
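
The ImageNet recipe quoted in the Experiment Setup row (SGD, 120 epochs, warmup over the first 5 epochs, a decay of 0.1 at every 30th epoch, initial learning rate 0.8, batch size 2048) can be approximated with standard PyTorch schedulers. The sketch below assumes a linear warmup shape and fills in momentum and weight decay with common defaults, since those values are not given in the quoted ImageNet setup; the model is a placeholder.

```python
from torch import nn
from torch.optim import SGD
from torch.optim.lr_scheduler import LambdaLR

EPOCHS, WARMUP, BASE_LR = 120, 5, 0.8

model = nn.Linear(10, 10)  # placeholder for the actual low-depth network

# Momentum and weight decay are assumed; the quoted setup only specifies
# the optimizer type, initial learning rate, schedule, and batch size.
optimizer = SGD(model.parameters(), lr=BASE_LR, momentum=0.9, weight_decay=1e-4)

def lr_lambda(epoch):
    if epoch < WARMUP:               # warmup over the first 5 epochs (shape assumed linear)
        return (epoch + 1) / WARMUP
    return 0.1 ** (epoch // 30)      # multiply the learning rate by 0.1 at every 30th epoch

scheduler = LambdaLR(optimizer, lr_lambda)

for epoch in range(EPOCHS):
    # ... one training epoch over ImageNet at a global batch size of 2048 (256 per GPU) ...
    scheduler.step()
```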