Do Deep Convolutional Nets Really Need to be Deep and Convolutional?
Authors: Gregor Urban, Krzysztof J. Geras, Samira Ebrahimi Kahou, Ozlem Aslan, Shengjie Wang, Abdelrahman Mohamed, Matthai Philipose, Matt Richardson, Rich Caruana
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper provides the first empirical demonstration that deep convolutional models really need to be both deep and convolutional, even when trained with methods such as distillation that allow small or shallow models of high accuracy to be trained. Although previous research showed that shallow feed-forward nets sometimes can learn the complex functions previously learned by deep nets while using the same number of parameters as the deep models they mimic, in this paper we demonstrate that the same methods cannot be used to train accurate models on CIFAR-10 unless the student models contain multiple layers of convolution. Although the student models do not have to be as deep as the teacher model they mimic, the students need multiple convolutional layers to learn functions of comparable accuracy as the deep convolutional teacher. |
| Researcher Affiliation | Collaboration | Gregor Urban (UC Irvine, USA), Krzysztof J. Geras (University of Edinburgh, UK), Samira Ebrahimi Kahou (Ecole Polytechnique de Montreal, CA), Ozlem Aslan (University of Alberta, CA), Shengjie Wang (University of Washington, USA), Abdelrahman Mohamed (Microsoft Research, USA), Matthai Philipose (Microsoft Research, USA), Matt Richardson (Microsoft Research, USA), Rich Caruana (Microsoft Research, USA) |
| Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statements about releasing source code for the described methodology or links to a code repository. |
| Open Datasets | Yes | The CIFAR-10 (Krizhevsky, 2009) data set consists of a set of natural images from 10 different object classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. The dataset is a labeled subset of the 80 million tiny images dataset (Torralba et al., 2008) and is divided into 50,000 train and 10,000 test images. |
| Dataset Splits | Yes | The dataset is a labeled subset of the 80 million tiny images dataset (Torralba et al., 2008) and is divided into 50,000 train and 10,000 test images. Each image is 32×32 pixels in 3 color channels, yielding input vectors with 3072 dimensions. We prepared the data by subtracting the mean and dividing by the standard deviation of each image vector. We train all models on a subset of 40,000 images and use the remaining 10,000 images as the validation set for the Bayesian optimization. |
| Hardware Specification | No | The paper mentions training on 'CPU and GPU' but does not provide specific hardware details such as GPU models, CPU models, or memory specifications. |
| Software Dependencies | No | The paper states 'All models are trained using Theano (Bastien et al., 2012; Bergstra et al., 2010)' but does not specify a version number for Theano or any other software dependencies. |
| Experiment Setup | Yes | We train all models using SGD with Nesterov momentum. The initial learning rate and momentum are chosen by Bayesian optimization. We optimize eighteen hyperparameters overall: initial learning rate on [0.01, 0.05], momentum on [0.80, 0.91], l2 weight decay on [5×10⁻⁵, 4×10⁻⁴], initialization coefficient on [0.8, 1.35] which scales the initial weights of the CNN, four separate dropout rates, five constants controlling the HSV data augmentation, and the four scaling constants controlling the networks' layer widths. The hyperparameters we optimized in the student models are: initial learning rate, momentum, scaling of the initially randomly distributed learnable parameters, scaling of all pixel values of the input, and the scale factors that control the width of all hidden and convolutional layers in the model. |
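The per-image preprocessing quoted in the Dataset Splits row (subtract the mean and divide by the standard deviation of each 3072-dimensional image vector) can be sketched as follows. This is a minimal NumPy illustration, not the paper's Theano code; the function name and array shapes are assumptions.

```python
import numpy as np

def standardize_images(x):
    """Per-image normalization as described in the paper's data preparation:
    flatten each 32x32x3 image to a 3072-dim vector, then subtract its mean
    and divide by its standard deviation."""
    x = x.reshape(x.shape[0], -1).astype(np.float64)
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / std

# Toy batch shaped like CIFAR-10 inputs (values stand in for pixel data)
batch = np.random.randint(0, 256, size=(4, 32, 32, 3))
normed = standardize_images(batch)
print(normed.shape)  # (4, 3072)
```

Note that this normalizes each image by its own statistics, not by per-channel statistics computed over the whole training set.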
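The training procedure in the Experiment Setup row (SGD with Nesterov momentum plus l2 weight decay) can be illustrated with a single parameter update. This is a hedged sketch of the standard Nesterov update rule, not the authors' Theano implementation; the default hyperparameter values are the lower ends of the search ranges the paper hands to Bayesian optimization (learning rate in [0.01, 0.05], momentum in [0.80, 0.91], weight decay in [5e-5, 4e-4]).

```python
import numpy as np

def nesterov_sgd_step(w, v, grad_fn, lr=0.01, momentum=0.80, weight_decay=5e-5):
    """One SGD update with Nesterov momentum and l2 weight decay.
    The gradient is evaluated at the lookahead point w + momentum * v,
    which is what distinguishes Nesterov momentum from classical momentum."""
    lookahead = w + momentum * v
    g = grad_fn(lookahead) + weight_decay * lookahead  # l2 penalty gradient
    v = momentum * v - lr * g
    return w + v, v

# Toy check: minimize f(w) = ||w||^2 / 2, whose gradient is simply w
w = np.array([1.0, -2.0])
v = np.zeros_like(w)
for _ in range(200):
    w, v = nesterov_sgd_step(w, v, grad_fn=lambda x: x, lr=0.05)
print(np.linalg.norm(w))  # converges toward 0
```

In the paper the learning rate, momentum, and weight decay shown here are not fixed but searched jointly with the other fifteen hyperparameters by Bayesian optimization on the 10,000-image validation split.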