When and Why Are Deep Networks Better Than Shallow Ones?

Authors: Hrushikesh Mhaskar, Qianli Liao, Tomaso Poggio

AAAI 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this paper we describe a set of approximation theory results that include an answer to the first question, why and when deep networks are better than shallow ones, and suggest a possible answer to the second one. We formulate our results by using the idealized model of a deep network as a binary tree.
Researcher Affiliation | Academia | Hrushikesh Mhaskar (1, 2), Qianli Liao (3), Tomaso Poggio (3). (1) Department of Mathematics, California Institute of Technology, Pasadena, CA 91125; (2) Institute of Mathematical Sciences, Claremont Graduate University, Claremont, CA 91711; (3) Center for Brains, Minds, and Machines, McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA 02139
Pseudocode | No | No pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not mention providing open-source code for the methodology described.
Open Datasets | No | The paper refers to '60k training and 60k testing samples were drawn from a uniform distribution over [-2π, 2π]' for the example in Figure 2, which describes data generation rather than a publicly available dataset with concrete access information. CIFAR-10 is mentioned, but only in reference to other work: 'simple ConvNets with and without weight sharing perform similarly on CIFAR-10 (see (Poggio et al. 2016)).'
Dataset Splits | No | The paper states that '60k training and 60k testing samples were drawn from a uniform distribution over [-2π, 2π]' in the context of Figure 2, but does not explicitly mention a validation split or its size.
Hardware Specification | No | The paper mentions 'standard DLNN software' and 'MatConvNet' but does not specify any hardware details, such as the GPU or CPU models used for the experiments.
Software Dependencies | No | The paper mentions 'MatConvNet (Vedaldi and Lenc 2015)' and 'stochastic gradient descent with momentum 0.9 and learning rate 0.0001', but does not provide specific version numbers for software or libraries.
Experiment Setup | Yes | Mean squared error (MSE) was used as the objective function; the y axes in the paper's figures are the square root of the testing MSE. For the experiments with 2 and 3 hidden layers, batch normalization (Ioffe and Szegedy 2015) was used between every two hidden layers. 60k training and 60k testing samples were drawn from a uniform distribution over [-2π, 2π]. Training consisted of 2000 passes through the entire training data with mini-batches of size 3000. Stochastic gradient descent with momentum 0.9 and learning rate 0.0001 was used. (See the sketch following the table.)
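
The Experiment Setup row is concrete enough to sketch in code. Below is a minimal PyTorch reconstruction of the described training loop, offered only as an illustration: the target function, hidden-layer widths, and activation are assumptions (this section does not specify them), and the paper's own experiments were run with MatConvNet rather than PyTorch.

# Minimal PyTorch sketch of the training setup quoted above.
# Assumptions (not specified in this section): the target function, the
# hidden-layer widths, and the activation are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

def target(x):
    # Hypothetical 2-D compositional target, a stand-in for the paper's Figure 2 function.
    return torch.sin(x[:, :1]) * torch.cos(x[:, 1:2])

d_in, width = 2, 64

def make_split(n):
    # 60k training and 60k testing samples drawn uniformly from [-2*pi, 2*pi].
    x = (torch.rand(n, d_in) * 4 - 2) * torch.pi
    return x, target(x)

x_train, y_train = make_split(60_000)
x_test, y_test = make_split(60_000)

# Example 3-hidden-layer network with batch normalization between hidden layers,
# as described for the 2- and 3-hidden-layer experiments.
model = nn.Sequential(
    nn.Linear(d_in, width), nn.ReLU(),
    nn.BatchNorm1d(width),
    nn.Linear(width, width), nn.ReLU(),
    nn.BatchNorm1d(width),
    nn.Linear(width, width), nn.ReLU(),
    nn.Linear(width, 1),
)

loss_fn = nn.MSELoss()  # MSE objective; the paper plots the square root of the testing MSE.
opt = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)

# 2000 passes over the training data with mini-batches of size 3000
# (reduce epochs for a quick check; the full run is slow on CPU).
epochs, batch_size = 2000, 3000
for epoch in range(epochs):
    perm = torch.randperm(x_train.size(0))
    for i in range(0, x_train.size(0), batch_size):
        idx = perm[i:i + batch_size]
        opt.zero_grad()
        loss = loss_fn(model(x_train[idx]), y_train[idx])
        loss.backward()
        opt.step()

model.eval()
with torch.no_grad():
    test_rmse = loss_fn(model(x_test), y_test).sqrt().item()
print(f"sqrt(test MSE): {test_rmse:.4f}")

The BatchNorm1d layers are placed between consecutive hidden layers to mirror the quoted description; everything else follows the hyperparameters listed in the Experiment Setup row.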