Do Deep Nets Really Need to be Deep?

Authors: Jimmy Ba, Rich Caruana

NeurIPS 2014 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models. (A hedged sketch of the paper's mimic-learning setup appears after this table.)
Researcher Affiliation | Collaboration | Lei Jimmy Ba, University of Toronto, jimmy@psi.utoronto.ca; Rich Caruana, Microsoft Research, rcaruana@microsoft.com
Pseudocode | No | The paper describes the methods in prose but does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | The TIMIT speech corpus has 462 speakers in the training set... CIFAR-10 consists of a set of natural images from 10 different object classes... The dataset is a labeled subset of the 80 million tiny images dataset [18] and is divided into 50,000 train and 10,000 test images.
Dataset Splits | Yes | The TIMIT speech corpus has 462 speakers in the training set, a separate development set for cross-validation that includes 50 speakers, and a final test set with 24 speakers. (See the split-loading sketch after this table.)
Hardware Specification | Yes | In our experiments the deep models usually required 8-12 hours to train on Nvidia GTX 580 GPUs to reach the state-of-the-art performance on TIMIT and CIFAR-10 datasets.
Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., specific library names like PyTorch, TensorFlow, or scikit-learn with their versions).
Experiment Setup | No | The paper describes general aspects of the experimental setup, such as network architectures (e.g., '2000 rectified linear units per layer'), data pre-processing steps (e.g., '25ms Hamming window shifting by 10ms', 'ZCA whitening'), and general training algorithms ('stochastic gradient descent with momentum'), but it does not provide specific numerical hyperparameters such as learning rates or batch sizes. (A ZCA-whitening sketch follows this table.)
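
The Research Type row summarizes the paper's core technique: a shallow "student" net, with a linear bottleneck layer factorizing its single wide hidden layer, is trained to regress the logits (pre-softmax outputs) of a deep "teacher" using an L2 loss. The sketch below is a minimal illustration of that idea; the class name, layer sizes, and the choice of PyTorch are assumptions for illustration, since the paper itself releases no code.

```python
# Minimal sketch of mimic learning as described in the paper: a shallow net
# with a linear bottleneck is trained to match a deep teacher's logits via
# L2 regression. Names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class ShallowMimicNet(nn.Module):
    def __init__(self, n_in, n_bottleneck, n_hidden, n_out):
        super().__init__()
        # Linear bottleneck: factorizes the large input-to-hidden weight matrix.
        self.bottleneck = nn.Linear(n_in, n_bottleneck)
        self.hidden = nn.Linear(n_bottleneck, n_hidden)
        self.out = nn.Linear(n_hidden, n_out)

    def forward(self, x):
        h = torch.relu(self.hidden(self.bottleneck(x)))
        return self.out(h)  # logits, trained to match the teacher's logits

def mimic_loss(student_logits, teacher_logits):
    # L2 regression on the teacher's unnormalized log-probability values.
    return 0.5 * ((student_logits - teacher_logits) ** 2).sum(dim=1).mean()
```

In the paper, the teacher's logits are collected by passing the training data through the trained deep model, and the shallow mimic is then optimized on those targets with stochastic gradient descent.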
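The Open Datasets and Dataset Splits rows quote the 50,000/10,000 CIFAR-10 train/test split. As a quick illustration of obtaining that split, here is a hedged sketch using torchvision; the paper names no loading library, so this tooling is purely an assumption.

```python
# Hedged sketch: loading the CIFAR-10 train/test split quoted above.
# torchvision is an assumed tool; the paper itself names no library.
from torchvision import datasets

train_set = datasets.CIFAR10(root="./data", train=True, download=True)
test_set = datasets.CIFAR10(root="./data", train=False, download=True)
print(len(train_set), len(test_set))  # expected: 50000 10000
```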
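The Experiment Setup row notes that CIFAR-10 images were preprocessed with ZCA whitening but gives no further detail. Below is a minimal numpy sketch of a standard ZCA whitening transform; the function name and the epsilon regularizer are assumptions rather than values taken from the paper.

```python
# Standard ZCA whitening on flattened images; eps is an assumed regularizer.
import numpy as np

def zca_whiten(X, eps=1e-5):
    """X: (n_samples, n_features) array of flattened images."""
    X = X - X.mean(axis=0)                         # center each feature
    cov = np.cov(X, rowvar=False)                  # feature covariance
    U, S, _ = np.linalg.svd(cov)                   # eigenvectors / eigenvalues
    W = U @ np.diag(1.0 / np.sqrt(S + eps)) @ U.T  # symmetric whitening matrix
    return X @ W
```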