Do Deep Nets Really Need to be Deep?
Authors: Jimmy Ba, Rich Caruana
NeurIPS 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper we empirically demonstrate that shallow feed-forward nets can learn the complex functions previously learned by deep nets and achieve accuracies previously only achievable with deep models. On the TIMIT phoneme recognition and CIFAR-10 image recognition tasks, shallow nets can be trained that perform similarly to complex, well-engineered, deeper convolutional models. |
| Researcher Affiliation | Collaboration | Lei Jimmy Ba University of Toronto jimmy@psi.utoronto.ca Rich Caruana Microsoft Research rcaruana@microsoft.com |
| Pseudocode | No | The paper describes the methods in prose but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | The TIMIT speech corpus has 462 speakers in the training set... CIFAR-10 consists of a set of natural images from 10 different object classes... The dataset is a labeled subset of the 80 million tiny images dataset[18] and is divided into 50,000 train and 10,000 test images. |
| Dataset Splits | Yes | The TIMIT speech corpus has 462 speakers in the training set, a separate development set for cross-validation that includes 50 speakers, and a final test set with 24 speakers. |
| Hardware Specification | Yes | In our experiments the deep models usually required 8–12 hours to train on Nvidia GTX 580 GPUs to reach the state-of-the-art performance on TIMIT and CIFAR-10 datasets. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers (e.g., specific library names like PyTorch, TensorFlow, or scikit-learn with their versions). |
| Experiment Setup | No | The paper describes general aspects of the experimental setup, such as network architectures (e.g., '2000 rectified linear units per layer'), data pre-processing steps (e.g., '25ms Hamming window shifting by 10ms', 'ZCA whitening'), and general training algorithms ('stochastic gradient descent with momentum'), but it does not provide specific numerical hyperparameters such as learning rates or batch sizes (illustrative sketches of these steps appear below the table). |
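
The experiment-setup row quotes a 25ms Hamming window shifted by 10ms for the TIMIT front end. The snippet below is a minimal sketch of that windowing step only, assuming the standard 16 kHz TIMIT sampling rate; the paper's actual pipeline computes filter-bank features on top of such frames, which is not shown here.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slice a waveform into overlapping frames and apply a Hamming window.

    At 16 kHz, a 25 ms window is 400 samples and a 10 ms hop is 160 samples.
    """
    win_len = int(sample_rate * win_ms / 1000)   # 400 samples
    hop_len = int(sample_rate * hop_ms / 1000)   # 160 samples
    window = np.hamming(win_len)

    n_frames = 1 + max(0, (len(signal) - win_len) // hop_len)
    frames = np.empty((n_frames, win_len))
    for i in range(n_frames):
        start = i * hop_len
        frames[i] = signal[start:start + win_len] * window
    return frames

# Example: one second of dummy audio yields 98 windowed frames of 400 samples.
frames = frame_signal(np.random.randn(16000))
print(frames.shape)  # (98, 400)
```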
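The CIFAR-10 pre-processing the paper names is ZCA whitening. A minimal NumPy sketch follows, assuming flattened 32x32x3 images and a small regularization constant `eps`; the paper does not report the exact value used.

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten flattened images.

    X: (n_samples, n_features) float array. Returns the whitened data plus
    the mean and transform so the same whitening can be applied to test data.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    cov = Xc.T @ Xc / Xc.shape[0]
    # Eigendecomposition of the symmetric covariance matrix.
    eigvals, U = np.linalg.eigh(cov)
    # ZCA transform: rotate, rescale each component, rotate back.
    W = U @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ U.T
    return Xc @ W, mean, W

# Example on random stand-ins for flattened CIFAR-10 images (3072 features).
X_train = np.random.rand(1000, 3072).astype(np.float64)
X_white, mean, W = zca_whiten(X_train)
```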
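For training, the paper reports layers of 2000 rectified linear units optimized with stochastic gradient descent with momentum, but no numerical hyperparameters. The sketch below is illustrative only: the depth, input/output sizes, learning rate, momentum, and batch size are assumptions, and PyTorch is used purely for convenience (the original 2014 experiments predate it).

```python
import torch
from torch import nn

# The 2000-unit ReLU layer width follows the quote above; every other value
# here (depth, feature/class counts, lr, momentum, batch size) is a placeholder.
n_features, n_classes, hidden = 360, 183, 2000

model = nn.Sequential(
    nn.Linear(n_features, hidden), nn.ReLU(),
    nn.Linear(hidden, hidden), nn.ReLU(),
    nn.Linear(hidden, n_classes),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
loss_fn = nn.CrossEntropyLoss()

# One dummy update step on random data, just to show the training loop shape.
x = torch.randn(128, n_features)
y = torch.randint(0, n_classes, (128,))
optimizer.zero_grad()
loss = loss_fn(model(x), y)
loss.backward()
optimizer.step()
print(float(loss))
```

Note that the paper's shallow mimic models are not trained this way on class labels; they are trained by regressing on the logits of an already-trained deep model, which is central to how the shallow nets reach deep-net accuracy.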