Training Very Deep Networks
Authors: Rupesh K. Srivastava, Klaus Greff, Jürgen Schmidhuber
NeurIPS 2015
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Section 3 (Experiments): All networks were trained using SGD with momentum. An exponentially decaying learning rate was used in Section 3.1. For the remaining experiments, a simpler, commonly used strategy was employed in which the learning rate starts at a value λ and decays by a factor γ according to a fixed schedule. λ, γ and the schedule were selected once based on validation set performance on the CIFAR-10 dataset, and kept fixed for all experiments (a schedule sketch follows the table). |
| Researcher Affiliation | Academia | The Swiss AI Lab IDSIA / USI / SUPSI {rupesh, klaus, juergen}@idsia.ch |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Source code, hyperparameter search results and related scripts are publicly available at http://people.idsia.ch/~rupesh/very_deep_learning/. |
| Open Datasets | Yes | We trained both plain and highway networks of varying depths on the MNIST digit classification dataset. ... Experiments on CIFAR-10 and CIFAR-100 Object Recognition |
| Dataset Splits | Yes | λ, γ and the schedule were selected once based on validation set performance on the CIFAR-10 dataset, and kept fixed for all experiments. |
| Hardware Specification | No | The paper mentions 'NVIDIA Corporation for their donation of GPUs' but does not specify exact GPU models or other detailed hardware specifications for running experiments. |
| Software Dependencies | No | The paper mentions 'Experiments were conducted using Caffe [33] and Brainstorm (https://github.com/IDSIA/brainstorm) frameworks' but does not give version numbers for these frameworks or for other software dependencies. |
| Experiment Setup | Yes | All networks were trained using SGD with momentum. ... hyperparameters: initial learning rate, momentum, learning rate exponential decay factor & activation function (either rectified linear or tanh). For highway networks, an additional hyperparameter was the initial value for the transform gate bias (between -1 and -10); a highway-layer sketch with this initialization follows the table. |
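The learning-rate strategy quoted in the Research Type row is simple enough to sketch. Below is a minimal Python illustration of a step-decay schedule: the rate starts at λ and is multiplied by γ at fixed milestone epochs. The concrete values (`0.1`, milestones at epochs 80 and 120) are assumptions for illustration, not the paper's tuned settings, which were selected once on the CIFAR-10 validation set.

```python
def step_decay_lr(initial_lr, gamma, milestones, epoch):
    """Return the learning rate for `epoch` under a fixed step-decay schedule."""
    lr = initial_lr
    for m in milestones:
        if epoch >= m:
            lr *= gamma  # decay by a fixed factor at each milestone
    return lr

# Illustrative usage: lr starts at 0.1 and is cut 10x at epochs 80 and 120
# (these numbers are assumed, not taken from the paper).
for epoch in (0, 79, 80, 120):
    print(epoch, step_decay_lr(0.1, gamma=0.1, milestones=(80, 120), epoch=epoch))
```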
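For the transform gate bias in the Experiment Setup row, here is a minimal PyTorch sketch of a single highway layer, y = H(x)·T(x) + x·(1 − T(x)), with the gate bias initialized to a negative constant so the layer initially carries its input through. The layer width, depth, and the bias value of -2.0 are illustrative assumptions; the paper searched initial bias values between -1 and -10.

```python
import torch
import torch.nn as nn

class HighwayLayer(nn.Module):
    """One highway layer: y = H(x) * T(x) + x * (1 - T(x))."""

    def __init__(self, dim, gate_bias_init=-2.0):  # -2.0 is an assumed value in [-10, -1]
        super().__init__()
        self.plain = nn.Linear(dim, dim)  # H(x, W_H): the plain transformation
        self.gate = nn.Linear(dim, dim)   # T(x, W_T): the transform gate
        # Negative bias => sigmoid(T) starts near 0, so the layer initially
        # passes its input through almost unchanged (carry behavior).
        nn.init.constant_(self.gate.bias, gate_bias_init)

    def forward(self, x):
        t = torch.sigmoid(self.gate(x))  # transform gate in (0, 1)
        h = torch.relu(self.plain(x))    # rectified-linear block activation
        return h * t + x * (1.0 - t)     # carry gate C = 1 - T

# Illustrative usage: stack 10 highway layers of width 64 (assumed sizes).
model = nn.Sequential(*[HighwayLayer(64) for _ in range(10)])
y = model(torch.randn(8, 64))
```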