Training Deep Networks without Learning Rates Through Coin Betting
Authors: Francesco Orabona, Tatiana Tommasi
NeurIPS 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Theoretical convergence is proven for convex and quasi-convex functions and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms. We run experiments on various datasets and architectures, comparing COCOB with some popular stochastic gradient learning algorithms. |
| Researcher Affiliation | Academia | Francesco Orabona, Department of Computer Science, Stony Brook University, Stony Brook, NY, francesco@orabona.com; Tatiana Tommasi, Department of Computer, Control, and Management Engineering, Sapienza Rome University, Italy, tommasi@dis.uniroma1.it |
| Pseudocode | Yes | Algorithm 1 (COntinuous COin Betting, COCOB) and Algorithm 2 (COCOB-Backprop); see the COCOB-Backprop update sketch after the table. |
| Open Source Code | Yes | We implemented COCOB (following Algorithm 2) in TensorFlow [Abadi et al., 2015] and we used the implementations of the other algorithms provided by this deep learning framework. The accompanying footnote provides the link: https://github.com/bremen79/cocob |
| Open Datasets | Yes | Digits Recognition. As a first test, we tackle handwritten digits recognition using the MNIST dataset [LeCun et al., 1998a]. Object Classification. We use the popular CIFAR-10 dataset [Krizhevsky, 2009] to classify 32×32 RGB images across 10 object categories. Word-level Prediction with RNN. Here we train a Recurrent Neural Network (RNN) on a language modeling task. Specifically, we conduct word-level prediction experiments on the Penn Tree Bank (PTB) dataset [Marcus et al., 1993]. |
| Dataset Splits | Yes | MNIST contains 28×28 grayscale images with 60k training data and 10k test samples. CIFAR-10 has 60k images in total, split into a training/test set of 50k/10k samples. We adopted the medium LSTM [Hochreiter and Schmidhuber, 1997] network architecture described in Zaremba et al. [2014]: it has 2 layers with 650 units per layer and parameters initialized uniformly in [-0.05, 0.05], a dropout of 50% is applied on the non-recurrent connections, and the norm of the gradients (normalized by mini-batch size = 20) is clipped at 5. (For PTB, the paper mentions 929k training words and 73k validation words, implying a train/validation split.) See the split check after the table. |
| Hardware Specification | Yes | The authors thank the Stony Brook Research Computing and Cyberinfrastructure, and the Institute for Advanced Computational Science at Stony Brook University for access to the high-performance SeaWulf computing system, which was made possible by a $1.4M National Science Foundation grant (#1531492). |
| Software Dependencies | No | The paper mentions: 'We implemented COCOB (following Algorithm 2) in Tensorflow [Abadi et al., 2015]'. It only names TensorFlow, without a specific version number. The other algorithms it compares against (AdaGrad, RMSProp, Adadelta, Adam) are used through the same framework, again without version information. |
| Experiment Setup | Yes | For the first network we reproduce the structure described in the multi-layer experiment of [Kingma and Ba, 2015]: it has two fully connected hidden layers with 1000 hidden units each and ReLU activations, with a mini-batch size of 100. The weights are initialized with a centered truncated normal distribution and standard deviation 0.1; the same small value 0.1 is also used as initialization for the bias. For CIFAR-10: we use a batch size of 128 and the input images are simply pre-processed by whitening. For PTB: it has 2 layers with 650 units per layer and parameters initialized uniformly in [-0.05, 0.05], a dropout of 50% is applied on the non-recurrent connections, and the norm of the gradients (normalized by mini-batch size = 20) is clipped at 5. A sketch of the first network follows the table. |
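As a companion to the pseudocode row, here is a minimal NumPy sketch of the per-coordinate COCOB-Backprop update as we read Algorithm 2. The state layout, variable names, and the small initialization of `L` are our choices, and `alpha = 100` is the default value reported in the paper; treat this as a sketch of the coin-betting update, not a transcription of the authors' implementation.

```python
import numpy as np

def cocob_backprop_step(state, grad, w_init, alpha=100.0):
    """One COCOB-Backprop update, applied coordinate-wise to a weight array.

    state = (L, G, reward, theta, w):
      L      -- largest absolute gradient seen so far (init: small eps to avoid 0-division)
      G      -- running sum of absolute gradients (init: 0)
      reward -- accumulated winnings of the coin-betting game, clipped at 0 (init: 0)
      theta  -- running sum of gradients (init: 0)
      w      -- current weights (init: w_init)
    """
    L, G, reward, theta, w = state
    L = np.maximum(L, np.abs(grad))
    G = G + np.abs(grad)
    reward = np.maximum(reward - grad * (w - w_init), 0.0)
    theta = theta + grad
    # Bet a signed fraction of the current wealth (L + reward) on each coordinate;
    # alpha caps how aggressive the bet can be during the first few updates.
    w = w_init - theta / (L * np.maximum(G + L, alpha * L)) * (L + reward)
    return (L, G, reward, theta, w)

# Hypothetical usage on a toy quadratic, just to show how the state is threaded.
w_init = np.zeros(3)
state = (np.full(3, 1e-8), np.zeros(3), np.zeros(3), np.zeros(3), w_init.copy())
target = np.array([1.0, -2.0, 0.5])
for _ in range(500):
    w = state[4]
    grad = w - target              # gradient of 0.5 * ||w - target||^2
    state = cocob_backprop_step(state, grad, w_init)
print(state[4])                    # should move toward `target` with no learning rate tuned
```

Note that the only hyper-parameter surfaced here is `alpha`, which the paper treats as a fixed constant rather than a tuned learning rate.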
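The MNIST and CIFAR-10 splits quoted in the dataset rows are easy to confirm with the Keras loaders bundled with TensorFlow. This is only a convenience check, not the pipeline the authors used, and PTB has no comparable built-in loader so it is omitted.

```python
import tensorflow as tf

# MNIST: 60k training / 10k test images of 28x28 grayscale digits.
(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.mnist.load_data()
print("MNIST:", x_tr.shape, x_te.shape)     # (60000, 28, 28) (10000, 28, 28)

# CIFAR-10: 50k training / 10k test 32x32 RGB images over 10 classes.
(x_tr, y_tr), (x_te, y_te) = tf.keras.datasets.cifar10.load_data()
print("CIFAR-10:", x_tr.shape, x_te.shape)  # (50000, 32, 32, 3) (10000, 32, 32, 3)
```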
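The first network in the setup row (the multi-layer MNIST experiment) can be sketched in Keras as below. The paper's own code is plain TensorFlow and pairs this model with COCOB; the softmax output layer, the cross-entropy loss, and the stand-in optimizer are assumptions added here only to make the sketch runnable.

```python
import tensorflow as tf

# Initializers from the setup row: truncated normal (std 0.1) for weights, 0.1 for biases.
w_init = tf.keras.initializers.TruncatedNormal(stddev=0.1)
b_init = tf.keras.initializers.Constant(0.1)

model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(1000, activation="relu",
                          kernel_initializer=w_init, bias_initializer=b_init),
    tf.keras.layers.Dense(1000, activation="relu",
                          kernel_initializer=w_init, bias_initializer=b_init),
    # 10-way softmax output assumed for digit classification (not spelled out in the row).
    tf.keras.layers.Dense(10, activation="softmax",
                          kernel_initializer=w_init, bias_initializer=b_init),
])

# Stand-in optimizer: COCOB is not a built-in Keras optimizer.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Mini-batch size of 100 as in the setup row; x_train / y_train are hypothetical names.
# model.fit(x_train / 255.0, y_train, batch_size=100, epochs=10)
```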