Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
Authors: Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, Riccardo Zecchina
ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on convolutional and recurrent neural networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time. |
| Researcher Affiliation | Collaboration | 1 Computer Science Department, University of California, Los Angeles; 2 Department of Electrical and Computer Engineering, New York University; 3 Courant Institute of Mathematical Sciences, New York University; 4 Facebook AI Research, New York; 5 Dipartimento di Scienza Applicata e Tecnologia, Politecnico di Torino; 6 Microsoft Research New England, Cambridge |
| Pseudocode | Yes | Algorithm 1: Entropy-SGD algorithm (a Python sketch of this update follows the table). |
| Open Source Code | No | The paper links to a third-party tool (autograd, https://github.com/HIPS/autograd) but does not provide a link to the authors' own implementation of Entropy-SGD. |
| Open Datasets | Yes | The experiments use standard, publicly available benchmarks: MNIST and CIFAR-10 for the convolutional networks, and the Penn Treebank and War and Peace corpora for the recurrent networks. |
| Dataset Splits | Yes | This dataset contains about one million words divided into a training set of about 930,000 words, a validation set of 74,000 words and a test set of 82,000 words, with a vocabulary of size 10,000. |
| Hardware Specification | No | The paper does not report the hardware used to run the experiments (no GPU/CPU models, memory, or other machine specifications). |
| Software Dependencies | No | The paper mentions the autograd library and optimization methods such as Adam and Nesterov’s momentum, and implies the use of a deep learning framework, but does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | We train for 100 epochs with Adam and a learning rate of 10^-3 that drops by a factor of 5 after every 30 epochs to obtain an average error of 1.39 ± 0.03% and 0.51 ± 0.01% for mnistfc and LeNet respectively, over 5 independent runs. For both these networks, we train Entropy-SGD for 5 epochs with L = 20 and reduce the dropout probability (0.15 for mnistfc and 0.25 for LeNet). The learning rate of the SGLD updates is fixed to η′ = 0.1 while the outer loop’s learning rate is set to η = 1 and drops by a factor of 10 after the second epoch; we use Nesterov’s momentum for both loops. The thermal noise in SGLD updates (line 5 of Alg. 1) is set to 10^-3. We use an exponentially increasing value of γ for scoping; the initial value of the scope is set to γ = 10^-4 and this increases by a factor of 1.001 after each parameter update. (A sketch of this schedule also follows the table.) |
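
The pseudocode row above refers to Algorithm 1 of the paper. Below is a minimal NumPy sketch of that update written against a toy least-squares loss rather than a neural network; the names (`loss_grad`, `entropy_sgd_step`), the toy data, and the exponential-averaging constant `alpha` are illustrative assumptions, not the authors' released code. The inner SGLD loop produces a running mean `mu`, and the outer step descends the local-entropy gradient γ(x − μ).

```python
# Minimal sketch of Entropy-SGD (cf. Algorithm 1 of the paper) on a toy problem.
# The toy loss, function names, and alpha are assumptions for illustration only.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression problem standing in for the training loss f(x).
A_full = rng.normal(size=(512, 10))
x_true = rng.normal(size=10)
b_full = A_full @ x_true + 0.1 * rng.normal(size=512)

def loss_grad(x, idx):
    """Stochastic gradient of 0.5 * ||A x - b||^2 on minibatch `idx`."""
    A, b = A_full[idx], b_full[idx]
    return A.T @ (A @ x - b) / len(idx)

def entropy_sgd_step(x, gamma, L=20, eta=1.0, eta_prime=0.1,
                     eps=1e-3, alpha=0.75, batch=32):
    """One outer Entropy-SGD update: L SGLD iterations on the modified objective
    f(x') + (gamma/2) * ||x - x'||^2 yield a running mean mu, and the outer step
    then descends the local-entropy gradient gamma * (x - mu)."""
    x_prime, mu = x.copy(), x.copy()
    for _ in range(L):
        idx = rng.choice(len(b_full), size=batch, replace=False)
        dx = loss_grad(x_prime, idx) - gamma * (x - x_prime)          # SGLD drift
        noise = np.sqrt(eta_prime) * eps * rng.normal(size=x.shape)   # thermal noise
        x_prime = x_prime - eta_prime * dx + noise
        mu = (1 - alpha) * mu + alpha * x_prime                       # running average
    return x - eta * gamma * (x - mu)

# "Scoping": start with a small gamma and increase it after every parameter update.
x, gamma = np.zeros(10), 1e-4
print("distance to x_true before:", np.linalg.norm(x - x_true))
for _ in range(2000):
    x = entropy_sgd_step(x, gamma)
    gamma *= 1.001
print("distance to x_true after: ", np.linalg.norm(x - x_true))
```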
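
The schedule quoted in the last row (outer learning rate η = 1 dropping by a factor of 10 after the second epoch; γ starting at 10^-4 and growing by a factor of 1.001 per parameter update) can be written as a small helper. The function name and signature below are hypothetical, chosen only to mirror the quoted numbers.

```python
# Hypothetical helper mirroring the quoted Entropy-SGD schedule; not the authors' code.
def entropy_sgd_schedule(epoch, num_updates, base_lr=1.0, base_gamma=1e-4):
    """Outer learning rate drops by 10x after the second epoch (epochs counted
    from 0 here); the scope gamma grows by 1.001 after every parameter update."""
    lr = base_lr * (0.1 if epoch >= 2 else 1.0)
    gamma = base_gamma * (1.001 ** num_updates)
    return lr, gamma

# Example: partway through training, the outer learning rate has dropped and
# gamma has grown, tightening the local-entropy approximation ("scoping").
print(entropy_sgd_schedule(epoch=3, num_updates=2000))
```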