Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Authors: Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, Riccardo Zecchina

ICLR 2017 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments on convolutional and recurrent neural networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
Researcher Affiliation | Collaboration | 1 Computer Science Department, University of California, Los Angeles; 2 Department of Electrical and Computer Engineering, New York University; 3 Courant Institute of Mathematical Sciences, New York University; 4 Facebook AI Research, New York; 5 Dipartimento di Scienza Applicata e Tecnologia, Politecnico di Torino; 6 Microsoft Research New England, Cambridge
Pseudocode | Yes | Algorithm 1: Entropy-SGD algorithm (a code sketch of this algorithm follows the table)
Open Source Code | No | The paper links to a third-party tool on GitHub (https://github.com/HIPS/autograd) but does not provide access to the authors' own Entropy-SGD implementation.
Open Datasets | Yes | Our experiments on convolutional and recurrent neural networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
Dataset Splits | Yes | This dataset contains about one million words divided into a training set of about 930,000 words, a validation set of about 74,000 words, and a test set of about 82,000 words, with a vocabulary of size 10,000.
Hardware Specification | No | No specific hardware details such as GPU/CPU models, memory, or other machine specifications used for running the experiments are provided.
Software Dependencies | No | The paper mentions software such as autograd and optimization methods such as Adam and Nesterov's momentum, and implies the use of a deep learning framework, but does not specify version numbers for any of these dependencies.
Experiment Setup | Yes | We train for 100 epochs with Adam and a learning rate of 10^-3 that drops by a factor of 5 after every 30 epochs to obtain an average error of 1.39 ± 0.03% and 0.51 ± 0.01% for mnistfc and LeNet respectively, over 5 independent runs. For both these networks, we train Entropy-SGD for 5 epochs with L = 20 and reduce the dropout probability (0.15 for mnistfc and 0.25 for LeNet). The learning rate of the SGLD updates is fixed to η′ = 0.1 while the outer loop's learning rate is set to η = 1 and drops by a factor of 10 after the second epoch; we use Nesterov's momentum for both loops. The thermal noise in SGLD updates (line 5 of Alg. 1) is set to 10^-3. We use an exponentially increasing value of γ for scoping: the initial value of the scope is set to γ = 10^-4 and it increases by a factor of 1.001 after each parameter update. (A code sketch using these values follows the table.)
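
The Pseudocode row above points to Algorithm 1 (Entropy-SGD). The following is a minimal NumPy sketch of one outer update, written from the published structure of the algorithm: an inner SGLD loop explores a neighbourhood of the current weights to estimate the gradient of the local-entropy objective, and the outer step moves the weights toward the running average of the SGLD iterates. The stochastic-gradient callable grad_f and the averaging weight alpha are illustrative assumptions, and Nesterov's momentum (used in the paper for both loops) is omitted for brevity.

```python
import numpy as np

def entropy_sgd_step(x, grad_f, L=20, gamma=1e-4, eta=1.0,
                     eta_prime=0.1, eps=1e-3, alpha=0.75, rng=None):
    """One outer Entropy-SGD update, following the structure of Algorithm 1.

    x         -- current weights (1-D NumPy array)
    grad_f    -- callable returning a stochastic gradient of the loss at a point
                 (assumed interface; stands in for a minibatch gradient)
    L         -- number of inner SGLD iterations
    gamma     -- scope of the local-entropy term
    eta       -- outer learning rate
    eta_prime -- SGLD (inner) learning rate
    eps       -- thermal-noise magnitude
    alpha     -- exponential-averaging weight for the SGLD iterates (assumed value)
    """
    if rng is None:
        rng = np.random.default_rng(0)
    x_prime = x.copy()   # SGLD chain starts at the current weights
    mu = x.copy()        # running average of the SGLD iterates
    for _ in range(L):
        # gradient of the local-entropy objective around x
        dx = grad_f(x_prime) - gamma * (x - x_prime)
        # Langevin step: gradient descent plus thermal noise
        x_prime = (x_prime - eta_prime * dx
                   + np.sqrt(eta_prime) * eps * rng.standard_normal(x.shape))
        mu = (1.0 - alpha) * mu + alpha * x_prime
    # outer update: move the weights toward the SGLD average
    return x - eta * gamma * (x - mu)
```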
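
The Experiment Setup row also describes "scoping": γ starts at 10^-4 and grows by a factor of 1.001 after every parameter update, which gradually increases the size of the outer step η·γ·(x − μ). A toy usage of the sketch above with that schedule might look as follows; the noisy quadratic gradient is a hypothetical stand-in for a network's minibatch gradient.

```python
# Toy usage of entropy_sgd_step with the scoping schedule from the setup row.
rng = np.random.default_rng(0)
# gradient of the loss ||w||^2 plus a little noise, standing in for a network
grad_f = lambda w: 2.0 * w + 0.01 * rng.standard_normal(w.shape)

x = rng.standard_normal(10)
gamma = 1e-4                       # initial scope
for step in range(2000):
    x = entropy_sgd_step(x, grad_f, L=20, gamma=gamma,
                         eta=1.0, eta_prime=0.1, eps=1e-3, rng=rng)
    gamma *= 1.001                 # exponential scoping after each update
```

With a small initial γ the outer step is tiny, which is why γ is scoped upward over training: early updates average broadly over the SGLD neighbourhood, while later updates behave more like plain SGD.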