Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods
Authors: Jascha Sohl-Dickstein, Ben Poole, Surya Ganguli
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally demonstrate improved convergence on seven diverse optimization problems. |
| Researcher Affiliation | Collaboration | Jascha Sohl-Dickstein JASCHA@{STANFORD.EDU,KHANACADEMY.ORG}; Ben Poole POOLE@CS.STANFORD.EDU; Surya Ganguli SGANGULI@STANFORD.EDU |
| Pseudocode | No | The paper describes the algorithm using numbered steps and mathematical equations, but it is not presented in a formal pseudocode block or labeled as such. |
| Open Source Code | Yes | Open source code which implements the proposed technique and all competing optimizers, and which directly generates the plots in Figures 1, 2, and 3, is provided at https://github.com/Sohl-Dickstein/Sum-of-Functions-Optimizer. (A hedged usage sketch follows this table.) |
| Open Datasets | Yes | A logistic regression objective, chosen to be the same as one used in (Roux et al., 2012). A contractive autoencoder with 784 visible units, and 256 hidden units, similar to the one in (Rifai et al., 2011). An Independent Components Analysis (ICA) (Bell & Sejnowski, 1995) model with Student's t-distribution prior. An Ising model / Hopfield network trained using code from (Hillar et al., 2012) implementing MPF (Sohl-Dickstein et al., 2011b;a). A multilayer perceptron with a similar architecture to (Hinton et al., 2012), with layer sizes of 784, 1200, 1200, and 10. Training used Theano (Bergstra & Breuleux, 2010). A deep convolutional network with max pooling and rectified linear units, similar to (Goodfellow & Warde-Farley, 2013a), with two convolutional layers with 48 and 128 units, and one fully connected layer with 240 units. Training used Theano and Pylearn2 (Goodfellow & Warde-Farley, 2013b). In Figure 4, a twelve layer neural network was trained on cross entropy reconstruction error for the CURVES dataset. [...] trained on MNIST digits |
| Dataset Splits | No | The paper mentions dividing training data into minibatches (N=100) or chunks (for Hessian-free), but does not specify explicit train/validation/test splits, percentages, or methodology for reproducibility. |
| Hardware Specification | Yes | CPU: all computations were performed on a 2012 Intel i7-3970X CPU (6 cores, 3.5 GHz). GPU: subspace projection was performed on a GeForce GTX 660 Ti GPU. |
| Software Dependencies | No | The paper mentions software like Theano and Pylearn2, and that the algorithm is released as Python and MATLAB packages, but it does not specify version numbers for these software dependencies (e.g., 'Theano 0.x' or 'Pylearn2 0.y'). |
| Experiment Setup | Yes | For SAG, SGD, and ADAGrad the hyperparameter was chosen by a grid search. The best hyperparameter value, and the hyperparameter values immediately larger and smaller in the grid search, are shown in the plots and legends for each model in Figure 3. In SGD+momentum the two hyperparameters for step size and momentum coefficient were chosen by a grid search, but only the best parameter values are shown. The grid-searched momenta were 0.5, 0.9, 0.95, and 0.99, and the grid-searched step lengths were all integer powers of ten between 10⁻⁵ and 10². For all other experiments and optimizers the training data was divided into N = 100 minibatches (or subfunctions). (A hedged sketch of this grid search also follows the table.) |
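The repository linked in the Open Source Code row ships the optimizer as a Python package (the paper also mentions a MATLAB port). Below is a minimal usage sketch, not a definitive example: the `SFO(f_df, theta_init, subfunction_references)` constructor and the `optimize(num_passes=...)` method are assumed from memory of the repository's README, and the toy logistic-regression objective and data are invented for illustration. Verify the signatures against the current repository before relying on them.

```python
import numpy as np
from sfo import SFO  # package from the linked repository (assumed importable as `sfo`)

def f_df(theta, minibatch):
    """Return (objective, gradient) for one subfunction (here, one minibatch)."""
    X, y = minibatch                              # toy data; labels y in {-1, +1}
    margins = y * (X @ theta)
    f = np.sum(np.log1p(np.exp(-margins)))        # logistic loss on this minibatch
    df = -X.T @ (y / (1.0 + np.exp(margins)))     # its gradient with respect to theta
    return f, df

# Invented data, split into N = 100 minibatches (subfunctions) as in the paper's experiments.
rng = np.random.RandomState(0)
X = rng.randn(10000, 50)
y = np.sign(rng.randn(10000))
N = 100
subfunctions = [(X[i::N], y[i::N]) for i in range(N)]

theta_init = np.zeros(50)
# Constructor and optimize() signature assumed from the repository README -- check there.
optimizer = SFO(f_df, theta_init, subfunctions)
theta = optimizer.optimize(num_passes=20)         # roughly 20 effective passes through the data
```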
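For the Experiment Setup row, the sketch below spells out the SGD+momentum hyperparameter grid described in the paper. Only the grid values come from the paper (momenta 0.5, 0.9, 0.95, 0.99; step lengths at integer powers of ten from 10⁻⁵ to 10²); the least-squares objective, minibatch construction, and training loop are invented placeholders for illustration, not the paper's models.

```python
import itertools
import numpy as np

# Invented least-squares problem, split into N = 100 minibatches for illustration.
rng = np.random.RandomState(0)
X, w_true = rng.randn(10000, 20), rng.randn(20)
y = X @ w_true + 0.1 * rng.randn(10000)
N = 100
batches = [(X[i::N], y[i::N]) for i in range(N)]

def train_sgd_momentum(step_length, momentum, passes=5):
    """One SGD+momentum run over the minibatches; returns the final full-data MSE."""
    w, v = np.zeros(20), np.zeros(20)
    with np.errstate(all="ignore"):               # large step lengths may diverge
        for _ in range(passes):
            for Xb, yb in batches:
                grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
                v = momentum * v - step_length * grad
                w = w + v
        return float(np.mean((X @ w - y) ** 2))

# Grid values from the paper: four momenta, step lengths at integer powers of ten.
momenta = [0.5, 0.9, 0.95, 0.99]
step_lengths = [10.0 ** p for p in range(-5, 3)]  # 1e-5 ... 1e2

results = []
for momentum, step in itertools.product(momenta, step_lengths):
    loss = train_sgd_momentum(step, momentum)
    if np.isfinite(loss):                         # discard diverged runs
        results.append((loss, (momentum, step)))

best_loss, best_params = min(results)
print("best (momentum, step length):", best_params, "final MSE:", best_loss)
```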