Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods
Authors: Jascha Sohl-Dickstein, Ben Poole, Surya Ganguli
ICML 2014
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally demonstrate improved convergence on seven diverse optimization problems. |
| Researcher Affiliation | Collaboration | Jascha Sohl-Dickstein JASCHA@{STANFORD.EDU,KHANACADEMY.ORG}; Ben Poole POOLE@CS.STANFORD.EDU; Surya Ganguli SGANGULI@STANFORD.EDU |
| Pseudocode | No | The paper describes the algorithm using numbered steps and mathematical equations, but it is not presented in a formal pseudocode block or labeled as such. |
| Open Source Code | Yes | Open source code which implements the proposed technique and all competing optimizers, and which directly generates the plots in Figures 1, 2, and 3, is provided at https://github.com/Sohl-Dickstein/Sum-of-Functions-Optimizer. (A hedged usage sketch follows this table.) |
| Open Datasets | Yes | A logistic regression objective, chosen to be the same as one used in (Roux et al., 2012). A contractive autoencoder with 784 visible units, and 256 hidden units, similar to the one in (Rifai et al., 2011). An Independent Components Analysis (ICA) (Bell & Sejnowski, 1995) model with Student's t-distribution prior. An Ising model / Hopfield network trained using code from (Hillar et al., 2012) implementing MPF (Sohl-Dickstein et al., 2011b;a). A multilayer perceptron with a similar architecture to (Hinton et al., 2012), with layer sizes of 784, 1200, 1200, and 10. Training used Theano (Bergstra & Breuleux, 2010). A deep convolutional network with max pooling and rectified linear units, similar to (Goodfellow & Warde-Farley, 2013a), with two convolutional layers with 48 and 128 units, and one fully connected layer with 240 units. Training used Theano and Pylearn2 (Goodfellow & Warde-Farley, 2013b). In Figure 4, a twelve layer neural network was trained on cross entropy reconstruction error for the CURVES dataset. [...] trained on MNIST digits |
| Dataset Splits | No | The paper mentions dividing training data into minibatches (N=100) or chunks (for Hessian-free), but does not specify explicit train/validation/test splits, percentages, or methodology for reproducibility. |
| Hardware Specification | Yes | CPU: all computations were performed on a 2012 Intel i7-3970X CPU (6 cores, 3.5 GHz). GPU: subspace projection was performed on a GeForce GTX 660 Ti GPU. |
| Software Dependencies | No | The paper mentions software like Theano and Pylearn2, and that the algorithm is released as Python and MATLAB packages, but it does not specify version numbers for these software dependencies (e.g., 'Theano 0.x' or 'Pylearn2 0.y'). |
| Experiment Setup | Yes | For SAG, SGD, and ADAGrad the hyperparameter was chosen by a grid search. The best hyperparameter value, and the hyperparameter values immediately larger and smaller in the grid search, are shown in the plots and legends for each model in Figure 3. In SGD+momentum the two hyperparameters for step size and momentum coefficient were chosen by a grid search, but only the best parameter values are shown. The grid-searched momenta were 0.5, 0.9, 0.95, and 0.99, and the grid-searched step lengths were all integer powers of ten between 10⁻⁵ and 10². For all other experiments and optimizers the training data was divided into N = 100 minibatches (or subfunctions). (A hedged sketch of this grid search also follows the table.) |
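The repository linked in the Open Source Code row ships the optimizer as a Python package (the paper also mentions a MATLAB port). Below is a minimal usage sketch, not a definitive example: the `SFO(f_df, theta_init, subfunction_references)` constructor and the `optimize(num_passes=...)` method are assumed from memory of the repository's README, and the toy logistic-regression objective and data are invented for illustration. Verify the signatures against the current repository before relying on them.

```python
import numpy as np
from sfo import SFO  # package from the linked repository (assumed importable as `sfo`)

def f_df(theta, minibatch):
    """Return (objective, gradient) for one subfunction (here, one minibatch)."""
    X, y = minibatch                              # toy data; labels y in {-1, +1}
    margins = y * (X @ theta)
    f = np.sum(np.log1p(np.exp(-margins)))        # logistic loss on this minibatch
    df = -X.T @ (y / (1.0 + np.exp(margins)))     # its gradient with respect to theta
    return f, df

# Invented data, split into N = 100 minibatches (subfunctions) as in the paper's experiments.
rng = np.random.RandomState(0)
X = rng.randn(10000, 50)
y = np.sign(rng.randn(10000))
N = 100
subfunctions = [(X[i::N], y[i::N]) for i in range(N)]

theta_init = np.zeros(50)
# Constructor and optimize() signature assumed from the repository README -- check there.
optimizer = SFO(f_df, theta_init, subfunctions)
theta = optimizer.optimize(num_passes=20)         # roughly 20 effective passes through the data
```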
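For the Experiment Setup row, the sketch below spells out the SGD+momentum hyperparameter grid described in the paper. Only the grid values come from the paper (momenta 0.5, 0.9, 0.95, 0.99; step lengths at integer powers of ten from 10⁻⁵ to 10²); the least-squares objective, minibatch construction, and training loop are invented placeholders for illustration, not the paper's models.

```python
import itertools
import numpy as np

# Invented least-squares problem, split into N = 100 minibatches for illustration.
rng = np.random.RandomState(0)
X, w_true = rng.randn(10000, 20), rng.randn(20)
y = X @ w_true + 0.1 * rng.randn(10000)
N = 100
batches = [(X[i::N], y[i::N]) for i in range(N)]

def train_sgd_momentum(step_length, momentum, passes=5):
    """One SGD+momentum run over the minibatches; returns the final full-data MSE."""
    w, v = np.zeros(20), np.zeros(20)
    with np.errstate(all="ignore"):               # large step lengths may diverge
        for _ in range(passes):
            for Xb, yb in batches:
                grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)
                v = momentum * v - step_length * grad
                w = w + v
        return float(np.mean((X @ w - y) ** 2))

# Grid values from the paper: four momenta, step lengths at integer powers of ten.
momenta = [0.5, 0.9, 0.95, 0.99]
step_lengths = [10.0 ** p for p in range(-5, 3)]  # 1e-5 ... 1e2

results = []
for momentum, step in itertools.product(momenta, step_lengths):
    loss = train_sgd_momentum(step, momentum)
    if np.isfinite(loss):                         # discard diverged runs
        results.append((loss, (momentum, step)))

best_loss, best_params = min(results)
print("best (momentum, step length):", best_params, "final MSE:", best_loss)
```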