A Progressive Batching L-BFGS Method for Machine Learning

Authors: Raghu Bollapragada, Jorge Nocedal, Dheevatsa Mudigere, Hao-Jun Shi, Ping Tak Peter Tang

ICML 2018 | Conference PDF | Archive PDF

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We report numerical tests on large-scale logistic regression and deep neural network training tasks that indicate that our method is robust and efficient, and has good generalization properties.
Researcher Affiliation | Collaboration | (1) Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, USA; (2) Intel Corporation, Bangalore, India; (3) Intel Corporation, Santa Clara, CA, USA.
Pseudocode | Yes | Algorithm 1 (Progressive Batching L-BFGS Method). Input: initial iterate x_0, initial sample size |S_0|; Initialization: set k ← 0; Repeat until convergence: 1: Sample S_k ⊆ {1, ..., N} with sample size |S_k|. (A sketch of this outer loop appears after the table.)
Open Source Code | No | The paper does not provide explicit statements or links indicating that the source code for their methodology is publicly available.
Open Datasets | Yes | We consider the 8 datasets listed in the supplement. An approximation R* of the optimal function value is computed for each problem by running the full batch L-BFGS method until ‖∇R(x_k)‖ ≤ 10^{-8}. Training error is defined as R(x_k) − R*, where R(x_k) is evaluated over the training set; test loss is evaluated over the test set without the ℓ2 regularization term. ... (i) a small convolutional neural network on CIFAR-10 (C) (Krizhevsky, 2009), (ii) an AlexNet-like convolutional network on MNIST and CIFAR-10 (A1, A2, respectively) (LeCun et al., 1998; Krizhevsky et al., 2012)
Dataset Splits | Yes | Training error is defined as R(x_k) − R*, where R(x_k) is evaluated over the training set; test loss is evaluated over the test set without the ℓ2 regularization term. ... SG and Adam are tuned using a development-based decay (dev-decay) scheme, which tracks the best validation loss at each epoch and reduces the steplength by a constant factor δ if the validation loss does not improve after e epochs. (See the dev-decay sketch after the table.)
Hardware Specification | No | The paper does not provide specific details on the hardware (e.g., CPU or GPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions that the networks were implemented in PyTorch but does not provide a specific version number for it or for other software dependencies.
Experiment Setup | Yes | For the batch size control test (7), we choose θ = 0.9 in the logistic regression experiments, and θ is a tunable parameter chosen in the interval [0.9, 3] in the neural network experiments. The constant c_1 in (16) is set to c_1 = 10^{-4}. For L-BFGS, we set the memory as m = 10. We skip the quasi-Newton update if the following curvature condition is not satisfied: y_k^T s_k > ϵ ‖s_k‖^2, with ϵ = 10^{-2}. The initial Hessian matrix H_k^0 in the L-BFGS recursion at each iteration is chosen as γ_k I, where γ_k = y_k^T s_k / y_k^T y_k. ... In all our experiments, we initialize the batch size as |S_0| = 512 in the PBQN method, and fix the batch size to |S_k| = 128 for SG and Adam. (A two-loop recursion sketch using these L-BFGS settings appears after the table.)
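
The Experiment Setup row quotes the paper's L-BFGS settings: memory m = 10, the curvature-based skip condition, and the initial scaling H_k^0 = γ_k I with γ_k = y_k^T s_k / y_k^T y_k. As context, here is a minimal sketch of the standard two-loop recursion using that scaling; it is a generic textbook implementation, not the authors' code, and the function name lbfgs_direction is ours.

```python
import numpy as np

def lbfgs_direction(g, pairs):
    """Standard L-BFGS two-loop recursion.

    pairs is a list of (s, y) curvature pairs, most recent last (memory m).
    The initial Hessian approximation is H_k^0 = gamma_k * I with
    gamma_k = (y^T s) / (y^T y) from the most recent pair, matching the
    settings quoted above.  Returns the search direction p_k = -H_k g_k.
    """
    q = g.copy()
    alphas, rhos = [], []
    for s, y in reversed(pairs):              # first loop: newest to oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    if pairs:
        s_last, y_last = pairs[-1]
        gamma = (y_last @ s_last) / (y_last @ y_last)   # gamma_k
    else:
        gamma = 1.0                            # no pairs yet: gradient step
    r = gamma * q                              # apply H_k^0 = gamma_k * I
    for (s, y), alpha, rho in zip(pairs, reversed(alphas), reversed(rhos)):
        beta = rho * (y @ r)                   # second loop: oldest to newest
        r += (alpha - beta) * s
    return -r
```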
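The Algorithm 1 excerpt in the Pseudocode row is heavily truncated. The sketch below fills in a plausible outer loop around lbfgs_direction from the previous block, under labeled assumptions: the batch-growth rule is an assumed variance-based form of the paper's inner product test (7), which is not reproduced in the excerpt; batch doubling and the fixed step length alpha stand in for the paper's batch-size formula and line search; per_sample_grads is a user-supplied placeholder.

```python
import numpy as np

def progressive_batching_lbfgs(x0, per_sample_grads, N, theta=0.9,
                               initial_batch=512, memory=10, eps=1e-2,
                               alpha=1.0, max_iters=100):
    """Sketch of a progressive batching L-BFGS outer loop.

    per_sample_grads(x, idx) is assumed to return an array of shape
    (len(idx), dim) with one gradient per sample in the batch S_k.
    """
    x = x0.copy()
    batch = initial_batch                      # |S_0| = 512 as quoted above
    pairs = []                                 # stored (s, y) pairs
    rng = np.random.default_rng(0)
    for _ in range(max_iters):
        idx = rng.choice(N, size=min(batch, N), replace=False)
        G = per_sample_grads(x, idx)
        g = G.mean(axis=0)                     # batch gradient g_k
        # Assumed inner product test: grow |S_k| if the sample variance of
        # grad_F_i^T g_k, divided by |S_k|, exceeds theta^2 * ||g_k||^4.
        inner = G @ g
        if inner.var(ddof=1) / len(idx) > theta ** 2 * (g @ g) ** 2:
            batch = min(2 * batch, N)          # doubling is an assumption
        p = lbfgs_direction(g, pairs)          # from the previous sketch
        s = alpha * p
        x_new = x + s
        y = per_sample_grads(x_new, idx).mean(axis=0) - g
        if y @ s > eps * (s @ s):              # curvature skip condition
            pairs.append((s, y))
            pairs = pairs[-memory:]            # keep at most m pairs
        x = x_new
    return x
```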
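The dev-decay tuning rule quoted in the Dataset Splits row (reduce the steplength by a constant factor δ if the validation loss does not improve after e epochs) is simple enough to sketch directly. The class below is an illustration only; the default values of delta and patience are arbitrary, not taken from the paper.

```python
class DevDecay:
    """Steplength schedule sketch: track the best validation loss seen so far
    and multiply the steplength by delta if it has not improved for
    `patience` consecutive epochs (the paper's 'e')."""

    def __init__(self, steplength, delta=0.5, patience=1):
        self.steplength = steplength
        self.delta = delta
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the current validation loss."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.steplength *= self.delta
                self.bad_epochs = 0
        return self.steplength
```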