Practical Quasi-Newton Methods for Training Deep Neural Networks

Authors: Donald Goldfarb, Yi Ren, Achraf Bahamou

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods. (See also Section 6, Experiments.)
Researcher Affiliation | Academia | Donald Goldfarb, Yi Ren, Achraf Bahamou, Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, {goldfarb, yr2322, ab4689}@columbia.edu
Pseudocode | Yes | Algorithm 1 (Forward and backward pass of DNN for a single data-point) and Algorithm 2 (High-level summary of K-BFGS / K-BFGS(L)); see the illustrative sketch after this table.
Open Source Code | Yes | Code is available at https://github.com/renyiryry/kbfgs_neurips2020_public.
Open Datasets | Yes | We tested K-BFGS and K-BFGS(L), as well as KFAC, Adam/RMSprop and SGD-m (SGD with momentum) on three autoencoder problems, namely, MNIST [25], FACES, and CURVES, which are used in e.g. [22, 29, 30], except that we replaced the sigmoid activation with ReLU.
Dataset Splits | No | After each epoch, the training loss/testing error from the whole training/testing set is reported (the time for computing this loss is not included in the plots).
Hardware Specification | Yes | Results were obtained on a machine with 8 x Intel(R) Xeon(R) CPU @ 2.30GHz and 1 x NVIDIA Tesla P100.
Software Dependencies | No | The paper does not list ancillary software with version numbers (e.g., PyTorch or TensorFlow versions).
Experiment Setup | Yes | To obtain the results in Figure 1, we first did a grid-search on (learning rate, damping) pairs for all algorithms (except for SGD-m, whose grid-search was only on learning rate), where damping refers to λ for K-BFGS/K-BFGS(L)/KFAC, and ϵ for RMSprop/Adam. We then selected the best (learning rate, damping) pairs with the lowest training loss encountered. The range for the grid-search and the best HP values (as well as other fixed HP values) are listed in Section D in the appendix.
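The Pseudocode row above refers to Algorithm 2, the paper's high-level summary of K-BFGS / K-BFGS(L), which maintains a Kronecker-factored inverse-curvature approximation per layer and updates its factors with BFGS (limited-memory BFGS for one factor in K-BFGS(L)). The following is a minimal NumPy sketch of the two building blocks involved, not the authors' implementation: the function names, shapes, and the skip-based safeguard (the paper instead uses a damping scheme, consistent with the λ damping hyper-parameter mentioned in the Experiment Setup row) are assumptions made for illustration.

```python
import numpy as np

def bfgs_inverse_update(H, s, y, eps=1e-8):
    """One standard BFGS update of an inverse-curvature approximation H from
    a pair (s, y). The update is skipped when the curvature condition
    y^T s > 0 fails numerically, so H stays symmetric positive definite
    (the paper uses damping instead of skipping)."""
    sy = float(y @ s)
    if sy <= eps:
        return H  # skip: too little positive curvature in this pair
    rho = 1.0 / sy
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def kronecker_preconditioned_grad(dW, H_a, H_g):
    """Apply a Kronecker-factored inverse-curvature approximation H_a ⊗ H_g
    to a layer gradient dW of shape (out_dim, in_dim). For symmetric factors
    and column-stacking vec, (H_a ⊗ H_g) vec(dW) = vec(H_g @ dW @ H_a)."""
    return H_g @ dW @ H_a
```

In this sketch the two Kronecker factors would each be maintained by BFGS-style updates such as `bfgs_inverse_update`, and the preconditioned gradient from `kronecker_preconditioned_grad` would then be scaled by the learning rate to take a step; the actual curvature pairs, damping, and momentum handling are specified in the paper and its repository.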
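The hyper-parameter protocol quoted in the Experiment Setup row amounts to an exhaustive grid search over (learning rate, damping) pairs, keeping the pair with the lowest training loss encountered. Below is a minimal sketch of that selection loop; the grid values and the `train_run` helper are placeholders for illustration only, not the ranges or code from the paper (those are in its Appendix D and the linked repository).

```python
import itertools

def train_run(lr, damping):
    """Placeholder for one training run at a given hyper-parameter setting.
    A real run would train the autoencoder for a fixed number of epochs and
    return the lowest training loss encountered (the paper's selection
    criterion); a dummy value keeps this sketch runnable."""
    return (lr - 1e-3) ** 2 + (damping - 1e-2) ** 2  # stand-in, not a real loss

# Hypothetical grids; the ranges actually searched are listed in Appendix D.
learning_rates = [3e-4, 1e-3, 3e-3, 1e-2]
dampings = [1e-4, 1e-3, 1e-2, 1e-1]

best_loss, best_hp = float("inf"), None
for lr, lam in itertools.product(learning_rates, dampings):
    loss = train_run(lr, lam)
    if loss < best_loss:
        best_loss, best_hp = loss, (lr, lam)

print(f"best (learning rate, damping) = {best_hp}, training loss = {best_loss:.4g}")
```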