Practical Quasi-Newton Methods for Training Deep Neural Networks
Authors: Donald Goldfarb, Yi Ren, Achraf Bahamou
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods. (See also Section 6, Experiments.) |
| Researcher Affiliation | Academia | Donald Goldfarb, Yi Ren, Achraf Bahamou Department of Industrial Engineering and Operations Research Columbia University New York, NY 10027 {goldfarb, yr2322, ab4689}@columbia.edu |
| Pseudocode | Yes | Algorithm 1 (Forward and backward pass of DNN for a single data-point) and Algorithm 2 (High-level summary of K-BFGS / K-BFGS(L)); a generic BFGS-update sketch follows the table. |
| Open Source Code | Yes | Code is available at https://github.com/renyiryry/kbfgs_neurips2020_public. |
| Open Datasets | Yes | We tested K-BFGS and K-BFGS(L), as well as KFAC, Adam/RMSprop and SGD-m (SGD with momentum) on three autoencoder problems, namely, MNIST [25], FACES, and CURVES, which are used in e.g. [22, 29, 30], except that we replaced the sigmoid activation with ReLU. |
| Dataset Splits | No | After each epoch, the training loss/testing error from the whole training/testing set is reported (the time for computing this loss is not included in the plots). |
| Hardware Specification | Yes | Results were obtained on a machine with 8 x Intel(R) Xeon(R) CPU @ 2.30GHz and 1 x NVIDIA Tesla P100. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers (e.g., library names like PyTorch or TensorFlow with their specific versions). |
| Experiment Setup | Yes | To obtain the results in Figure 1, we first did a grid-search on (learning rate, damping) pairs for all algorithms (except for SGD-m, whose grid-search was only on learning rate), where damping refers to λ for K-BFGS/K-BFGS(L)/KFAC, and ϵ for RMSprop/Adam. We then selected the best (learning rate, damping) pairs with the lowest training loss encountered. The range for the grid-search and the best HP values (as well as other fixed HP values) are listed in Section D in the appendix. (A minimal hyperparameter grid-search sketch also follows the table.) |
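
The Pseudocode row above points to Algorithm 2, the paper's high-level summary of K-BFGS / K-BFGS(L), which applies BFGS-style updates to layer-wise inverse-Hessian factors. As a hedged illustration only, the snippet below shows the generic BFGS inverse-Hessian update that such methods build on; it is not the paper's Kronecker-factored, damped scheme, and the `bfgs_inverse_update` helper and the quadratic demo values are assumptions made for this sketch.

```python
# Generic BFGS inverse-Hessian update, shown only as the building block that
# quasi-Newton methods like K-BFGS / K-BFGS(L) apply layer-wise. This is NOT the
# paper's Kronecker-factored scheme; its damping and Hessian-action details are omitted.
import numpy as np

def bfgs_inverse_update(H, s, y):
    """Return the updated inverse-Hessian approximation.

    H : current inverse-Hessian approximation, shape (n, n)
    s : parameter step x_{k+1} - x_k, shape (n,)
    y : gradient difference grad_{k+1} - grad_k, shape (n,)
    """
    rho = 1.0 / (y @ s)                     # curvature condition requires y @ s > 0
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Tiny demo on a quadratic f(x) = 0.5 * x^T A x, whose exact inverse Hessian is A^{-1}.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
grad = lambda x: A @ x
x0, x1 = np.array([1.0, 1.0]), np.array([0.5, 0.8])
H = bfgs_inverse_update(np.eye(2), x1 - x0, grad(x1) - grad(x0))
print(H)
```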
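
The Experiment Setup row describes a grid search over (learning rate, damping) pairs, keeping the pair that reaches the lowest training loss encountered. Below is a minimal, self-contained sketch of that procedure under stated assumptions: the toy autoencoder, the random data, the grids, and the `run_training` routine are hypothetical stand-ins rather than the paper's code, and the actual search ranges are listed in the paper's appendix Section D.

```python
# Minimal sketch (PyTorch; not the authors' code) of the grid search described in the
# Experiment Setup row: sweep (learning rate, damping) pairs and keep the pair whose
# run reaches the lowest training loss encountered. Here "damping" plays the role of
# Adam's eps; the grids and toy data below are assumptions, not the paper's values.
import itertools
import torch
import torch.nn as nn

torch.manual_seed(0)
data = torch.rand(256, 784)                      # stand-in for MNIST-like inputs

def run_training(lr, damping, epochs=5):
    """Train a tiny autoencoder and return the lowest training loss encountered."""
    model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))
    opt = torch.optim.Adam(model.parameters(), lr=lr, eps=damping)
    loss_fn = nn.MSELoss()
    best_loss = float("inf")
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(data), data)
        loss.backward()
        opt.step()
        best_loss = min(best_loss, loss.item())
    return best_loss

learning_rates = [1e-4, 3e-4, 1e-3]              # assumed grid
dampings = [1e-8, 1e-4, 1e-2]                    # assumed grid (eps for Adam)

best = min(
    ((run_training(lr, d), lr, d) for lr, d in itertools.product(learning_rates, dampings)),
    key=lambda t: t[0],
)
print("best (training loss, learning rate, damping):", best)
```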