Practical Quasi-Newton Methods for Training Deep Neural Networks

Authors: Donald Goldfarb, Yi Ren, Achraf Bahamou

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In tests on autoencoder feed-forward neural network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods. (See also Section 6, Experiments.)
Researcher Affiliation | Academia | Donald Goldfarb, Yi Ren, Achraf Bahamou, Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, {goldfarb, yr2322, ab4689}@columbia.edu
Pseudocode | Yes | Algorithm 1 (Forward and backward pass of DNN for a single data-point) and Algorithm 2 (High-level summary of K-BFGS / K-BFGS(L)); see the illustrative sketch after this table.
Open Source Code | Yes | Code is available at https://github.com/renyiryry/kbfgs_neurips2020_public.
Open Datasets | Yes | We tested K-BFGS and K-BFGS(L), as well as KFAC, Adam/RMSprop and SGD-m (SGD with momentum) on three autoencoder problems, namely, MNIST [25], FACES, and CURVES, which are used in e.g. [22, 29, 30], except that we replaced the sigmoid activation with ReLU.
Dataset Splits | No | After each epoch, the training loss/testing error from the whole training/testing set is reported (the time for computing this loss is not included in the plots).
Hardware Specification | Yes | Results were obtained on a machine with 8 x Intel(R) Xeon(R) CPU @ 2.30GHz and 1 x NVIDIA Tesla P100.
Software Dependencies | No | The paper does not list ancillary software with version numbers (e.g., PyTorch or TensorFlow versions).
Experiment Setup | Yes | To obtain the results in Figure 1, we first did a grid-search on (learning rate, damping) pairs for all algorithms (except for SGD-m, whose grid-search was only on learning rate), where damping refers to λ for K-BFGS/K-BFGS(L)/KFAC, and ϵ for RMSprop/Adam. We then selected the best (learning rate, damping) pairs with the lowest training loss encountered. The range for the grid-search and the best HP values (as well as other fixed HP values) are listed in Section D in the appendix.
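The Pseudocode row above refers to Algorithm 2, the paper's high-level summary of K-BFGS / K-BFGS(L), which maintains a Kronecker-factored inverse-curvature approximation per layer and updates its factors with BFGS (limited-memory BFGS for one factor in K-BFGS(L)). The following is a minimal NumPy sketch of the two building blocks involved, not the authors' implementation: the function names, shapes, and the skip-based safeguard (the paper instead uses a damping scheme, consistent with the λ damping hyper-parameter mentioned in the Experiment Setup row) are assumptions made for illustration.

```python
import numpy as np

def bfgs_inverse_update(H, s, y, eps=1e-8):
    """One standard BFGS update of an inverse-curvature approximation H from
    a pair (s, y). The update is skipped when the curvature condition
    y^T s > 0 fails numerically, so H stays symmetric positive definite
    (the paper uses damping instead of skipping)."""
    sy = float(y @ s)
    if sy <= eps:
        return H  # skip: too little positive curvature in this pair
    rho = 1.0 / sy
    I = np.eye(H.shape[0])
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def kronecker_preconditioned_grad(dW, H_a, H_g):
    """Apply a Kronecker-factored inverse-curvature approximation H_a ⊗ H_g
    to a layer gradient dW of shape (out_dim, in_dim). For symmetric factors
    and column-stacking vec, (H_a ⊗ H_g) vec(dW) = vec(H_g @ dW @ H_a)."""
    return H_g @ dW @ H_a
```

In this sketch the two Kronecker factors would each be maintained by BFGS-style updates such as `bfgs_inverse_update`, and the preconditioned gradient from `kronecker_preconditioned_grad` would then be scaled by the learning rate to take a step; the actual curvature pairs, damping, and momentum handling are specified in the paper and its repository.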
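The hyper-parameter protocol quoted in the Experiment Setup row amounts to an exhaustive grid search over (learning rate, damping) pairs, keeping the pair with the lowest training loss encountered. Below is a minimal sketch of that selection loop; the grid values and the `train_run` helper are placeholders for illustration only, not the ranges or code from the paper (those are in its Appendix D and the linked repository).

```python
import itertools

def train_run(lr, damping):
    """Placeholder for one training run at a given hyper-parameter setting.
    A real run would train the autoencoder for a fixed number of epochs and
    return the lowest training loss encountered (the paper's selection
    criterion); a dummy value keeps this sketch runnable."""
    return (lr - 1e-3) ** 2 + (damping - 1e-2) ** 2  # stand-in, not a real loss

# Hypothetical grids; the ranges actually searched are listed in Appendix D.
learning_rates = [3e-4, 1e-3, 3e-3, 1e-2]
dampings = [1e-4, 1e-3, 1e-2, 1e-1]

best_loss, best_hp = float("inf"), None
for lr, lam in itertools.product(learning_rates, dampings):
    loss = train_run(lr, lam)
    if loss < best_loss:
        best_loss, best_hp = loss, (lr, lam)

print(f"best (learning rate, damping) = {best_hp}, training loss = {best_loss:.4g}")
```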