Distributed Second-Order Optimization using Kronecker-Factored Approximations

Authors: Jimmy Ba, Roger Grosse, James Martens

ICLR 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experimentally evaluated distributed K-FAC on several large convolutional neural network training tasks involving the CIFAR-10 and ImageNet classification datasets.
Researcher Affiliation | Collaboration | Jimmy Ba (University of Toronto, jimmy@psi.toronto.edu); Roger Grosse (University of Toronto, rgrosse@cs.toronto.edu); James Martens (University of Toronto and Google DeepMind, jmartens@cs.toronto.edu)
Pseudocode | No | No pseudocode or algorithm blocks were found.
Open Source Code | No | The paper states: 'We provide a TensorFlow implementation of our approach which is easy to use and can be applied to many existing codebases without modification.' However, it does not provide a specific link to the source code or an unambiguous statement of public release.
Open Datasets | Yes | We experimentally evaluated distributed K-FAC on several large convolutional neural network training tasks involving the CIFAR-10 and ImageNet classification datasets. (Krizhevsky and Hinton, 2009; Russakovsky et al., 2015)
Dataset Splits | No | The paper mentions 'validation curves' and 'validation error' in its figures and discussions (e.g., 'the validation error is often lower than the training error during the first 90% of training'), indicating the use of a validation set, but it does not give the set's size, the percentage split, or how the split was formed.
Hardware Specification | Yes | Due to computational resource constraints, we used a single GPU server with 8 NVIDIA K80 GPUs to simulate a large distributed system. The GPUs were used as gradient workers... with the CPUs acting as a parameter server. The Fisher block inversions were performed on the CPUs in parallel, using as many threads as possible. ... our 16-core Xeon 2.2 GHz CPU.
Software Dependencies | No | The paper states 'We chose to base our implementation of distributed K-FAC on the TensorFlow framework (Abadi et al., 2016)' but does not specify version numbers for TensorFlow or any other software dependency.
Experiment Setup | Yes | Meta-parameters such as learning rates, damping parameters, and the decay-rate for the second-order statistics were optimized carefully by hand for each method. The momentum was fixed to 0.9. ... All the CIFAR-10 experiments use a mini-batch size of 512. ... we used the KL-based step size selection method described in Section 5 with parameters c0 = 0.01 and ζ = 0.96. The SGD baselines use an exponential learning rate decay schedule with a decay rate of 0.96. Decaying is applied after each half-epoch for distributed K-FAC and SGD+Batch Normalization, and after every two epochs for plain SGD...
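The decay schedule quoted in the Experiment Setup row can be made concrete with a short sketch. The snippet below is a hedged illustration, not the authors' implementation: only the decay rate (0.96), the mini-batch size (512), and the half-epoch / two-epoch decay intervals are taken from the paper; the base learning rates, the CIFAR-10 epoch-size arithmetic, and the helper name decayed_value are assumptions made for the example.

```python
# Hedged sketch (not the authors' code): a staircase exponential decay of the
# kind described in the Experiment Setup row. Only the decay rate (0.96), the
# mini-batch size (512), and the half-epoch / two-epoch decay intervals come
# from the paper; base learning rates and the helper name are illustrative.

def decayed_value(base, decay_rate, steps_per_interval, global_step):
    """Return base * decay_rate ** floor(global_step / steps_per_interval)."""
    return base * decay_rate ** (global_step // steps_per_interval)

# CIFAR-10 with mini-batch size 512: roughly 50000 / 512 ~ 97 updates per epoch.
steps_per_epoch = 50000 // 512

# SGD+Batch Normalization baseline: decay applied every half-epoch
# (base learning rate of 0.1 is an assumed placeholder).
lr_sgd_bn = decayed_value(0.1, 0.96, steps_per_epoch // 2, global_step=1000)

# Plain SGD baseline: decay applied every two epochs (base rate again assumed).
lr_sgd = decayed_value(0.1, 0.96, steps_per_epoch * 2, global_step=1000)
```

For distributed K-FAC itself, the quoted setup suggests the half-epoch decay is applied through the KL-based step-size selection (c0 = 0.01, ζ = 0.96) rather than a hand-set learning rate, but the paper's quoted text does not spell out that mechanism, so the sketch above only covers the SGD-style schedules.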