Distributed Second-Order Optimization using Kronecker-Factored Approximations
Authors: Jimmy Ba, Roger Grosse, James Martens
ICLR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experimentally evaluated distributed K-FAC on several large convolutional neural network training tasks involving the CIFAR-10 and ImageNet classification datasets. |
| Researcher Affiliation | Collaboration | Jimmy Ba (University of Toronto, jimmy@psi.toronto.edu); Roger Grosse (University of Toronto, rgrosse@cs.toronto.edu); James Martens (University of Toronto and Google DeepMind, jmartens@cs.toronto.edu) |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | No | The paper states: 'We provide a TensorFlow implementation of our approach which is easy to use and can be applied to many existing codebases without modification.' However, it does not provide a link to the source code or an unambiguous statement of public release. |
| Open Datasets | Yes | We experimentally evaluated distributed K-FAC on several large convolutional neural network training tasks involving the CIFAR-10 and ImageNet classification datasets (Krizhevsky and Hinton, 2009; Russakovsky et al., 2015). |
| Dataset Splits | No | The paper mentions 'validation curves' and 'validation error' in its figures and discussions (e.g., 'the validation error is often lower than the training error during the first 90% of training'), indicating the use of a validation set, but does not provide specific details on its size, percentage split, or how it was formed. |
| Hardware Specification | Yes | Due to computational resource constraints, we used a single GPU server with 8 Nvidia K80 GPUs to simulate a large distributed system. The GPUs were used as gradient workers... with the CPUs acting as a parameter server. The Fisher block inversions were performed on the CPUs in parallel, using as many threads as possible. ... our 16-core Xeon 2.2 GHz CPU. (A sketch of this asynchronous inversion pattern appears after the table.) |
| Software Dependencies | No | The paper states 'We chose to base our implementation of distributed K-FAC on the TensorFlow framework (Abadi et al., 2016)' but does not specify any version numbers for TensorFlow or other software dependencies. |
| Experiment Setup | Yes | Meta-parameters such as learning rates, damping parameters, and the decay rate for the second-order statistics were optimized carefully by hand for each method. The momentum was fixed to 0.9. ... All the CIFAR-10 experiments use a mini-batch size of 512. ... we used the KL-based step size selection method described in Section 5 with parameters c0 = 0.01 and ζ = 0.96. The SGD baselines use an exponential learning rate decay schedule with a decay rate of 0.96. Decaying is applied after each half-epoch for distributed K-FAC and SGD+Batch Normalization, and after every two epochs for plain SGD. (A hedged sketch of these schedules appears below the table.) |
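
To make the quoted setup concrete, the following is a minimal Python sketch of how the reported numbers fit together: an exponential learning-rate decay with rate 0.96 (applied every half-epoch or every two epochs, depending on the method), and a KL-based rescaling of the update whose threshold decays as c_k = c0 · ζ^k with c0 = 0.01 and ζ = 0.96. The rescaling formula is one reading of the method the paper cites as Section 5, not a verbatim reproduction of the authors' code, and the names `decayed_lr` and `clip_update_by_kl` are illustrative.

```python
import numpy as np

def decayed_lr(base_lr, decay_rate, step, steps_per_decay):
    """Exponential learning-rate decay; the paper reports a decay rate of 0.96,
    applied every half-epoch (K-FAC, SGD+BN) or every two epochs (plain SGD)."""
    return base_lr * decay_rate ** (step // steps_per_decay)

def clip_update_by_kl(update, fisher_vector_product, k, c0=0.01, zeta=0.96):
    """Rescale `update` so that its approximate Fisher norm update^T F update
    stays below c_k = c0 * zeta^k. This is a hedged reading of the KL-based
    step-size selection referenced in the paper, not its exact formula."""
    c_k = c0 * zeta ** k
    quad = float(update @ fisher_vector_product(update))  # update^T F update
    if quad > c_k:
        update = update * np.sqrt(c_k / quad)
    return update

# Toy usage with a diagonal Fisher approximation (purely illustrative).
rng = np.random.default_rng(0)
fisher_diag = rng.uniform(0.5, 2.0, size=10)
proposed = rng.normal(size=10)
for k in range(3):
    lr = decayed_lr(base_lr=0.1, decay_rate=0.96, step=k, steps_per_decay=1)
    v = clip_update_by_kl(lr * proposed, lambda u: fisher_diag * u, k=k)
    print(k, round(lr, 4), round(float(v @ (fisher_diag * v)), 4))
```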
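
The hardware row describes the GPUs acting as gradient workers while the CPUs, acting as a parameter server, invert the Fisher blocks in parallel threads. Below is a self-contained sketch of that asynchronous pattern, assuming plain NumPy Kronecker factors and that training may proceed with stale inverses until a fresh inversion completes; the class `AsyncInverter` and its methods are hypothetical names chosen for illustration, not the paper's TensorFlow implementation.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

class AsyncInverter:
    """Keep the latest inverse of each (damped) Kronecker factor, recomputing
    them on a CPU thread pool while the training loop runs. Illustrative
    sketch only; the paper's implementation is built on TensorFlow."""

    def __init__(self, num_threads, damping=1e-3):
        self.pool = ThreadPoolExecutor(max_workers=num_threads)
        self.damping = damping
        self.inverses = {}  # block name -> most recently computed inverse
        self.pending = {}   # block name -> in-flight future

    def submit_factor(self, name, factor):
        """Queue an inversion unless one for this block is already running."""
        if name in self.pending and not self.pending[name].done():
            return  # keep using the stale inverse until the fresh one lands
        damped = factor + self.damping * np.eye(factor.shape[0])
        future = self.pool.submit(np.linalg.inv, damped)
        future.add_done_callback(lambda f, n=name: self._store(n, f))
        self.pending[name] = future

    def _store(self, name, future):
        self.inverses[name] = future.result()

    def latest_inverse(self, name):
        """Returns None until the first inversion finishes; the caller can
        fall back to a plain gradient step in that case."""
        return self.inverses.get(name)

# Toy usage: one "layer" whose input-covariance factor is re-estimated each step.
inverter = AsyncInverter(num_threads=4)
rng = np.random.default_rng(0)
for step in range(5):
    a = rng.normal(size=(64, 32))           # stand-in for a layer's activations
    inverter.submit_factor("layer0/input", a.T @ a / a.shape[0])
    print(step, "inverse ready:", inverter.latest_inverse("layer0/input") is not None)
```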