Information Geometry of Orthogonal Initializations and Training

Authors: Piotr Aleksander Sokół, Il Memming Park

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first verify the bound derived in Section 3 for networks with random orthogonal weights. We then numerically investigate the behavior of the maximum FIM eigenvalue during training, paying particular attention to the possible benefits of maintaining orthogonality or near-orthogonality during optimization, relative to unconstrained networks. Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN, set the width of each layer to N = 400, and chose σ_W, σ_b so that the mean singular value of the input-output Jacobian concentrates on 1 while s²_max varies as a function of q (see Fig. 2). We considered four different critical initializations, q = 10⁻⁴, 1/64, 1/2, 8, which differ both in the spread of the singular values and in the resulting training speed and final test accuracy, as reported by Pennington et al. (2017).
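To make the quoted setup concrete, here is a minimal NumPy sketch (not the authors' code) of the quantity being verified: the singular-value spectrum of the input-output Jacobian of a deep tanh network with scaled orthogonal weights. The width and depth match the quoted N = 400, 200-layer setting; the σ_W, σ_b values are illustrative stand-ins, since the critical pair depends on q.

```python
# Sketch: singular values of the input-output Jacobian of a deep tanh
# network with random orthogonal weights (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix so Q is Haar-distributed

N, L = 400, 200                  # layer width and depth, as in the paper
sigma_w, sigma_b = 1.05, 0.025   # hypothetical near-critical values

x = rng.standard_normal(N)
J = np.eye(N)
for _ in range(L):
    W = sigma_w * random_orthogonal(N)
    b = sigma_b * rng.standard_normal(N)
    x = np.tanh(W @ x + b)
    D = np.diag(1.0 - x**2)      # tanh'(h) = 1 - tanh(h)^2
    J = D @ W @ J                # chain rule: layer-wise Jacobian product

s = np.linalg.svd(J, compute_uv=False)
print(f"mean squared singular value: {np.mean(s**2):.3f}, s_max^2: {s[0]**2:.3f}")
```

At a critical initialization the printed mean squared singular value should concentrate near 1, while s²_max varies with q, i.e. with the spread of the singular-value distribution.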
Researcher Affiliation | Academia | Piotr Aleksander Sokół and Il Memming Park, Department of Neurobiology and Behavior; Departments of Applied Mathematics and Statistics, and Electrical and Computer Engineering; Institutes for Advanced Computing Science and AI-driven Discovery and Innovation, Stony Brook University, Stony Brook, NY 11733. {memming.park, piotr.sokol}@stonybrook.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at: https://github.com/PiotrSokol/info-geom
Open Datasets | Yes | Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN.
Dataset Splits | No | The paper mentions 'maximizing validation set accuracy' and discusses 'training loss' and 'test accuracy', but it does not specify exact percentages or sample counts for the training, validation, or test splits, nor does it reference predefined splits with citations or describe a splitting methodology.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions using 'a Riemannian version of ADAM (Kingma & Ba, 2015)' but does not provide specific version numbers for any software, libraries, or frameworks used.
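The 'Riemannian version of ADAM' is only named, not specified. As a point of reference, the sketch below shows the manifold machinery a plain Riemannian gradient step would use on the oblique manifold (matrices with unit-norm columns), the constraint set of the paper's Oblique networks. The function name and the renormalization retraction are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one Riemannian gradient step on the oblique manifold
# (unit-norm columns); a Riemannian ADAM would apply its moment
# updates to the projected gradient before the retraction.
import numpy as np

def oblique_step(W, G, lr=1e-2):
    # Tangent projection: remove each column's gradient component
    # along the column itself.
    proj = G - W * np.sum(W * G, axis=0, keepdims=True)
    W_new = W - lr * proj
    # Retraction: renormalize columns back onto the unit sphere.
    return W_new / np.linalg.norm(W_new, axis=0, keepdims=True)
```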
Experiment Setup | Yes | Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN, set the width of each layer to N = 400, and chose σ_W, σ_b so that the mean singular value of the input-output Jacobian concentrates on 1 while s²_max varies as a function of q (see Fig. 2). We considered four different critical initializations, q = 10⁻⁴, 1/64, 1/2, 8, which differ both in the spread of the singular values and in the resulting training speed and final test accuracy, as reported by Pennington et al. (2017). (...) All networks were trained with a minibatch size of 1000. (...) The initial learning rates for all groups, as well as the non-orthogonality penalty (see Eq. 43) for Oblique networks, were chosen via Bayesian optimization, maximizing validation set accuracy after 50 epochs.
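The non-orthogonality penalty itself is not reproduced in this table; its exact form is the paper's Eq. 43. A common soft-orthogonality regularizer of this kind, shown here as a hedged sketch, penalizes the Frobenius deviation of WᵀW from the identity. The name `orthogonality_penalty` and the weight `lam` are hypothetical.

```python
# Sketch of a soft (non-)orthogonality penalty of the kind the setup
# describes: lam * ||W.T @ W - I||_F^2 (assumed form; see the paper's Eq. 43).
import numpy as np

def orthogonality_penalty(W, lam=1e-3):
    n = W.shape[1]
    delta = W.T @ W - np.eye(n)   # deviation from orthogonality
    return lam * np.sum(delta**2) # squared Frobenius norm, weighted by lam
```

In the quoted setup, this penalty weight and the initial learning rates were the quantities tuned jointly by Bayesian optimization against validation accuracy after 50 epochs.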