Information Geometry of Orthogonal Initializations and Training
Authors: Piotr Aleksander Sokół, Il Memming Park
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We first verify the bound derived in Section 3 for networks with random orthogonal weights. We then numerically investigate the behavior of the maximum FIM eigenvalue during training, with particular attention paid to the possible benefits of maintaining orthogonality or near-orthogonality during optimization, relative to unconstrained networks. Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN, set the width of each layer to N = 400, and chose σ_W, σ_b so that the mean singular value of the input-output Jacobian concentrates on 1 while s²_max varies as a function of q (see Fig. 2). We considered four different critical initializations with q = 10⁻⁴, 1/64, 1/2, 8, which differ both in the spread of the singular values and in the resulting training speed and final test accuracy, as reported by Pennington et al. (2017). (A sketch of this initialization and Jacobian check appears below the table.) |
| Researcher Affiliation | Academia | Piotr Aleksander Sokół and Il Memming Park; Department of Neurobiology and Behavior; Departments of Applied Mathematics and Statistics, and Electrical and Computer Engineering; Institutes for Advanced Computational Science and AI-driven Discovery and Innovation; Stony Brook University, Stony Brook, NY 11733; {memming.park, piotr.sokol}@stonybrook.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code available at: https://github.com/PiotrSokol/info-geom |
| Open Datasets | Yes | Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN |
| Dataset Splits | No | The paper mentions 'maximizing validation set accuracy' and discusses 'training loss' and 'test accuracy', but it does not specify exact percentages or sample counts for the training, validation, or test splits, nor does it cite predefined splits or describe a splitting methodology. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments. |
| Software Dependencies | No | The paper mentions using 'a Riemannian version of ADAM (Kingma & Ba, 2015)' but does not provide version numbers for any software, libraries, or frameworks used. (A sketch of the manifold primitives such an optimizer relies on appears below the table.) |
| Experiment Setup | Yes | Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN, set the width of each layer to N = 400, and chose σ_W, σ_b so that the mean singular value of the input-output Jacobian concentrates on 1 while s²_max varies as a function of q (see Fig. 2). We considered four different critical initializations with q = 10⁻⁴, 1/64, 1/2, 8, which differ both in the spread of the singular values and in the resulting training speed and final test accuracy, as reported by Pennington et al. (2017). (...) All networks were trained with a minibatch size of 1000. (...) The initial learning rates for all the groups, as well as the non-orthogonality penalty (see Eq. 43) for Oblique networks, were chosen via Bayesian optimization, maximizing validation set accuracy after 50 epochs. (A sketch of a soft non-orthogonality penalty appears below the table.) |
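
The quoted setup lends itself to a compact sanity check: build a deep tanh network with scaled random orthogonal weights and inspect the singular values of its input-output Jacobian. The sketch below is a minimal NumPy illustration, not the authors' code; `SIGMA_W` and `SIGMA_B` are placeholder values, since the critical (σ_W, σ_b) pair for each q comes from the mean-field analysis of Pennington et al. (2017) and is not quoted in this report.

```python
import numpy as np

# Minimal sketch: a 200-layer, width-400 tanh network with scaled orthogonal
# weights, and the singular values of its input-output Jacobian.
# SIGMA_W and SIGMA_B are placeholders, NOT the paper's critical values.
DEPTH, WIDTH = 200, 400
SIGMA_W, SIGMA_B = 1.05, 0.025

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Haar-distributed orthogonal matrix via sign-corrected QR."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def jacobian_singular_values(h):
    """Propagate pre-activations h through the network and accumulate
    J = prod_l (W_l D_l) by the chain rule, where D_l = diag(tanh'(h_l))."""
    J = np.eye(WIDTH)
    for _ in range(DEPTH):
        D = np.diag(1.0 - np.tanh(h) ** 2)    # tanh'(h_l)
        W = SIGMA_W * random_orthogonal(WIDTH)
        b = SIGMA_B * rng.standard_normal(WIDTH)
        J = W @ D @ J                         # dh_{l+1}/dh_l = W D_l
        h = W @ np.tanh(h) + b
    return np.linalg.svd(J, compute_uv=False)

s2 = jacobian_singular_values(rng.standard_normal(WIDTH)) ** 2
print(f"mean squared singular value: {s2.mean():.3f}   s^2_max: {s2.max():.3f}")
```

At a genuinely critical (σ_W, σ_b), the mean squared singular value should concentrate near 1 while s²_max grows with q, matching the behavior the quote attributes to Fig. 2.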
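
The non-orthogonality penalty for Oblique networks is referenced only as Eq. 43, whose form is not quoted in this report. A common soft-orthogonality regularizer, λ‖WᵀW − I‖²_F, is sketched below as a plausible stand-in; both the functional form and the helper names are assumptions, not the paper's definition.

```python
import numpy as np

def soft_orthogonality_penalty(W, lam):
    """lam * ||W^T W - I||_F^2: zero iff W has orthonormal columns.
    A stand-in for the paper's Eq. (43), whose exact form is not quoted."""
    E = W.T @ W - np.eye(W.shape[1])
    return lam * np.sum(E ** 2)

def soft_orthogonality_grad(W, lam):
    """Gradient of the penalty above: 4 * lam * W (W^T W - I)."""
    return 4.0 * lam * W @ (W.T @ W - np.eye(W.shape[1]))

# An exactly orthogonal matrix incurs an effectively zero penalty:
W = np.linalg.qr(np.random.default_rng(1).standard_normal((400, 400)))[0]
print(soft_orthogonality_penalty(W, lam=1e-3))  # effectively zero
```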
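
The optimizer is described only as 'a Riemannian version of ADAM'. Without implementation details, the most that can be sketched are the two manifold primitives any such method needs on the orthogonal group: projecting a Euclidean gradient onto the tangent space at W, and retracting an updated matrix back onto the manifold. The sketch below uses plain Riemannian gradient descent rather than ADAM's moment estimates, and all names are illustrative.

```python
import numpy as np

def tangent_project(W, G):
    """Project a Euclidean gradient G onto the tangent space of the
    orthogonal group at W: W * skew(W^T G)."""
    A = W.T @ G
    return W @ (A - A.T) / 2.0

def qr_retract(M):
    """Map an arbitrary matrix back onto the manifold via QR,
    fixing the column-sign ambiguity so the retraction is continuous."""
    Q, R = np.linalg.qr(M)
    return Q * np.sign(np.diag(R))

def riemannian_gd_step(W, G, lr):
    """One Riemannian gradient-descent step: project, step, retract."""
    return qr_retract(W - lr * tangent_project(W, G))

# The update stays on the manifold:
rng = np.random.default_rng(0)
W = qr_retract(rng.standard_normal((400, 400)))
W = riemannian_gd_step(W, G=rng.standard_normal((400, 400)), lr=1e-2)
print(np.allclose(W.T @ W, np.eye(400), atol=1e-8))  # True
```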