Information Geometry of Orthogonal Initializations and Training

Authors: Piotr Aleksander Sokół, Il Memming Park

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We first verify the bound derived in Section 3 for networks with random orthogonal weights. We then numerically investigate the behavior of the maximum FIM eigenvalue during training, paying particular attention to the possible benefits of maintaining orthogonality or near-orthogonality during optimization, relative to unconstrained networks. Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN, set the width of each layer to N = 400, and chose σ_W, σ_b so that the mean singular value of the input-output Jacobian concentrates on 1 while s²_max varies as a function of q (see Fig. 2). We considered four different critical initializations, q = 10⁻⁴, 1/64, 1/2, 8, which differ both in the spread of the singular values and in the resulting training speed and final test accuracy, as reported by Pennington et al. (2017).
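To make the quoted setup concrete, here is a minimal NumPy sketch (not the authors' code) of the quantity being verified: the singular-value spectrum of the input-output Jacobian of a deep tanh network with scaled orthogonal weights. The width and depth match the quoted N = 400, 200-layer setting; the σ_W, σ_b values are illustrative stand-ins, since the critical pair depends on q.

```python
# Sketch: singular values of the input-output Jacobian of a deep tanh
# network with random orthogonal weights (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

def random_orthogonal(n):
    """Haar-random orthogonal matrix via QR of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))  # sign fix so Q is Haar-distributed

N, L = 400, 200                  # layer width and depth, as in the paper
sigma_w, sigma_b = 1.05, 0.025   # hypothetical near-critical values

x = rng.standard_normal(N)
J = np.eye(N)
for _ in range(L):
    W = sigma_w * random_orthogonal(N)
    b = sigma_b * rng.standard_normal(N)
    x = np.tanh(W @ x + b)
    D = np.diag(1.0 - x**2)      # tanh'(h) = 1 - tanh(h)^2
    J = D @ W @ J                # chain rule: layer-wise Jacobian product

s = np.linalg.svd(J, compute_uv=False)
print(f"mean squared singular value: {np.mean(s**2):.3f}, s_max^2: {s[0]**2:.3f}")
```

At a critical initialization the printed mean squared singular value should concentrate near 1, while s²_max varies with q, i.e. with the spread of the singular-value distribution.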
Researcher Affiliation | Academia | Piotr Aleksander Sokół and Il Memming Park, Department of Neurobiology and Behavior; Departments of Applied Mathematics and Statistics, and Electrical and Computer Engineering; Institutes for Advanced Computing Science and AI-driven Discovery and Innovation, Stony Brook University, Stony Brook, NY 11733. {memming.park, piotr.sokol}@stonybrook.edu
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code available at: https://github.com/PiotrSokol/info-geom
Open Datasets | Yes | Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN.
Dataset Splits | No | The paper mentions 'maximizing validation set accuracy' and discusses 'training loss' and 'test accuracy', but it does not specify exact percentages or sample counts for the training, validation, or test splits, nor does it reference predefined splits with citations or describe a splitting methodology.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions using 'a Riemannian version of ADAM (Kingma & Ba, 2015)' but does not provide specific version numbers for any software, libraries, or frameworks used.
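The 'Riemannian version of ADAM' is only named, not specified. As a point of reference, the sketch below shows the manifold machinery a plain Riemannian gradient step would use on the oblique manifold (matrices with unit-norm columns), the constraint set of the paper's Oblique networks. The function name and the renormalization retraction are illustrative assumptions, not the authors' implementation.

```python
# Sketch of one Riemannian gradient step on the oblique manifold
# (unit-norm columns); a Riemannian ADAM would apply its moment
# updates to the projected gradient before the retraction.
import numpy as np

def oblique_step(W, G, lr=1e-2):
    # Tangent projection: remove each column's gradient component
    # along the column itself.
    proj = G - W * np.sum(W * G, axis=0, keepdims=True)
    W_new = W - lr * proj
    # Retraction: renormalize columns back onto the unit sphere.
    return W_new / np.linalg.norm(W_new, axis=0, keepdims=True)
```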
Experiment Setup | Yes | Following Pennington et al. (2017), we trained a 200-layer tanh network on CIFAR-10 and SVHN, set the width of each layer to N = 400, and chose σ_W, σ_b so that the mean singular value of the input-output Jacobian concentrates on 1 while s²_max varies as a function of q (see Fig. 2). We considered four different critical initializations, q = 10⁻⁴, 1/64, 1/2, 8, which differ both in the spread of the singular values and in the resulting training speed and final test accuracy, as reported by Pennington et al. (2017). (...) All networks were trained with a minibatch size of 1000. (...) The initial learning rates for all groups, as well as the non-orthogonality penalty (see Eq. 43) for Oblique networks, were chosen via Bayesian optimization, maximizing validation set accuracy after 50 epochs.
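The non-orthogonality penalty itself is not reproduced in this table; its exact form is the paper's Eq. 43. A common soft-orthogonality regularizer of this kind, shown here as a hedged sketch, penalizes the Frobenius deviation of WᵀW from the identity. The name `orthogonality_penalty` and the weight `lam` are hypothetical.

```python
# Sketch of a soft (non-)orthogonality penalty of the kind the setup
# describes: lam * ||W.T @ W - I||_F^2 (assumed form; see the paper's Eq. 43).
import numpy as np

def orthogonality_penalty(W, lam=1e-3):
    n = W.shape[1]
    delta = W.T @ W - np.eye(n)   # deviation from orthogonality
    return lam * np.sum(delta**2) # squared Frobenius norm, weighted by lam
```

In the quoted setup, this penalty weight and the initial learning rates were the quantities tuned jointly by Bayesian optimization against validation accuracy after 50 epochs.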