Initialization of ReLUs for Dynamical Isometry

Authors: Rebekka Burkholz, Alina Dubatovka

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We train fully-connected ReLU feedforward networks of different depth consisting of L = 1, ..., 10 hidden layers with the same number of neurons N_l = N = 100, 300, 500 and an additional softmax classification layer on MNIST [10] and CIFAR-10 [9] to compare three different initialization schemes: the standard He initialization and our two proposals in Sec. 3, i.e., GSM and orthogonal weights. (See the initialization sketch after the table.)
Researcher Affiliation | Academia | Rebekka Burkholz, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 655 Huntington Avenue, Boston, MA 02115, rburkholz@hsph.harvard.edu; Alina Dubatovka, Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zurich, alina.dubatovka@inf.ethz.ch
Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper neither states that source code for the described methodology is released nor links to a code repository.
Open Datasets | Yes | We train fully-connected ReLU feedforward networks of different depth... on MNIST [10] and CIFAR-10 [9]
Dataset Splits | No | The paper uses MNIST and CIFAR-10 but does not specify training/validation/test splits, split percentages, or how samples were divided, which limits reproducibility.
Hardware Specification | Yes | Each experiment on MNIST was run on 1 Nvidia GTX 1080 Ti GPU, while each experiment on CIFAR-10 was performed on 4 Nvidia GTX 1080 Ti GPUs.
Software Dependencies | No | The paper does not specify version numbers for any software dependencies, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | We train fully-connected ReLU feedforward networks of different depth consisting of L = 1, ..., 10 hidden layers with the same number of neurons N_l = N = 100, 300, 500 and an additional softmax classification layer... We focus on minimizing the cross-entropy by Stochastic Gradient Descent (SGD) without batch normalization or any data augmentation techniques... we adapt the learning rate to (0.0001 + 0.003 exp(-step/10^4))/L for MNIST and (0.00001 + 0.0005 exp(-step/10^4))/L for CIFAR-10 for 10^4 SGD steps with a batch size of 100 in all cases. (See the training sketch after the table.)
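
The architecture and initialization comparison quoted in the Research Type row can be illustrated with a short sketch. This is not the authors' code: it assumes PyTorch (the paper does not name a framework) and shows the described fully-connected ReLU network with the two standard schemes, He and scaled-orthogonal initialization. The paper's GSM proposal is defined in its Sec. 3 and is not reproduced here.

```python
# Minimal sketch (assumed PyTorch, not the authors' code): a fully-connected
# ReLU network with depth_L hidden layers of width_N units each, plus a final
# classification layer; the softmax is applied implicitly by the loss.
import math
import torch.nn as nn

def make_relu_mlp(depth_L, width_N=100, in_dim=784, n_classes=10, init="orthogonal"):
    layers, dims = [], [in_dim] + [width_N] * depth_L
    for d_in, d_out in zip(dims[:-1], dims[1:]):
        lin = nn.Linear(d_in, d_out)
        if init == "he":
            # He initialization: weights ~ N(0, 2 / fan_in)
            nn.init.kaiming_normal_(lin.weight, nonlinearity="relu")
        elif init == "orthogonal":
            # scaled-orthogonal weights; gain sqrt(2) compensates for ReLU
            nn.init.orthogonal_(lin.weight, gain=math.sqrt(2.0))
        # (the paper's GSM scheme from Sec. 3 is not reproduced here)
        nn.init.zeros_(lin.bias)
        layers += [lin, nn.ReLU()]
    layers.append(nn.Linear(width_N, n_classes))  # classification layer
    return nn.Sequential(*layers)
```

For example, `make_relu_mlp(depth_L=10, width_N=300, init="he")` would correspond to one of the MNIST configurations described above.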
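
The Experiment Setup row can likewise be turned into a hedged training sketch. It again assumes PyTorch, and `train_loader` is a hypothetical loader yielding batches of 100 labelled MNIST images; only the MNIST learning-rate constants are shown, the CIFAR-10 schedule differs only in those constants.

```python
# Hypothetical training sketch of the quoted MNIST setup: plain SGD on the
# cross-entropy, batch size 100, 10^4 steps, no batch norm or augmentation,
# learning rate (0.0001 + 0.003 * exp(-step / 10^4)) / L updated every step.
import math
import torch

def train_mnist(model, train_loader, depth_L, n_steps=10_000):
    loss_fn = torch.nn.CrossEntropyLoss()  # softmax + cross-entropy
    opt = torch.optim.SGD(model.parameters(), lr=1e-4)  # placeholder lr, overwritten below
    step = 0
    while step < n_steps:
        for x, y in train_loader:  # batches of 100 assumed
            lr = (0.0001 + 0.003 * math.exp(-step / 1e4)) / depth_L
            for group in opt.param_groups:  # apply the step-wise schedule
                group["lr"] = lr
            opt.zero_grad()
            loss = loss_fn(model(x.flatten(1)), y)  # flatten images for the MLP
            loss.backward()
            opt.step()
            step += 1
            if step >= n_steps:
                break
    return model
```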