Initialization of ReLUs for Dynamical Isometry
Authors: Rebekka Burkholz, Alina Dubatovka
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We train fully-connected ReLU feed forward networks of different depth consisting of L = 1, ..., 10 hidden layers with the same number of neurons N_l = N = 100, 300, 500 and an additional softmax classification layer on MNIST [10] and CIFAR-10 [9] to compare three different initialization schemes: the standard He initialization and our two proposals in Sec. 3, i.e., GSM and orthogonal weights. (See the initialization sketch below the table.) |
| Researcher Affiliation | Academia | Rebekka Burkholz, Department of Biostatistics, Harvard T.H. Chan School of Public Health, 655 Huntington Avenue, Boston, MA 02115, rburkholz@hsph.harvard.edu; Alina Dubatovka, Department of Computer Science, ETH Zurich, Universitätstrasse 6, 8092 Zurich, alina.dubatovka@inf.ethz.ch |
| Pseudocode | No | The paper does not contain any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | We train fully-connected ReLU feed forward networks of different depth... on MNIST [10] and CIFAR-10 [9] |
| Dataset Splits | No | The paper uses MNIST and CIFAR-10 datasets but does not explicitly provide details about training/validation/test dataset splits, specific percentages, or how samples were divided for reproducibility. |
| Hardware Specification | Yes | Each experiment on MNIST was run on 1 Nvidia GTX 1080 Ti GPU, while each experiment on CIFAR-10 was performed on 4 Nvidia GTX 1080 Ti GPUs. |
| Software Dependencies | No | The paper does not specify the version numbers for any software dependencies, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We train fully-connected ReLU feed forward networks of different depth consisting of L = 1, ..., 10 hidden layers with the same number of neurons N_l = N = 100, 300, 500 and an additional softmax classification layer... We focus on minimizing the cross-entropy by Stochastic Gradient Descent (SGD) without batch normalization or any data augmentation techniques... we adapt the learning rate to (0.0001 + 0.003 exp(-step/10^4))/L for MNIST and (0.00001 + 0.0005 exp(-step/10^4))/L for CIFAR-10 for 10^4 SGD steps with a batch size of 100 in all cases. (See the learning-rate sketch below the table.) |
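
The architecture quoted in the Research Type and Experiment Setup rows is a plain fully-connected ReLU stack with a final linear classification layer (softmax folded into the cross-entropy loss). The sketch below, written with PyTorch, shows how such a network could be built with the two schemes that are standard or fully specified here, He initialization and orthogonal weights; the paper's GSM proposal is not reproduced, and the `build_mlp` helper, its parameters, and the sqrt(2) orthogonal gain are illustrative assumptions rather than the authors' implementation.

```python
import torch.nn as nn

def build_mlp(depth: int, width: int, in_dim: int = 784, out_dim: int = 10,
              scheme: str = "he") -> nn.Sequential:
    """Fully-connected ReLU network: `depth` hidden layers of `width` neurons,
    followed by a linear classification layer (softmax is applied in the loss).

    `scheme` selects the weight initialization: "he" (standard He/Kaiming) or
    "orthogonal" (orthogonal weight matrices). The paper's GSM scheme is not
    implemented in this sketch.
    """
    layers, prev = [], in_dim
    for _ in range(depth):
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, out_dim))
    net = nn.Sequential(*layers)

    for m in net.modules():
        if isinstance(m, nn.Linear):
            if scheme == "he":
                nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
            elif scheme == "orthogonal":
                # gain sqrt(2) keeps the forward signal scale comparable to He
                # init for ReLU layers; the exact gain used in the paper is an
                # assumption here.
                nn.init.orthogonal_(m.weight, gain=2 ** 0.5)
            nn.init.zeros_(m.bias)
    return net

# Example: a 10-hidden-layer, width-300 network with orthogonal weights for MNIST.
net = build_mlp(depth=10, width=300, scheme="orthogonal")
```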
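
The learning-rate schedule quoted in the Experiment Setup row decays exponentially over the 10^4 SGD steps and is divided by the number of hidden layers L. A minimal sketch of the two quoted schedules follows; the function names are illustrative, and the negative exponent is inferred from the decaying form of the schedule.

```python
import math

def mnist_lr(step: int, L: int) -> float:
    """Learning rate quoted for MNIST: (0.0001 + 0.003 * exp(-step / 10^4)) / L,
    with L the number of hidden layers and step the SGD update index."""
    return (0.0001 + 0.003 * math.exp(-step / 1e4)) / L

def cifar10_lr(step: int, L: int) -> float:
    """Learning rate quoted for CIFAR-10: (0.00001 + 0.0005 * exp(-step / 10^4)) / L."""
    return (0.00001 + 0.0005 * math.exp(-step / 1e4)) / L

# Example: learning rate at SGD step 5000 for a 10-hidden-layer network on MNIST.
print(mnist_lr(step=5000, L=10))  # (0.0001 + 0.003 * exp(-0.5)) / 10 ≈ 1.92e-4
```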