Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks
Authors: Wei Hu, Lechao Xiao, Jeffrey Pennington
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide empirical evidence to support the results in Sections 4 and 5. To study how depth and width affect convergence speed of gradient descent under orthogonal and Gaussian initialization schemes, we train a family of linear networks with their widths ranging from 10 to 1000 and depths from 1 to 700, on a fixed synthetic dataset (X, Y ). |
| Researcher Affiliation | Collaboration | Wei Hu Princeton University huwei@cs.princeton.edu Lechao Xiao Google Brain xlc@google.com Jeffrey Pennington Google Brain jpennin@google.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating the release of open-source code for the methodology described. |
| Open Datasets | No | We choose X ∈ R^{1024×16} and W* ∈ R^{10×1024}, and set Y = W*X. Entries in X and W* are drawn i.i.d. from N(0, 1). |
| Dataset Splits | No | The paper mentions a 'fixed synthetic dataset' and 'training loss' but does not provide explicit details about training, validation, or test splits. |
| Hardware Specification | No | The paper does not specify any details about the hardware used for the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies or their version numbers. |
| Experiment Setup | Yes | To study how depth and width affect convergence speed of gradient descent under orthogonal and Gaussian initialization schemes, we train a family of linear networks with their widths ranging from 10 to 1000 and depths from 1 to 700, on a fixed synthetic dataset (X, Y ). Each network is trained using gradient descent starting from both Gaussian and orthogonal initializations. (A code sketch of this setup appears below the table.) |
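
Since no code is released (see the "Open Source Code" row), the NumPy sketch below is only an illustration of the setup quoted in the "Open Datasets" and "Experiment Setup" rows. The dataset shapes (X ∈ R^{1024×16}, W* ∈ R^{10×1024}, Y = W*X) come from the table above; the 1/sqrt(fan_in) Gaussian scaling, the QR-based orthogonal layers, the single (depth, width) = (10, 100) setting, the learning rate, and the step count are assumptions made for the sketch, not the authors' exact configuration, and may need tuning.

```python
import numpy as np

# Minimal sketch of the quoted setup. Dataset shapes match the "Open Datasets"
# row; the depth/width pair, the 1/sqrt(fan_in) Gaussian scaling, the QR-based
# orthogonal layers, the learning rate, and the step count are illustrative
# assumptions, not values taken from the paper.

rng = np.random.default_rng(0)

d_in, n_samples, d_out = 1024, 16, 10
X = rng.standard_normal((d_in, n_samples))      # X in R^{1024 x 16}, entries N(0, 1)
W_star = rng.standard_normal((d_out, d_in))     # W* in R^{10 x 1024}, entries N(0, 1)
Y = W_star @ X                                  # targets Y = W* X


def init_layers(depth, width, scheme):
    """Weight matrices of a depth-`depth` linear network with hidden width `width`."""
    dims = [d_in] + [width] * (depth - 1) + [d_out]
    layers = []
    for fan_in, fan_out in zip(dims[:-1], dims[1:]):
        if scheme == "gaussian":
            # i.i.d. Gaussian entries with variance 1/fan_in (assumed scaling)
            W = rng.standard_normal((fan_out, fan_in)) / np.sqrt(fan_in)
        elif scheme == "orthogonal":
            # (partial) orthogonal matrix from the QR decomposition of a Gaussian matrix
            A = rng.standard_normal((max(fan_out, fan_in), min(fan_out, fan_in)))
            Q, _ = np.linalg.qr(A)
            W = Q if fan_out >= fan_in else Q.T
        else:
            raise ValueError(f"unknown scheme: {scheme}")
        layers.append(W)
    return layers


def train(layers, lr=1e-6, steps=500):
    """Full-batch gradient descent on 0.5 * ||W_L ... W_1 X - Y||_F^2."""
    losses = []
    for _ in range(steps):
        # Forward pass, caching every intermediate activation.
        acts = [X]
        for W in layers:
            acts.append(W @ acts[-1])
        resid = acts[-1] - Y
        losses.append(0.5 * np.sum(resid ** 2))
        # Backward pass: for a linear network the gradients are plain matrix products.
        grad_out = resid
        for i in reversed(range(len(layers))):
            grad_W = grad_out @ acts[i].T          # dL/dW_i
            grad_out = layers[i].T @ grad_out      # propagate to the layer below
            layers[i] = layers[i] - lr * grad_W
    return losses


# Compare the two initialization schemes at one (depth, width) setting.
for scheme in ("orthogonal", "gaussian"):
    losses = train(init_layers(depth=10, width=100, scheme=scheme))
    print(f"{scheme:>10}: loss {losses[0]:.1f} -> {losses[-1]:.1f}")
```

Sweeping `depth` and `width` over the ranges quoted in the table (widths 10 to 1000, depths 1 to 700) and recording the loss curves for both schemes would mirror the shape of the experiment described in the "Experiment Setup" row.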