Exact Solutions of a Deep Linear Network

Authors: Liu Ziyin, Botao Li, Xiangming Meng

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that close to the origin, the landscape of linear nets can indeed approximate that of nonlinear nets quite well. To compare, we plug the solution in Theorem 4 into both linear and nonlinear models of the same architecture and compare the loss values at different values of b around b = 0. For simplicity, we only consider the case D = 1. The activation functions we consider are ReLU, Tanh, and Swish (Ramachandran et al., 2017), a modern and differentiable variant of ReLU. See Fig. 2. To demonstrate, we perform a numerical simulation shown in the right panel of Figure 3, where we train D = 2 nonlinear networks of width 32 with SGD on tasks with varying E[xy]. They are done on a single 3080Ti GPU. (A minimal linear-vs-nonlinear comparison sketch is given after this table.)
Researcher Affiliation | Academia | Liu Ziyin (1), Botao Li (2), Xiangming Meng (3); (1) Department of Physics, The University of Tokyo; (2) Laboratoire de Physique de l'École normale supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris Cité, Paris, France; (3) Institute for Physics of Intelligence, Graduate School of Science, The University of Tokyo
Pseudocode | No | The paper contains mathematical derivations and theorems but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | From the paper's checklist: "3. If you ran experiments... (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The experiments are only for demonstration and are straightforward to reproduce following the theory."
Open Datasets | Yes | Figure 1: Right: ResNet18 on CIFAR10. Using ResNet, one needs to change the dimension of the hidden layer after every bottleneck, and a learnable linear transformation is applied here. Thus, the effective depth of a ResNet would be roughly between the number of its bottlenecks and its total number of blocks. For example, a ResNet18 applied to CIFAR10 often has five bottlenecks and 18 layers in total. (A sketch of loading this dataset with torchvision follows the table.)
Dataset Splits | No | The paper mentions running experiments on CIFAR10 and synthetic tasks but does not provide specific details on the train, validation, and test dataset splits (e.g., percentages or sample counts) in the main text or appendix.
Hardware Specification | Yes | They are done on a single 3080Ti GPU.
Software Dependencies | No | The paper mentions the use of SGD and specific activation functions (ReLU, Tanh, Swish) but does not provide version numbers for any software dependencies or libraries used in the experiments.
Experiment Setup | Yes | The ResNet18 experiment is run on standard CIFAR10 with a standard setup. We run SGD with learning rate 0.05, batch size 128, and a weight decay of 5e-4, and train for 200 epochs. For the synthetic tasks shown in Fig. 3, we use a constant learning rate of 0.01 with Adam and a batch size of 100. (A sketch of these optimizer settings follows the table.)
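
To illustrate the comparison described in the Research Type row, here is a minimal PyTorch sketch that evaluates the loss of a two-layer linear network and otherwise identical nonlinear networks (ReLU, Tanh, Swish) as the weights are scaled by a factor b near zero. The teacher data, the loss, and the way b enters as a simple rescaling of a fixed random weight configuration are illustrative assumptions, not the paper's exact construction from Theorem 4.

```python
# Minimal sketch (not the authors' code): loss of a two-layer network near the
# origin, with a linear activation versus ReLU, Tanh, and Swish. The data,
# width, and the role of b are assumptions for illustration only.
import torch

torch.manual_seed(0)
d_in, width, n = 1, 32, 1000
x = torch.randn(n, d_in)
y = 0.5 * x + 0.1 * torch.randn(n, d_in)       # assumed noisy linear teacher

W1 = torch.randn(width, d_in) / width ** 0.5   # fixed random direction in weight space
W2 = torch.randn(d_in, width) / width ** 0.5

activations = {
    "linear": lambda z: z,
    "relu":   torch.relu,
    "tanh":   torch.tanh,
    "swish":  lambda z: z * torch.sigmoid(z),  # Swish / SiLU (Ramachandran et al., 2017)
}

def loss(act, b):
    """MSE loss when both weight matrices are scaled by b (near the origin for small b)."""
    h = act(b * (x @ W1.T))
    pred = b * (h @ W2.T)
    return torch.mean((pred - y) ** 2).item()

for b in [-0.2, -0.1, 0.0, 0.1, 0.2]:
    row = {name: round(loss(act, b), 4) for name, act in activations.items()}
    print(f"b = {b:+.1f}: {row}")
```

Near b = 0, Tanh behaves approximately like the identity and Swish like z/2, so their loss values track a (rescaled) linear network closely, which is the qualitative point of the row above; ReLU matches only on the positive side.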
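For the Open Datasets row, a hedged sketch of obtaining CIFAR10 with torchvision and instantiating a ResNet18. The normalization constants, the augmentation, and the choice of the stock torchvision ResNet18 are common defaults assumed here; the paper does not release code or specify its exact CIFAR10-sized variant.

```python
# Sketch (assumptions, not released code): loading CIFAR10 and a ResNet18
# as referenced in the Open Datasets row.
import torch
import torchvision
import torchvision.transforms as T

transform = T.Compose([
    T.RandomCrop(32, padding=4),        # typical CIFAR10 augmentation (assumed)
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),  # common CIFAR10 stats
])

train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)

# Stock torchvision ResNet18 with a 10-class head; whether the paper used a
# CIFAR-adapted first convolution is not specified.
model = torchvision.models.resnet18(num_classes=10)
```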
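For the Experiment Setup row, a sketch of the reported hyperparameters expressed as PyTorch optimizer configurations. Only the learning rates, batch sizes, weight decay, and epoch count come from the paper; the momentum setting, any learning-rate schedule, and the toy synthetic model are not reported and are assumptions here.

```python
# Sketch of the reported hyperparameters. Only lr, batch size, weight decay,
# and epochs (ResNet18/CIFAR10) plus lr and batch size (synthetic/Adam) are
# taken from the paper; everything else is an assumption.
import torch
import torch.nn as nn
import torchvision

# ResNet18 on CIFAR10: SGD with lr 0.05, weight decay 5e-4, batch size 128,
# 200 epochs. Momentum is not reported, so it is left at the default here.
model = torchvision.models.resnet18(num_classes=10)
sgd = torch.optim.SGD(model.parameters(), lr=0.05, weight_decay=5e-4)
epochs, batch_size = 200, 128

# Synthetic tasks (Fig. 3): Adam with a constant lr of 0.01, batch size 100.
# The depth-2, width-32 nonlinear network is inferred from the Research Type
# row above and is an assumption.
toy_model = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
adam = torch.optim.Adam(toy_model.parameters(), lr=0.01)
toy_batch_size = 100
```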