Revealing the Structure of Deep Neural Networks via Convex Duality

Authors: Tolga Ergen, Mert Pilanci

ICML 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Here, we present numerical results to verify our theoretical analysis. We first use synthetic datasets generated from a random data matrix with zero mean and identity covariance, and the corresponding output vector is obtained via a randomly initialized teacher network. We first consider a two-layer linear network with W1 ∈ R^{20×50} and W2 ∈ R^{50×5}. To prove our claim in Remark 2.1, we train the network using GD with different β. In Figure 2a, we plot the rank of W1 as a function of β, as well as the location of the singular values of Y^T X using vertical red lines. This shows that the rank of the layer changes when β is equal to one of the singular values, which verifies Remark 2.1. We also consider a four-layer linear network with W_{1,j} ∈ R^{5×50}, W_{2,j} ∈ R^{50×30}, W_{3,j} ∈ R^{30×40}, and W_{4,j} ∈ R^{40×5}. We then select different regularization parameters such that β1 < β2 < β3 < β4. As illustrated in Figure 2b, β determines the rank of each weight matrix and the rank is the same for all the layers, which matches our results. Moreover, to verify Proposition 3.1, we choose β such that the weights are rank-two. In Figure 3a, we numerically show that all the hidden-layer weight matrices have the same operator and Frobenius norms. We also conduct an experiment for a five-layer ReLU network with W_{1,j} ∈ R^{10×50}, W_{2,j} ∈ R^{50×40}, W_{3,j} ∈ R^{40×30}, W_{4,j} ∈ R^{30×20}, and w_{5,j} ∈ R^{20}. Here, we use data such that X = c a_0^T, where c ∈ R^n_+ and a_0 ∈ R^d. In Figure 3b, we plot the rank of each weight matrix, which converges to one as claimed in Proposition 4.1. We also verify our theory on two real benchmark datasets, i.e., MNIST (LeCun) and CIFAR-10 (Krizhevsky et al., 2014). We first randomly undersample and whiten these datasets. We then convert the labels into one-hot encoded form. Then, we consider a ten-class classification/regression task using three multi-layer ReLU network architectures with L = 3, 4, 5. For each architecture, we use SGD with momentum for training and compare the training/test performance with the corresponding network constructed via the closed-form solutions (without any training) in Theorem 4.3, denoted as 'Theory'. In Figure 4, 'Theory' achieves the optimal training objective, which also yields smaller error and higher test accuracy. Thus, we numerically verify the claims in Theorem 4.3. (A hedged code sketch of the synthetic two-layer rank experiment appears after this table.)
Researcher Affiliation | Academia | Department of Electrical Engineering, Stanford University, CA, USA. Correspondence to: Tolga Ergen <ergen@stanford.edu>.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any concrete access to source code for the methodology described.
Open Datasets | Yes | We also verify our theory on two real benchmark datasets, i.e., MNIST (LeCun) and CIFAR-10 (Krizhevsky et al., 2014).
Dataset Splits | No | The paper mentions 'training/test performance' but does not specify a validation dataset split.
Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments.
Software Dependencies | No | The paper mentions 'PyTorch and TensorFlow' but does not specify version numbers for these or any other software dependencies.
Experiment Setup | Yes | We then consider a ten-class classification/regression task using three multi-layer ReLU network architectures with L = 3, 4, 5. For each architecture, we use SGD with momentum for training and compare the training/test performance with the corresponding network constructed via the closed-form solutions (without any training) in Theorem 4.3, denoted as 'Theory'. ... with 50 neurons per layer. (A sketch of this training setup also follows the table.)
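
The synthetic experiment quoted in the Research Type row can be outlined in a few lines of PyTorch. The following is a minimal, hypothetical sketch, not the authors' code: it trains a two-layer linear network of the quoted shape (20 → 50 → 5) with full-batch gradient descent under weight decay β and reports the numerical rank of W1 against the singular values of Y^T X. The objective scaling, learning rate, step count, and rank tolerance are assumptions and may need tuning.

import torch

torch.manual_seed(0)
n, d, m, k = 100, 20, 50, 5                # samples, input dim, hidden width, outputs
X = torch.randn(n, d)                      # zero-mean, identity-covariance synthetic data
Y = X @ torch.randn(d, k)                  # outputs from a random (linear) teacher

sing_vals = torch.linalg.svdvals(Y.T @ X)  # predicted rank-transition locations (Remark 2.1)

def train_two_layer_linear(beta, steps=20000, lr=5e-4):
    # Full-batch GD on squared loss plus (beta/2) * (||W1||_F^2 + ||W2||_F^2).
    W1 = (0.1 * torch.randn(d, m)).requires_grad_()
    W2 = (0.1 * torch.randn(m, k)).requires_grad_()
    opt = torch.optim.SGD([W1, W2], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (0.5 * (X @ W1 @ W2 - Y).pow(2).sum()
                + 0.5 * beta * (W1.pow(2).sum() + W2.pow(2).sum()))
        loss.backward()
        opt.step()
    return W1.detach(), W2.detach()

def numerical_rank(W, rel_tol=1e-3):
    # Count singular values above a relative threshold.
    s = torch.linalg.svdvals(W)
    return int((s > rel_tol * s[0]).sum())

# Sweep beta across the range spanned by the singular values of Y^T X.
for beta in torch.linspace(0.5 * sing_vals[-1].item(), 1.2 * sing_vals[0].item(), 6):
    W1, _ = train_two_layer_linear(beta.item())
    print(f"beta = {beta.item():8.2f}   rank(W1) = {numerical_rank(W1)}")
print("singular values of Y^T X:", [round(v, 2) for v in sing_vals.tolist()])

If the sketch behaves as Remark 2.1 predicts, rank(W1) should step down from 5 toward 0 as β crosses successive singular values of Y^T X.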
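
For the real-data setup quoted in the Experiment Setup row, a comparable sketch (again hypothetical, not the authors' code) builds an L-layer ReLU network with 50 neurons per hidden layer and trains it with SGD plus momentum on one-hot ten-class targets. The squared-error objective, learning rate, momentum, batch size, epoch count, and the random placeholder data standing in for whitened, undersampled MNIST/CIFAR-10 features are all assumptions; the closed-form 'Theory' construction of Theorem 4.3 is not reproduced here.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset

def relu_net(in_dim, depth, width=50, out_dim=10):
    # depth = L linear layers in total: L - 1 hidden ReLU layers plus a linear output.
    layers, prev = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(prev, width), nn.ReLU()]
        prev = width
    layers.append(nn.Linear(prev, out_dim))
    return nn.Sequential(*layers)

def train(model, X, Y_onehot, epochs=50, lr=1e-2, momentum=0.9, batch=128):
    # SGD with momentum on a squared loss over one-hot targets.
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    loader = DataLoader(TensorDataset(X, Y_onehot), batch_size=batch, shuffle=True)
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss = F.mse_loss(model(xb), yb)
            loss.backward()
            opt.step()
    return model

# Placeholder tensors standing in for whitened, undersampled image features.
X = torch.randn(1000, 100)
Y = F.one_hot(torch.randint(0, 10, (1000,)), num_classes=10).float()
for L in (3, 4, 5):
    model = train(relu_net(X.shape[1], depth=L), X, Y)
    with torch.no_grad():
        train_acc = (model(X).argmax(dim=1) == Y.argmax(dim=1)).float().mean().item()
    print(f"L = {L}   training accuracy = {train_acc:.3f}")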