Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Convex Geometry and Duality of Over-parameterized Neural Networks

Authors: Tolga Ergen, Mert Pilanci

JMLR 2021

Reproducibility variables, each with its classified result and the supporting LLM response:
Research Type: Experimental
LLM response: We then run our approach in Theorem 11, i.e., denoted as Theory, and GD on these datasets. In Figure 10, we plot the mean test accuracy (solid lines) of each algorithm along with a one standard deviation confidence band (shaded regions). As illustrated in this example, our approach achieves slightly better generalization performance compared to GD. We also visualize the sample data distributions and the corresponding function fits in Figure 10a, where we provide an example to show the agreement between the solutions found by our approach and GD. We then consider classification tasks and report the performance of the algorithms on MNIST (Le Cun) and CIFAR-10 (Krizhevsky et al., 2014). In order to verify our results in Theorem 15, we run 5 SGD trials with independent initializations for the network parameters, where we use subsampled versions of the datasets. As illustrated in Figures 11 and 12, the network constructed using the closed-form solution achieves the lowest training objective and highest test accuracy for both datasets.
Researcher Affiliation: Academia
LLM response: Tolga Ergen, EMAIL, Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA; Mert Pilanci, EMAIL, Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA
Pseudocode: Yes
LLM response: We also provide the full algorithm in Algorithm 1. Algorithm 1: Cutting Plane based Training Algorithm for Two-Layer NNs (without bias) ... The complete algorithm is also presented in Algorithm 2. Algorithm 2: Convex-RF ... Algorithm 3: Cutting Plane based Training Algorithm for Two-Layer NNs (with bias)
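The paper's Algorithms 1-3 are only named in the evidence above, not reproduced. As a generic illustration of the cutting-plane family those algorithms belong to, the sketch below runs Kelley's method on a one-dimensional convex problem; the function names, tolerances, and objective are hypothetical and are not taken from the paper:

```python
# Generic Kelley cutting-plane method for minimizing a 1-D convex
# function f on [lo, hi], given a (sub)gradient oracle df.
# Illustrative only: this is NOT the paper's Algorithm 1, 2, or 3.

def cutting_plane_min(f, df, lo, hi, tol=1e-8, max_iter=200):
    pts = [lo, hi]          # points where cuts have been generated
    x_best = lo
    for _ in range(max_iter):
        # Each cut is the tangent line a*x + b with slope a = df(x0).
        cuts = [(df(x0), f(x0) - df(x0) * x0) for x0 in pts]

        def model(x):
            # Piecewise-linear lower bound on f built from all cuts.
            return max(a * x + b for a, b in cuts)

        # The model's minimum over [lo, hi] lies at an endpoint or at an
        # intersection of two cuts, so it suffices to check those points.
        cands = [lo, hi]
        for i in range(len(cuts)):
            for j in range(i + 1, len(cuts)):
                a1, b1 = cuts[i]
                a2, b2 = cuts[j]
                if abs(a1 - a2) > 1e-12:
                    x = (b2 - b1) / (a1 - a2)
                    if lo <= x <= hi:
                        cands.append(x)
        x_best = min(cands, key=model)
        # model(x_best) is a lower bound on min f, so a small gap
        # certifies near-optimality.
        if f(x_best) - model(x_best) < tol:
            break
        pts.append(x_best)
    return x_best
```

Each iteration adds one cut at the current candidate, tightening the lower-bound model until it matches the objective value to within the tolerance.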
Open Source Code: No
LLM response: The paper does not contain any explicit statement about releasing its code, nor does it provide a link to a code repository for the methodology described.
Open Datasets: Yes
LLM response: We then consider classification tasks and report the performance of the algorithms on MNIST (Le Cun) and CIFAR-10 (Krizhevsky et al., 2014). We also evaluate the performances on several regression datasets, namely Bank, Boston Housing, California Housing, Elevators, Stock (Torgo), and the Twenty Newsgroups text classification dataset (Mitchell and Learning, 1997). We also remark that all the datasets we use are publicly available and further information, e.g., training and test sizes, can be obtained through the provided references (Le Cun; Krizhevsky et al., 2014; Torgo; new).
Dataset Splits: No
LLM response: For the synthetic dataset, the paper states: 'we generate multiple datasets with nonoverlapping training and test splits.' This is a general statement without specific percentages or counts. For MNIST and CIFAR-10, the captions for Figures 11 and 12 mention: 'where (n, d) = (200, 250), K = 10, β = 10^-3, m = 100' and 'where (n, d) = (60, 60), K = 10, β = 10^-3, m = 100' respectively, indicating subsampled versions, but not the train/test/validation split ratios. The paper also refers to external references for 'training and test sizes' for some datasets, but does not state the specific splits used in its experiments in the main text.
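The paper describes nonoverlapping training and test splits without giving ratios. A minimal sketch of such a split is below; the 80/20 fraction, the seed, and the function name are illustrative assumptions, not values from the paper:

```python
import random

def split_indices(n, test_frac=0.2, seed=0):
    # Nonoverlapping train/test split over n sample indices.
    # test_frac and seed are illustrative defaults, NOT from the paper.
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = round(test_frac * n)
    return idx[n_test:], idx[:n_test]   # (train, test)

# n = 200 matches the subsampled MNIST row count (n, d) = (200, 250).
train_idx, test_idx = split_indices(200)
```

Because the two index lists partition a single permutation, the train and test sets cannot overlap by construction.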
Hardware Specification: No
LLM response: The paper does not mention any specific hardware (e.g., GPU models, CPU types, memory specifications) used for running its experiments.
Software Dependencies: No
LLM response: In order to solve the convex optimization problems in our approach, we use CVX (Grant and Boyd, 2014). However, notice that when dealing with large datasets, e.g., CIFAR-10, plain CVX solvers might need a significant amount of memory. In order to circumvent these issues, we use SPGL1 (van den Berg and Friedlander, 2007) and Super SCS (Themelis and Patrinos, 2019) for large datasets. While these tools are named, specific version numbers are not provided.
Experiment Setup: Yes
LLM response: Figure 11: Training and test performance of 5 independent SGD trials on whitened and sampled MNIST, where (n, d) = (200, 250), K = 10, β = 10^-3, m = 100, and we use squared loss with one-hot encoding. For the method denoted as Theory, we use the layer weights in Theorem 15. Figure 12: Training and test performance of 5 independent SGD trials on whitened and sampled CIFAR-10, where (n, d) = (60, 60), K = 10, β = 10^-3, m = 100, and we use squared loss with one-hot encoding. For Theory, we use the layer weights in Theorem 15. For all the experiments, we use the regularization term (also known as weight decay) to let the algorithms generalize well on unseen data (Krogh and Hertz, 1992).
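The reported setup combines squared loss, one-hot label encoding, and weight decay with coefficient β. A minimal sketch of that objective is below; the linear predictor `X @ W` is a simplifying stand-in (the paper trains two-layer networks, which are not reproduced here), and the function names are hypothetical:

```python
import numpy as np

def one_hot(labels, K):
    # labels: length-n integer class labels in [0, K); returns an (n, K) matrix.
    Y = np.zeros((len(labels), K))
    Y[np.arange(len(labels)), labels] = 1.0
    return Y

def objective(W, X, Y, beta):
    # Squared loss plus weight decay with coefficient beta, as in the
    # reported setup. The linear predictor X @ W is a simplification;
    # the paper's two-layer ReLU networks are not reproduced here.
    residual = X @ W - Y
    return 0.5 * np.sum(residual ** 2) + 0.5 * beta * np.sum(W ** 2)
```

For example, with β = 10^-3 (the value quoted in the figure captions) the weight-decay term contributes 0.5 * 1e-3 * ||W||² to the training objective.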