On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
Authors: Sanjeev Arora, Nadav Cohen, Elad Hazan
ICML 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Theoretical analysis, as well as experiments, show that here depth acts as a preconditioner which may accelerate convergence. Even on simple convex problems such as linear regression with ℓp loss, p > 2, gradient descent can benefit from transitioning to a non-convex overparameterized objective, more than it would from some common acceleration schemes. [...] In this section we put these claims to the test, through a series of empirical evaluations based on TensorFlow toolbox (Abadi et al., 2016). For conciseness, many of the details behind our implementation are deferred to Appendix C. [...] Figure 2 shows convergence (training objective per iteration) of gradient descent optimizing depth-2 and depth-3 linear networks, against optimization of a single layer model using the respective preconditioning schemes (Equation 12 with N = 2, 3). (See the first code sketch below the table.) |
| Researcher Affiliation | Collaboration | 1) Department of Computer Science, Princeton University, Princeton, NJ, USA; 2) School of Mathematics, Institute for Advanced Study, Princeton, NJ, USA; 3) Google Brain, USA. |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using the 'TensorFlow toolbox' (a third-party tool) and defers implementation details to an appendix, but there is no explicit statement of, or link to, a release of the authors' own code. |
| Open Datasets | Yes | The dataset chosen was UCI Machine Learning Repository’s Gas Sensor Array Drift at Different Concentrations (Vergara et al., 2012; Rodriguez-Lujan et al., 2014). Specifically, we used the dataset’s Ethanol problem – a scalar regression task with 2565 examples, each comprising 128 features (one of the largest numeric regression tasks in the repository). |
| Dataset Splits | No | The paper does not explicitly provide training/validation/test split information. It refers to a training objective and, implicitly, to a test set, but gives no split percentages or sample counts for the partitions, and it does not mention a validation set. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, processor types, or memory amounts used for running its experiments. It only mentions the use of the 'TensorFlow toolbox'. |
| Software Dependencies | No | The paper mentions the 'TensorFlow toolbox' and 'SciPy' but does not provide version numbers for these or any other ancillary software components, which would be required for reproducibility. |
| Experiment Setup | Yes | In all experiments, initial weights were drawn from a zero-mean normal distribution with standard deviation 0.01. Learning rates were found through grid search, with grid {1e-2, 1e-3, 1e-4, 1e-5}. Unless otherwise indicated, weight decay coefficient was set to zero. [...] For the experiments of Figure 4-right, where Adam optimizer was used, we relied on TensorFlow's default settings for learning rate (0.001) and β parameters (β1=0.9, β2=0.999). [...] As for the experiment of Figure 5-right, for the MNIST convolutional network tutorial, we used TensorFlow's default hyperparameter settings, namely: learning rate 0.01 (constant), dropout rate 0.5, RMSProp optimizer with decay 0.9 and momentum 0.9. Initial weights were drawn from truncated normal distribution with standard deviation 0.1. (See the second code sketch below the table.) |
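
The depth-2 vs. single-layer comparison quoted in the Research Type row can be illustrated with a short self-contained sketch. This is not the authors' code (none is released, per the Open Source Code row): the data are synthetic, the dimensions merely mirror the 2565-example, 128-feature Ethanol task, the learning rate and step count are illustrative, and the closed-form preconditioning update of Equation 12 is not reproduced. The sketch only contrasts plain gradient descent on the direct parameterization with gradient descent on an overparameterized depth-2 factorization of the same ℓ4 linear regression objective.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 2565, 128                           # sizes mirror the UCI Ethanol task (illustrative)
X = rng.normal(size=(n, d))                # synthetic features (the paper uses real data)
y = X @ (rng.normal(size=d) / np.sqrt(d))  # synthetic scalar targets

def l4_grad(w):
    """Gradient of the l4 loss (1/n) * sum_i (x_i^T w - y_i)^4 with respect to w."""
    r = X @ w - y
    return (4.0 / n) * (X.T @ r**3)

lr, steps = 1e-3, 10000                    # lr taken from the paper's grid; step count arbitrary

# Depth-1 (direct) parameterization: gradient descent on w itself.
w = rng.normal(scale=0.01, size=d)
for _ in range(steps):
    w -= lr * l4_grad(w)

# Depth-2 overparameterization: the end-to-end weight is W1.T @ w2,
# and gradient descent is run on the factors (W1, w2) instead.
W1 = rng.normal(scale=0.01, size=(d, d))
w2 = rng.normal(scale=0.01, size=d)
for _ in range(steps):
    g = l4_grad(W1.T @ w2)                 # gradient w.r.t. the end-to-end weight
    gW1 = np.outer(w2, g)                  # chain rule: dL/dW1 = w2 g^T
    gw2 = W1 @ g                           # chain rule: dL/dw2 = W1 g
    W1 -= lr * gW1
    w2 -= lr * gw2

print("depth-1 final training loss:", np.mean((X @ w - y) ** 4))
print("depth-2 final training loss:", np.mean((X @ (W1.T @ w2) - y) ** 4))
```

The two chain-rule lines are the only place the factorization enters; everything else is identical to the depth-1 run, which is what makes the comparison between the two parameterizations meaningful.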
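
The hyperparameter protocol quoted in the Experiment Setup row (initial weights from a zero-mean normal with standard deviation 0.01, learning rate chosen by grid search over {1e-2, 1e-3, 1e-4, 1e-5}, zero weight decay) can likewise be sketched. The stand-in least-squares model, the step budget, and selection by final training loss are assumptions for illustration, not details taken from the paper; the Adam and RMSProp default settings mentioned in that row are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(2565, 128))               # stand-in data, sized like the Ethanol task
y = X @ (rng.normal(size=128) / np.sqrt(128))  # synthetic targets (illustrative)

def train(lr, steps=1000):
    """Plain gradient descent from a N(0, 0.01^2) initialization, no weight decay."""
    w = rng.normal(scale=0.01, size=X.shape[1])
    n = X.shape[0]
    for _ in range(steps):
        r = X @ w - y
        w -= lr * (2.0 / n) * (X.T @ r)        # gradient of the mean squared error
    return float(np.mean((X @ w - y) ** 2))

grid = [1e-2, 1e-3, 1e-4, 1e-5]                # the grid quoted from the paper
losses = {lr: train(lr) for lr in grid}
best_lr = min(losses, key=losses.get)          # assumed criterion: lowest final training loss
print("losses per learning rate:", losses)
print("selected learning rate:", best_lr)
```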