Minimum norm interpolation by perceptra: Explicit regularization and implicit bias

Authors: Jiyoung Park, Ian Pelakh, Stephan Wojtowytsch

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate how shallow ReLU networks interpolate between known regions. Our analysis shows that empirical risk minimizers converge to a minimum norm interpolant as the number of data points and parameters tends to infinity when a weight decay regularizer is penalized with a coefficient which vanishes at a precise rate as the network width and the number of data points grow. With and without explicit regularization, we numerically study the implicit bias of common optimization algorithms towards known minimum norm interpolants. Training a neural network generally corresponds to solving a non-convex minimization problem. While we provide convergence guarantees for empirical risk minimizers, in general there is no guarantee that a training algorithm finds a global minimizer of an empirical risk functional. Even if convergence holds, it is unclear which minimizer is selected (in the overparametrized regime, where the set of minimizers is a high-dimensional manifold). In settings where the minimum norm interpolant is known (Section 3), we compare numerical solutions to theoretical predictions to better understand (1) the predictive power of theoretically studying empirical risk minimizers and (2) the implicit bias of different optimization algorithms (Sections 4 and 5).
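For intuition, the setting the abstract describes translates into a minimal PyTorch sketch along the following lines. This is not the authors' code (none is released); the width m, sample count n, toy target, and the vanishing rate of the weight-decay coefficient lam are illustrative assumptions, not the paper's exact choices.

```python
# Minimal sketch (assumed, not the authors' code): empirical risk minimization
# for a shallow ReLU network with an explicit weight-decay penalty.
import torch
import torch.nn as nn

n, d, m = 256, 3, 1000           # samples, input dimension, network width (assumed)
lam = 1e-4 / (n * m) ** 0.5      # placeholder vanishing rate, not the paper's precise schedule

x = torch.randn(n, d)
y = x.norm(dim=1, keepdim=True)  # toy target; the paper uses e.g. f(x) = |x|

# Shallow ReLU network: one hidden layer of width m.
model = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))
opt = torch.optim.SGD(model.parameters(), lr=1e-3, weight_decay=lam)

for step in range(10_000):
    opt.zero_grad()
    loss = ((model(x) - y) ** 2).mean()  # empirical risk; weight decay is added by the optimizer
    loss.backward()
    opt.step()
```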
Researcher Affiliation | Academia | Jiyoung Park, Department of Statistics, Texas A&M University, wldyddl5510@tamu.edu; Ian Pelakh, Department of Mathematics, Iowa State University, ispelakh@iastate.edu; Stephan Wojtowytsch, Department of Mathematics, University of Pittsburgh, s.woj@pitt.edu
Pseudocode | No | The paper describes mathematical proofs and numerical experiment procedures in prose, but it does not include any formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement about the release of its source code or a link to a code repository.
Open Datasets | No | The paper describes custom data generation processes for its experiments (e.g., 'we select f(x) = |x|', 'Data is generated from a distribution µ = µ1 + µ2 + µ3') but does not provide access information for a public dataset.
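Since the data are synthetic and no generator is released, a reproduction would have to start from a sketch like the one below. The three mixture components stand in for the unspecified µ1, µ2, µ3 as placeholder Gaussian blobs; only the target f(x) = |x| comes from the paper.

```python
# Hypothetical data generator: the paper's components µ1, µ2, µ3 are not
# specified in this report, so three Gaussian blobs serve as placeholders.
import torch

def sample_mixture(n: int) -> torch.Tensor:
    centers = torch.tensor([[-2.0], [0.0], [2.0]])  # assumed component centers
    idx = torch.randint(len(centers), (n,))         # pick a component uniformly
    return centers[idx] + 0.1 * torch.randn(n, 1)   # small in-component noise

x = sample_mixture(512)
y = x.abs()  # the paper's example target f(x) = |x|
```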
Dataset Splits | No | The paper mentions evaluating performance on 'unseen data' but does not provide specific details on how datasets were split into training, validation, or test sets (e.g., percentages, sample counts, or predefined splits).
Hardware Specification | No | All experiments were performed on the free version of Google Colab or the experimenters' personal computers. One run of the model takes under fifteen minutes on a single graphics processing unit.
Software Dependencies | No | The paper mentions 'PyTorch default hyperparameters' in the context of the ADAM optimizer, implying the use of PyTorch, but it does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | In all experiments in Dimensions 3 and 15, the following hyperparameter settings were used unless otherwise indicated:
1. Normal Xavier initialization with gain α = 2.
2. SGD: learning rate = 10^-2 (Dimension 15), 10^-3 (Dimension 3).
3. Momentum-SGD: learning rate = 10^-3 and momentum µ = 0.99.
4. ADAM: learning rate = 10^-3 and PyTorch default hyperparameters β1 = 0.9, β2 = 0.999, ε = 10^-8.
For experiments in Dimension 31, the learning rate for ADAM is dropped by a factor of 10 after 50 of 150 epochs, and for Momentum-SGD by a factor of 10 after 100 epochs.
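As a sanity check, the reported settings map onto PyTorch roughly as follows. The model, input dimension, and width are placeholders; the initialization gain, learning rates, momentum, ADAM betas/epsilon, and the Dimension-31 step drops follow the row above.

```python
# Sketch of the reported hyperparameters in PyTorch; model and width are assumed.
import torch
import torch.nn as nn

d, m = 15, 1000  # input dimension (Dimension 15 case) and an assumed width
model = nn.Sequential(nn.Linear(d, m), nn.ReLU(), nn.Linear(m, 1))

def init_xavier(layer):
    # Normal Xavier initialization with gain α = 2
    if isinstance(layer, nn.Linear):
        nn.init.xavier_normal_(layer.weight, gain=2.0)
model.apply(init_xavier)

# One optimizer is chosen per run; all three are shown for completeness.
sgd = torch.optim.SGD(model.parameters(), lr=1e-2)   # 1e-3 in Dimension 3
msgd = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.99)
adam = torch.optim.Adam(model.parameters(), lr=1e-3,
                        betas=(0.9, 0.999), eps=1e-8)  # PyTorch defaults: β1, β2, ε

# Dimension 31: drop ADAM's lr by 10x after epoch 50 of 150,
# and Momentum-SGD's lr by 10x after epoch 100 (scheduler.step() once per epoch).
adam_sched = torch.optim.lr_scheduler.MultiStepLR(adam, milestones=[50], gamma=0.1)
msgd_sched = torch.optim.lr_scheduler.MultiStepLR(msgd, milestones=[100], gamma=0.1)
```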