The Implicit Bias of Minima Stability in Multivariate Shallow ReLU Networks
Authors: Mor Shpigel Nacson, Rotem Mulayoff, Greg Ongie, Tomer Michaeli, Daniel Soudry
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the tightness of our bound on the MNIST dataset, and show that it accurately captures the behavior of the solutions as a function of the step size. Additionally, we prove a depth separation result on the approximation power of ReLU networks corresponding to stable minima of the loss. Specifically, although shallow ReLU networks are universal approximators, we prove that stable shallow networks are not. Namely, there is a function that cannot be well-approximated by stable single hidden-layer ReLU networks trained with a nonvanishing step size. This is while the same function can be realized as a stable two hidden-layer ReLU network. Finally, we prove that if a function is sufficiently smooth (in a Sobolev sense) then it can be approximated arbitrarily well using shallow ReLU networks that correspond to stable solutions of gradient descent. |
| Researcher Affiliation | Academia | Mor Shpigel Nacson & Rotem Mulayoff: Electrical & Computer Engineering, Technion; Greg Ongie: Mathematical and Statistical Sciences, Marquette University; Tomer Michaeli & Daniel Soudry: Electrical & Computer Engineering, Technion |
| Pseudocode | No | No pseudocode or algorithm blocks were found. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | Yes | Next, we present an experiment with binary classification on MNIST (LeCun, 1998) using SGD. |
| Dataset Splits | Yes | For the validation set we used 4000 images from the remaining samples in each class. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details (e.g., library or solver names with version numbers) needed to replicate the experiment. |
| Experiment Setup | Yes | We trained a single hidden-layer ReLU network with k = 40 neurons on the data using GD with various step sizes (runs were stopped when the loss dropped below 10^-8). ... We trained a single hidden-layer ReLU network with k = 200 neurons using SGD with batch size B = 16, and the quadratic loss. ... We ran SGD until the loss dropped below 10^-8 for 2000 consecutive epochs. |
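
The dataset and experiment-setup rows above pin down the main MNIST recipe: binary classification, a held-out validation set of 4000 images per class, a single hidden-layer ReLU network with k = 200 neurons, the quadratic loss, SGD with batch size B = 16, and a stopping rule of the training loss remaining below 10^-8 for 2000 consecutive epochs. The snippet below is a minimal sketch of that recipe, not the authors' code; the digit pair, the ±1 label encoding, the per-class training-set size, and the step size are assumptions not reported in the table.

```python
# Minimal sketch of the reported MNIST setup (not the authors' code).
# Assumed: digit pair (0 vs 1), +/-1 labels, 1000 training images per class,
# and the SGD step size. The reported quantities (k = 200, B = 16, quadratic
# loss, 4000 validation images per class, loss < 1e-8 for 2000 epochs) are
# taken from the table above.
import torch
from torch import nn
from torchvision import datasets

torch.manual_seed(0)

# --- Binary MNIST data with a held-out validation set ---
mnist = datasets.MNIST("./data", train=True, download=True)
digit_a, digit_b = 0, 1                                   # assumed class pair
idx_a = (mnist.targets == digit_a).nonzero(as_tuple=True)[0]
idx_b = (mnist.targets == digit_b).nonzero(as_tuple=True)[0]

n_train = 1000                                            # assumed per-class train size
n_val = 4000                                              # 4000 validation images per class (reported)
train_idx = torch.cat([idx_a[:n_train], idx_b[:n_train]])
val_idx = torch.cat([idx_a[n_train:n_train + n_val], idx_b[n_train:n_train + n_val]])

X = mnist.data.float().div(255).flatten(1)                # (N, 784) pixel vectors
y = (mnist.targets == digit_b).float() * 2 - 1            # assumed +/-1 label encoding
X_tr, y_tr = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# --- Single hidden-layer ReLU network with k = 200 neurons ---
k = 200
model = nn.Sequential(nn.Linear(784, k), nn.ReLU(), nn.Linear(k, 1))
loss_fn = nn.MSELoss()                                    # quadratic loss
opt = torch.optim.SGD(model.parameters(), lr=0.05)        # step size is an assumption

# --- SGD with batch size B = 16, stopping once the full training loss
# --- has stayed below 1e-8 for 2000 consecutive epochs (reported rule) ---
B, consecutive = 16, 0
while consecutive < 2000:
    perm = torch.randperm(len(X_tr))
    for i in range(0, len(X_tr), B):
        batch = perm[i:i + B]
        opt.zero_grad()
        loss_fn(model(X_tr[batch]).squeeze(1), y_tr[batch]).backward()
        opt.step()
    with torch.no_grad():
        full_loss = loss_fn(model(X_tr).squeeze(1), y_tr).item()
    consecutive = consecutive + 1 if full_loss < 1e-8 else 0

with torch.no_grad():
    val_loss = loss_fn(model(X_val).squeeze(1), y_val).item()
```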