Learning threshold neurons via edge of stability

Authors: Kwangjun Ahn, Sébastien Bubeck, Sinho Chewi, Yin Tat Lee, Felipe Suarez, Yi Zhang

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This lies in stark contrast to practical wisdom and empirical studies, such as the work of J. Cohen et al. (ICLR 2021), which exhibit startling new phenomena (the edge of stability, or unstable convergence) and potential benefits for generalization in the large learning rate regime. Figure 1: Large step sizes are necessary to learn the threshold neuron of a ReLU network (2) for a simple binary classification task (1). We choose d = 200, n = 300, λ = 3, and run gradient descent with the logistic loss. The weights are initialized as a−, a+ ∼ N(0, 1/(2d)) and b = 0. For each learning rate η, we set the iteration number such that the total time elapsed (iterations × η) is 10. The vertical dashed lines indicate our theoretical prediction of the phase transition phenomenon (precise threshold at η = 8π/d²).
Researcher Affiliation | Collaboration | Kwangjun Ahn, MIT EECS, Cambridge, MA, kjahn@mit.edu; Sébastien Bubeck, Microsoft Research, Redmond, WA, sebubeck@microsoft.com; Sinho Chewi, Institute for Advanced Study, Princeton, NJ, schewi@ias.edu; Yin Tat Lee, Microsoft Research, Redmond, WA, yintat@uw.edu; Felipe Suarez, Carnegie Mellon University, Pittsburgh, PA, felipesc@mit.edu; Yi Zhang, Microsoft Research, Redmond, WA, zhayi@mit.edu
Pseudocode | No | No structured pseudocode or algorithm blocks were found. Gradient descent updates are described in mathematical equations within the text.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | We train the parameters a−, a+, b using gradient descent with step size η > 0 on the logistic loss Σ_{i=1}^n ℓ_logi(y^(i) f(x^(i); a−, a+, b)), where ℓ_logi(z) := log(1 + exp(−z)), and we report the results in Figures 1 and 2. [...] trained on the full sparse coding model (1) with unknown basis, as well as a deep neural network trained on CIFAR-10. [...] sparse coding model (Olshausen and Field, 1997; Vinje and Gallant, 2000; Olshausen and Field, 2004; Yang et al., 2009; Koehler and Risteski, 2018; Allen-Zhu and Li, 2022).
Dataset Splits | No | No explicit details on train/validation/test dataset splits (e.g., percentages, sample counts, or specific splitting methodology) are provided in the paper.
Hardware Specification | No | The paper does not provide specific details regarding the hardware (e.g., GPU/CPU models, memory specifications) used for running the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies or library versions (e.g., PyTorch 1.x, TensorFlow 2.x, Python 3.x) used for the experiments.
Experiment Setup | Yes | We choose d = 200, n = 300, λ = 3, and run gradient descent with the logistic loss. The weights are initialized as a−, a+ ∼ N(0, 1/(2d)) and b = 0. For each learning rate η, we set the iteration number such that the total time elapsed (iterations × η) is 10.
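The "Open Datasets" and "Experiment Setup" rows quote enough of the recipe that a rough reproduction can be sketched. The sketch below is not the authors' code: the data generator and the two-neuron ReLU head f(x) = a+ ReLU(⟨w, x⟩ + b) + a− ReLU(−⟨w, x⟩ + b) with a fixed direction w are illustrative stand-ins for the paper's equations (1) and (2), and the loss is averaged rather than summed over the n examples purely for numerical convenience (which rescales the effective step size). Only the quoted hyperparameters (d = 200, n = 300, λ = 3, a−, a+ ∼ N(0, 1/(2d)), b = 0, iterations × η = 10) are taken from the paper.

```python
# Hedged reproduction sketch (not the authors' code). The data model and the
# two-neuron ReLU head below are illustrative stand-ins for the paper's
# equations (1) and (2); only the quoted hyperparameters come from the text.
import numpy as np

rng = np.random.default_rng(0)

d, n, lam = 200, 300, 3.0      # quoted: d = 200, n = 300, lambda = 3
total_time = 10.0              # quoted: iterations * eta is held fixed at 10

# Hypothetical data generator standing in for the classification task (1):
# label y in {-1, +1}, x = lambda * y * e_1 + Gaussian noise.
y = rng.choice([-1.0, 1.0], size=n)
X = lam * np.outer(y, np.eye(d)[0]) + rng.standard_normal((n, d))

w = np.ones(d) / np.sqrt(d)    # fixed first-layer direction (assumption)
s = X @ w                      # scalar pre-activation for each example

def grads(a_minus, a_plus, b):
    """Mean logistic-loss gradients for f(x) = a_+ ReLU(s + b) + a_- ReLU(-s + b)."""
    rp, rm = np.maximum(s + b, 0.0), np.maximum(-s + b, 0.0)
    f = a_plus * rp + a_minus * rm
    # d/df of log(1 + exp(-y f)); the argument is clipped for numerical stability.
    g = -y / (1.0 + np.exp(np.clip(y * f, -50.0, 50.0)))
    g_ap = np.mean(g * rp)
    g_am = np.mean(g * rm)
    g_b = np.mean(g * (a_plus * (s + b > 0) + a_minus * (-s + b > 0)))
    return g_am, g_ap, g_b

# Sweep step sizes around the threshold quoted in the Figure 1 caption (8*pi/d^2).
for eta in [2e-4, 8 * np.pi / d**2, 2e-3]:
    a_minus, a_plus = rng.normal(0.0, np.sqrt(1.0 / (2 * d)), size=2)  # a_-, a_+ ~ N(0, 1/(2d))
    b = 0.0
    for _ in range(int(total_time / eta)):          # iteration count so iterations * eta = 10
        g_am, g_ap, g_b = grads(a_minus, a_plus, b)
        a_minus, a_plus, b = a_minus - eta * g_am, a_plus - eta * g_ap, b - eta * g_b
    print(f"eta = {eta:.2e}  ->  learned bias b = {b:+.4f}")
```

Comparing the learned bias b across step sizes is the qualitative check suggested by Figure 1; matching the exact phase transition would additionally require the paper's precise data model (1) and network parameterization (2), which this sketch only approximates.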