Aiming towards the minimizers: fast convergence of SGD for overparametrized problems

Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, Yian Ma

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | As a concrete illustration of the disparity between theory and practice, Figure 1 depicts the convergence behavior of SGD for training a neural network on the MNIST data set. In both cases, we observe that the estimate stays positive, which suggests that the aiming condition holds.
Researcher Affiliation | Academia | Chaoyue Liu*, Dmitriy Drusvyatskiy**, Yian Ma*, Damek Davis***, and Mikhail Belkin*; *Halıcıoğlu Data Science Institute, University of California San Diego; **Mathematics Department, University of Washington; ***School of Operations Research and Information Engineering, Cornell University
Pseudocode | Yes | Algorithm 1: SGD(w0, η, T) (a sketch of this loop appears after the table)
Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | Figure 1: Convergence plot of SGD when training a fully connected neural network with 3 hidden layers and 1000 neurons in each on MNIST (left) and a ResNet-28 on CIFAR-10 (right). We conduct the experiments on two datasets, MNIST and CIFAR-10.
Dataset Splits | No | The paper mentions total image counts for MNIST (60k) and CIFAR-10 (60k) but does not provide explicit training/validation/test split percentages or sample counts, nor does it point to predefined standard splits beyond naming the datasets themselves.
Hardware Specification | Yes | Specifically, we used the resources from SDSC Expanse GPU compute nodes and the NCSA Delta system, via allocation TG-CIS220009.
Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the experiments.
Experiment Setup | Yes | We train a fully-connected neural network on the MNIST dataset. The network has 4 hidden layers, each with 1024 neurons. We optimize the MSE loss using SGD with a batch size of 512 and a learning rate of 0.5. The training is run for 1k epochs, and the ratio E[‖∇ℓ(w, z)‖²] / ‖∇L(w)‖² is evaluated every 100 epochs. (An illustrative setup sketch appears after the table.)
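
The Pseudocode row points to Algorithm 1, SGD(w0, η, T), which the paper states but this page does not reproduce. Below is a minimal sketch of such a loop under the standard reading (uniform sampling, constant step size η for T iterations); the helper `sample_grad` and the `data` container are illustrative assumptions, not names taken from the paper.

```python
# Minimal sketch of an SGD(w0, eta, T) loop; `sample_grad` and `data`
# are illustrative placeholders, not names from the paper.
import numpy as np

def sgd(w0, eta, T, sample_grad, data, seed=0):
    """Run T steps of SGD from w0 with constant step size eta.

    sample_grad(w, z) should return the stochastic gradient of the
    per-sample loss l(w, z); data is any indexable collection of samples.
    """
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(T):
        z = data[rng.integers(len(data))]   # draw a sample uniformly at random
        w -= eta * sample_grad(w, z)        # w_{t+1} = w_t - eta * grad l(w_t, z_t)
    return w
```

Any projection, averaging, or stopping rule that the authors' Algorithm 1 may include beyond this plain update is not reflected in the sketch.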
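
The Experiment Setup row fixes the quantities needed to re-create the MNIST run: 4 hidden layers of 1024 neurons, MSE loss, SGD with batch size 512 and learning rate 0.5, 1k epochs, and the ratio E[‖∇ℓ(w, z)‖²] / ‖∇L(w)‖² evaluated every 100 epochs. The following is a minimal sketch of that configuration, assuming PyTorch and torchvision since the paper names no software stack; the helper names (`make_model`, `grad_ratio`, `train`) and the subset sizes used to estimate the gradient-norm ratio are illustrative choices, not the authors' code.

```python
# Illustrative re-creation of the quoted setup: fully-connected net (4 x 1024),
# MSE loss, SGD with batch size 512 and lr 0.5 on MNIST, plus a crude
# Monte-Carlo estimate of E[||grad l(w,z)||^2] / ||grad L(w)||^2.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_model(width=1024, depth=4, n_classes=10):
    layers, dim = [nn.Flatten()], 28 * 28
    for _ in range(depth):                       # 4 hidden layers, 1024 neurons each
        layers += [nn.Linear(dim, width), nn.ReLU()]
        dim = width
    layers.append(nn.Linear(dim, n_classes))
    return nn.Sequential(*layers)

def mse_loss(logits, targets):
    # MSE against one-hot labels, matching the quoted "MSE loss" setup.
    return F.mse_loss(logits, F.one_hot(targets, 10).float())

def sq_grad_norm(model):
    return sum((p.grad.detach() ** 2).sum().item()
               for p in model.parameters() if p.grad is not None)

def grad_ratio(model, dataset, device, n_single=64, n_full=4096):
    """Crude estimate of E[||grad l(w,z)||^2] / ||grad L(w)||^2 on data subsets."""
    model.zero_grad()
    x, y = next(iter(DataLoader(dataset, batch_size=n_full, shuffle=True)))
    mse_loss(model(x.to(device)), y.to(device)).backward()
    full_sq = sq_grad_norm(model)                # proxy for ||grad L(w)||^2
    single_sq = 0.0
    for i, (x, y) in enumerate(DataLoader(dataset, batch_size=1, shuffle=True)):
        if i == n_single:
            break
        model.zero_grad()
        mse_loss(model(x.to(device)), y.to(device)).backward()
        single_sq += sq_grad_norm(model)         # ||grad l(w, z_i)||^2
    model.zero_grad()
    return single_sq / n_single / max(full_sq, 1e-12)

def train(epochs=1000, eval_every=100, batch_size=512, lr=0.5):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    train_set = datasets.MNIST("data", train=True, download=True,
                               transform=transforms.ToTensor())
    loader = DataLoader(train_set, batch_size=batch_size, shuffle=True)
    model = make_model().to(device)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for epoch in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            mse_loss(model(x.to(device)), y.to(device)).backward()
            opt.step()
        if epoch % eval_every == 0:
            print(f"epoch {epoch}: ratio estimate {grad_ratio(model, train_set, device):.3f}")

if __name__ == "__main__":
    train()
```

The per-sample and full-batch gradients in `grad_ratio` are estimated on random subsets purely to keep the sketch cheap; how the authors computed the ratio in Figure 1 is not specified on this page.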