Aiming towards the minimizers: fast convergence of SGD for overparametrized problems
Authors: Chaoyue Liu, Dmitriy Drusvyatskiy, Misha Belkin, Damek Davis, Yian Ma
NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As a concrete illustration of the disparity between theory and practice, Figure 1 depicts the convergence behavior of SGD for training a neural network on the MNIST data set. In both cases, we observe that the estimate stays positive, which suggests that the aiming condition holds. |
| Researcher Affiliation | Academia | Chaoyue Liu*, Dmitriy Drusvyatskiy**, Yian Ma*, Damek Davis***, and Mikhail Belkin*. *Halıcıoğlu Data Science Institute, University of California San Diego; **Mathematics Department, University of Washington; ***School of Operations Research and Information Engineering, Cornell University |
| Pseudocode | Yes | Algorithm 1 SGD(w0, η, T) (a minimal sketch of this loop appears after the table) |
| Open Source Code | No | The paper does not provide an explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | Figure 1: Convergence plot of SGD when training a fully connected neural network with 3 hidden layers and 1000 neurons in each on MNIST (left) and a ResNet-28 on CIFAR-10 (right). We conduct the experiments on two datasets, MNIST and CIFAR-10. |
| Dataset Splits | No | The paper mentions total image counts for MNIST (60k) and CIFAR-10 (60k) but does not provide explicit training/validation/test split percentages or sample counts, nor does it refer to predefined standard splits for reproduction beyond mentioning the datasets themselves. |
| Hardware Specification | Yes | Specifically, we used the resources from SDSC Expanse GPU compute nodes and the NCSA Delta system, via allocation TG-CIS220009. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks used in the experiments. |
| Experiment Setup | Yes | We train a fully-connected neural network on the MNIST dataset. The network has 4 hidden layers, each with 1024 neurons. We optimize the MSE loss using SGD with a batch size of 512 and a learning rate of 0.5. The training was run over 1k epochs, and the ratio E[‖∇ℓ(w, z)‖²] / ‖∇L(w)‖² was evaluated every 100 epochs (a hedged reconstruction of this setup follows the table). |
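
The report quotes only the header Algorithm 1 SGD(w0, η, T). Below is a minimal Python sketch of a constant-step-size minibatch SGD loop with that signature; `sample_minibatch` and `loss_fn` are hypothetical placeholders, not names from the paper, so read this as an illustration of the algorithm's shape rather than the authors' pseudocode.

```python
import torch

def sgd(w0: torch.Tensor, eta: float, T: int, sample_minibatch, loss_fn) -> torch.Tensor:
    """Sketch of SGD(w0, eta, T): T steps of constant-step-size minibatch SGD.

    `sample_minibatch` and `loss_fn` are placeholders for a data sampler and a
    stochastic loss ell(w, z); they are not defined in the paper.
    """
    w = w0.clone().requires_grad_(True)
    for _ in range(T):
        z = sample_minibatch()                  # draw a random minibatch z_t
        loss = loss_fn(w, z)                    # stochastic loss ell(w_t, z_t)
        (grad,) = torch.autograd.grad(loss, w)  # gradient of ell w.r.t. the parameters
        with torch.no_grad():
            w -= eta * grad                     # w_{t+1} = w_t - eta * grad
    return w.detach()
```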
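
The Experiment Setup row gives enough detail to reconstruct a rough training script. The PyTorch sketch below assumes standard torchvision MNIST loading without normalization, ReLU activations, one-hot targets for the MSE loss, and a per-minibatch reading of ℓ(w, z) in the gradient-norm ratio; none of these specifics are stated in the paper, so treat it as an illustration of the reported configuration (4 hidden layers of 1024 neurons, batch size 512, learning rate 0.5, 1k epochs, ratio logged every 100 epochs) rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

def make_mlp(depth=4, width=1024, n_in=28 * 28, n_out=10):
    """Fully connected net with `depth` hidden layers of `width` neurons (ReLU assumed)."""
    layers, d = [nn.Flatten()], n_in
    for _ in range(depth):
        layers += [nn.Linear(d, width), nn.ReLU()]
        d = width
    layers.append(nn.Linear(d, n_out))
    return nn.Sequential(*layers)

def mse_criterion(out, y):
    # MSE against one-hot targets; the paper's exact target encoding is not stated.
    return F.mse_loss(out, F.one_hot(y, 10).float())

def grad_norm_sq(model):
    return sum(p.grad.pow(2).sum().item() for p in model.parameters() if p.grad is not None)

def gradient_ratio(model, loader):
    """Rough estimate of E[||grad ell(w, z)||^2] / ||grad L(w)||^2, with ell read per minibatch."""
    per_batch = []
    for x, y in loader:                      # per-minibatch squared gradient norms
        model.zero_grad()
        mse_criterion(model(x), y).backward()
        per_batch.append(grad_norm_sq(model))
    model.zero_grad()
    for x, y in loader:                      # full gradient = average of minibatch gradients
        (mse_criterion(model(x), y) / len(loader)).backward()
    return sum(per_batch) / len(per_batch) / grad_norm_sq(model)

train_set = datasets.MNIST("data", train=True, download=True, transform=transforms.ToTensor())
loader = DataLoader(train_set, batch_size=512, shuffle=True, drop_last=True)

model = make_mlp()
opt = torch.optim.SGD(model.parameters(), lr=0.5)

for epoch in range(1000):
    for x, y in loader:
        opt.zero_grad()
        mse_criterion(model(x), y).backward()
        opt.step()
    if epoch % 100 == 0:
        print(f"epoch {epoch}: ratio = {gradient_ratio(model, loader):.3f}")
```

The `drop_last=True` flag keeps all minibatches the same size, so accumulating the scaled per-batch backward passes in `gradient_ratio` recovers the full-batch gradient; the ratio evaluation costs two extra passes over the data each time it runs.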