Chaotic Regularization and Heavy-Tailed Limits for Deterministic Gradient Descent

Authors: Soon Hoe Lim, Yijun Wan, Umut Şimşekli

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Empirical results are provided to demonstrate the advantages of MPGD." (Abstract); see also Section 5: Empirical Results.
Researcher Affiliation | Academia | Soon Hoe Lim, Nordita, KTH Royal Institute of Technology and Stockholm University, soon.hoe.lim@su.se; Yijun Wan, Département de Mathématiques et Applications, École Normale Supérieure, Université PSL, wan@clipper.ens.fr; Umut Şimşekli, DI ENS, École Normale Supérieure, Université PSL, CNRS, INRIA, umut.simsekli@inria.fr.
Pseudocode | No | The paper describes its algorithms through mathematical equations and recursions but does not include any block explicitly labeled 'Pseudocode' or 'Algorithm'. (An illustrative, non-authoritative sketch of gradient descent with chaotic perturbations is given after the table.)
Open Source Code | Yes | "Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] See the Appendix for such details."
Open Datasets | Yes | "We consider the Airfoil Self-Noise Dataset [DG17] from the UCI repository." and "We consider training a ResNet-18 classifier [HZRS16] on the CIFAR-10 dataset [Kri09]." (A loading sketch for the Airfoil dataset follows the table.)
Dataset Splits | Yes | "Table 2: ResNet-18 trained on CIFAR-10 for 1000 epochs. Here, accuracy gap = training accuracy − validation accuracy." (A split and accuracy-gap sketch follows the table.)
Hardware Specification | No | The paper states: "We are grateful to the computational resources provided by the Swedish National Infrastructure for Computing (SNIC) at Chalmers Centre for Computational Science and Engineering (C3SE) partially funded by the Swedish Research Council through grant agreement no. 2018-05973." This acknowledges a computing infrastructure but does not provide specifics on the hardware components (e.g., GPU/CPU models, memory).
Software Dependencies | No | The paper mentions machine learning models and techniques such as 'ResNet-18' and 'Nesterov momentum', but it does not specify any software dependencies with version numbers (e.g., 'PyTorch 1.9', 'Python 3.8').
Experiment Setup | Yes | Three excerpts: (1) "For the experiments, we start optimizing from the point (u0, 0), where u0 ∼ U(d) with d = 10, and use the learning rate η = 0.01. We study and compare the behavior of the following schemes: (i) baseline (vanilla GD), (ii) GD with uncorrelated Gaussian noise injection instead, and (iii) MPGD. Figure 2 demonstrates that MPGD can lead to successful optimization of the widening loss, whereas the baseline GD and GD with Gaussian perturbations lead to poor solutions. This is in agreement with our analysis of implicit regularization for MPGD, showing that the injected perturbations effectively favor a small trace of the loss Hessian, thereby biasing the solution toward flatter regions of the loss landscape." (2) "For the training, we use a fully connected shallow neural network of width 16 with ReLU activation and train for 3000 epochs with the learning rate η = 0.1, using mean square error (MSE) as the loss and choosing β = 0.5." (3) "Using the setup of [GGP+21], the reference mini-batch SGD is trained using a batch size of 128 (sampling without replacement), Nesterov momentum of 0.9, and weight decay of 0.0005. The learning rate is warmed up from 0.0 to 0.1 over the first 5 epochs and then reduced via cosine annealing to 0 over the course of training for 300 epochs (resulting in 390 × 300 = 117,000 update steps)." (A sketch of this optimizer and learning-rate schedule follows the table.)
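
Since the paper presents its update rules only as equations, the following is an illustrative sketch of plain gradient descent with an additive perturbation generated by iterating a deterministic chaotic map (the logistic map at r = 4 is used purely as an example). The quadratic loss, the choice of map, and the scaling eps are placeholders and do not reproduce the paper's exact MPGD recursion or its β-dependent scaling.

# Illustrative only: gradient descent with an additive perturbation produced by
# iterating a deterministic chaotic map (logistic map at r = 4). The quadratic
# loss, the choice of map, and the scaling `eps` are placeholders; this is not
# the paper's exact MPGD recursion.
import numpy as np

def grad(x):
    """Gradient of a placeholder quadratic loss f(x) = 0.5 * ||x||^2."""
    return x

def chaotic_perturbed_gd(x0, eta=0.01, eps=0.1, steps=1000):
    x = np.asarray(x0, dtype=float).copy()
    # One chaotic state per coordinate; distinct seeds so trajectories decorrelate.
    state = np.linspace(0.1, 0.9, num=x.size)
    for _ in range(steps):
        state = 4.0 * state * (1.0 - state)    # logistic map in its chaotic regime
        perturbation = eps * (state - 0.5)     # roughly zero-centered signal
        x = x - eta * grad(x) + eta * perturbation
    return x

x_final = chaotic_perturbed_gd(np.ones(10))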
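
For the Airfoil Self-Noise dataset cited under "Open Datasets", a minimal loading sketch with pandas is shown below. The download URL, column names, and whitespace delimiter are assumptions about the UCI file format, not details taken from the paper.

# Sketch of obtaining the UCI Airfoil Self-Noise dataset with pandas.
# The URL, column names, and delimiter are assumptions about the UCI file format.
import pandas as pd

AIRFOIL_URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
               "00291/airfoil_self_noise.dat")
columns = ["frequency_hz", "angle_of_attack_deg", "chord_length_m",
           "free_stream_velocity_ms", "displacement_thickness_m",
           "scaled_sound_pressure_db"]
airfoil = pd.read_csv(AIRFOIL_URL, sep=r"\s+", names=columns)
X = airfoil.iloc[:, :-1].to_numpy()   # 5 input features
y = airfoil.iloc[:, -1].to_numpy()    # target: scaled sound pressure level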
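
For the "Dataset Splits" row, the sketch below shows one way to carve a validation split out of CIFAR-10 and compute the accuracy gap. The 45k/5k split, batch size, and the accuracy() helper are illustrative assumptions; the paper only defines accuracy gap = training accuracy − validation accuracy.

# Sketch of a CIFAR-10 train/validation split and the accuracy-gap computation.
# The 45k/5k split and the accuracy() helper are illustrative assumptions.
import torch
from torch.utils.data import DataLoader, random_split
import torchvision
import torchvision.transforms as T

full_train = torchvision.datasets.CIFAR10(root="./data", train=True,
                                          download=True, transform=T.ToTensor())
train_set, val_set = random_split(full_train, [45_000, 5_000],
                                  generator=torch.Generator().manual_seed(0))
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader = DataLoader(val_set, batch_size=128, shuffle=False)

@torch.no_grad()
def accuracy(model, loader, device="cpu"):
    """Fraction of correctly classified examples in a DataLoader."""
    model.eval()
    correct = total = 0
    for inputs, targets in loader:
        preds = model(inputs.to(device)).argmax(dim=1)
        correct += (preds == targets.to(device)).sum().item()
        total += targets.numel()
    return correct / total

# accuracy_gap = accuracy(model, train_loader) - accuracy(model, val_loader)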
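
Finally, a sketch of the quoted reference mini-batch SGD configuration for ResNet-18 on CIFAR-10 (batch size 128, Nesterov momentum 0.9, weight decay 0.0005, linear warmup from 0.0 to 0.1 over 5 epochs, then cosine annealing to 0 over 300 epochs) is given below in PyTorch. Only the baseline optimizer and schedule are reconstructed; the paper's MPGD perturbations are not included, and the particular scheduler composition is an assumption.

# Sketch of the quoted reference SGD setup for ResNet-18 on CIFAR-10.
# Only the baseline optimizer/schedule is reconstructed; MPGD is not included.
import torch
import torchvision

EPOCHS, WARMUP_EPOCHS, STEPS_PER_EPOCH = 300, 5, 390   # 390 * 300 = 117,000 steps

model = torchvision.models.resnet18(num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                            nesterov=True, weight_decay=5e-4)

# Per-step schedule: LinearLR cannot start exactly at 0, so a tiny factor is used.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-8, end_factor=1.0,
    total_iters=WARMUP_EPOCHS * STEPS_PER_EPOCH)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=(EPOCHS - WARMUP_EPOCHS) * STEPS_PER_EPOCH, eta_min=0.0)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine],
    milestones=[WARMUP_EPOCHS * STEPS_PER_EPOCH])

# In the training loop: loss.backward(); optimizer.step(); scheduler.step() per batch.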