Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent

Authors: Lingjiong Zhu, Mert Gurbuzbalaban, Anant Raj, Umut Simsekli

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and we obtain time-uniform stability bounds (i.e., the bound does not increase with the number of iterations) for strongly convex losses and non-convex losses with additive noise, where we recover similar results to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds, which might not be achieved for convex or non-convex losses unless additional noise is injected into the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, is novel. ... The proofs are provided in the Appendix. ... The main limitation of our approach is that it requires Lipschitz surrogate loss functions, as it is based on the Wasserstein distance. Hence, our natural next step will be to extend our analysis without such a requirement. Finally, due to the theoretical nature of this study, it does not contain any direct potential societal impacts.
Researcher Affiliation | Academia | Lingjiong Zhu1, Mert Gürbüzbalaban2,3, Anant Raj4,5, Umut Şimşekli5. 1: Dept. of Mathematics, Florida State University; 2: Dept. of Management Science and Information Systems, Rutgers Business School; 3: Center for Statistics and Machine Learning, Princeton University; 4: Coordinated Science Laboratory, University of Illinois Urbana-Champaign; 5: Inria Paris, CNRS, École Normale Supérieure, PSL Research University
Pseudocode | No | The paper describes mathematical recursions (e.g., $\theta_k = \theta_{k-1} - \eta \nabla \hat{F}_k(\theta_{k-1}, X_n)$) within the text, but it does not provide any explicitly labeled pseudocode blocks or algorithms. (A runnable sketch of this recursion is given after the table.)
Open Source Code | No | The paper does not contain any statement about releasing source code or links to a code repository.
Open Datasets | No | The paper is theoretical and does not conduct experiments on specific, named, and publicly available datasets. It discusses 'data distribution' and 'finite data set' as part of the theoretical framework.
Dataset Splits | No | The paper is theoretical and does not describe an experimental setup involving dataset splits for training, validation, or testing.
Hardware Specification | No | The paper is theoretical and does not describe any experimental setup or the hardware used for computations.
Software Dependencies | No | The paper is theoretical and does not describe any software dependencies or versions used for experimental work.
Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations.
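
Since the paper ships no pseudocode or source code, the following is a minimal numerical sketch rather than the authors' implementation. It instantiates the (noisy) SGD recursion quoted in the Pseudocode row, $\theta_k = \theta_{k-1} - \eta \nabla \hat{F}_k(\theta_{k-1}, X_n)$ plus additive Gaussian noise, on a strongly convex quadratic loss, and runs two chains on neighbouring datasets that differ in a single sample, coupled by sharing minibatch indices and noise. The recorded gap between the coupled iterates upper-bounds the Wasserstein distance between the two parameter distributions and, in this strongly convex example, stays bounded as the iteration count grows, which is the time-uniform behaviour the abstract describes. All concrete choices below (the quadratic loss, step size eta, noise level sigma, batch size) are illustrative assumptions, not values taken from the paper.

    # Minimal sketch (not from the paper) of algorithmic stability for noisy SGD:
    # two chains on datasets differing in one sample, coupled by shared randomness.
    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 5
    X = rng.normal(size=(n, d))          # data set S
    X_prime = X.copy()
    X_prime[0] = rng.normal(size=d)      # neighbouring data set S', one sample replaced

    def quadratic_grad(theta, batch):
        # gradient of the strongly convex loss f(theta; x) = 0.5 * ||theta - x||^2,
        # averaged over the minibatch (illustrative assumption, not the paper's loss)
        return theta - batch.mean(axis=0)

    eta, sigma, batch_size, n_iters = 0.1, 0.05, 10, 5000
    theta, theta_prime = np.zeros(d), np.zeros(d)

    gaps = []
    for k in range(n_iters):
        idx = rng.integers(0, n, size=batch_size)           # shared minibatch indices (coupling)
        noise = sigma * np.sqrt(eta) * rng.normal(size=d)   # shared additive Gaussian noise
        # noisy SGD recursion: theta_k = theta_{k-1} - eta * stochastic gradient + noise
        theta = theta - eta * quadratic_grad(theta, X[idx]) + noise
        theta_prime = theta_prime - eta * quadratic_grad(theta_prime, X_prime[idx]) + noise
        gaps.append(np.linalg.norm(theta - theta_prime))

    # The gap does not grow with the iteration count: a time-uniform stability proxy.
    print(f"max gap over {n_iters} iterations: {max(gaps):.4f}")
    print(f"gap at the final iteration:        {gaps[-1]:.4f}")

Sharing minibatch indices and noise across the two chains is a synchronous coupling; because the Wasserstein distance is an infimum over all couplings, the gap recorded by this particular coupling is a valid upper bound on it, which is why coupling arguments pair naturally with the Lyapunov-function drift conditions the abstract mentions.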