Beyond Lipschitz: Sharp Generalization and Excess Risk Bounds for Full-Batch GD

Authors: Konstantinos Nikolakakis, Farzin Haddadpour, Amin Karbasi, Dionysios Kalogerias

ICLR 2023

Reproducibility Variable Result LLM Response
Research Type Theoretical We provide sharp path-dependent generalization and excess risk guarantees for the full-batch Gradient Descent (GD) algorithm on smooth losses (possibly non-Lipschitz, possibly nonconvex). At the heart of our analysis is an upper bound on the generalization error, which implies that average output stability and a bounded expected optimization error at termination lead to generalization. This result shows that a small generalization error occurs along the optimization path, and allows us to bypass the Lipschitz or sub-Gaussian assumptions on the loss prevalent in previous works. For nonconvex, convex, and strongly convex losses, we show the explicit dependence of the generalization error on the accumulated path-dependent optimization error, the terminal optimization error, the number of samples, and the number of iterations. Our main purpose in this work is to theoretically establish that full-batch GD indeed generalizes efficiently for general smooth losses.
Researcher Affiliation Collaboration Konstantinos E. Nikolakakis, Farzin Haddadpour, Amin Karbasi¹ & Dionysios S. Kalogerias — Department of Electrical Engineering, Yale University; ¹Yale University & Google Research. {first.last}@yale.edu
Pseudocode No The paper describes the Gradient Descent algorithm mathematically (e.g., equation 41: W_{t+1} = W_t − η_t · (1/n) · Σ_{j=1}^n ∇f(W_t, z_j)) but does not present it in a pseudocode block or a clearly labeled algorithm figure.
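The full-batch GD update the paper analyzes can be sketched as below. This is an illustrative implementation, not code from the paper; the least-squares loss and data are hypothetical, chosen because the quadratic loss is smooth and convex but not globally Lipschitz, matching the setting the paper's bounds cover.

```python
import numpy as np

def full_batch_gd(grad_f, W0, data, steps, eta):
    """Full-batch GD: W_{t+1} = W_t - eta_t * (1/n) * sum_j grad f(W_t, z_j).

    grad_f(W, z) returns the per-sample gradient; eta may be a constant
    or a callable eta(t) giving the step size at iteration t.
    """
    W = np.asarray(W0, dtype=float)
    n = len(data)
    for t in range(steps):
        step = eta(t) if callable(eta) else eta
        # Average the gradient over the entire sample (full batch).
        g = sum(grad_f(W, z) for z in data) / n
        W = W - step * g
    return W

# Illustrative use on noiseless least squares, f(W, (x, y)) = 0.5*(x @ W - y)**2.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
samples = list(zip(X, y))

def grad_ls(W, z):
    x, target = z
    return (x @ W - target) * x  # gradient of 0.5*(x @ W - target)**2

W_hat = full_batch_gd(grad_ls, np.zeros(3), samples, steps=500, eta=0.05)
```

With a step size below 2/L (L the smoothness constant of the empirical risk), the iterates converge to the empirical minimizer, here `w_true`.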
Open Source Code No The paper is theoretical and focuses on mathematical proofs and derivations. It does not mention releasing any source code or provide links to a code repository.
Open Datasets No The paper is purely theoretical and does not involve experimental evaluation on datasets. It refers to "a dataset S = {z_i}_{i=1}^n of i.i.d. samples z_i from an unknown distribution D" in its problem statement, but this is a theoretical construct for defining risk, not an actual dataset used for training.
Dataset Splits No The paper is purely theoretical and does not conduct experiments involving dataset splits (training, validation, or test). The concept of 'validation' in the paper refers to mathematical validation of bounds, not data partitioning.
Hardware Specification No The paper is purely theoretical and does not describe any experiments that would require specific hardware. Therefore, no hardware specifications are mentioned.
Software Dependencies No The paper is theoretical and focuses on mathematical derivations and proofs. It does not mention any software dependencies or their specific version numbers that would be required to reproduce experiments.
Experiment Setup No The paper is purely theoretical and does not involve conducting experiments. Therefore, there is no description of an experimental setup, hyperparameters, or training configurations.