Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees

Authors: Anastasia Koloskova, Hadrien Hendrikx, Sebastian U Stich

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | …illustrate these results with experiments. In this section, we investigate the performance of gradient clipping on logistic regression on the w1a dataset (Platt, 1998), and on the artificial quadratic function f(x) = E_{ξ ∼ χ²(1)}[ f(x, ξ) := (L/2)‖x‖² + ⟨x, ξ⟩ ], where x ∈ ℝ^100, we choose L = 0.1, and χ²(1) is a (coordinate-wise) chi-squared distribution with 1 degree of freedom. The goal is to highlight our theoretical results.
Researcher Affiliation | Academia | EPFL, Switzerland; Inria Grenoble, France (work done in part while at EPFL); CISPA Helmholtz Center for Information Security, Germany.
Pseudocode | No | The paper describes the clipped gradient descent algorithm using mathematical equations (e.g., 'x_{t+1} = x_t − η g_t, with g_t = clip_c(∇f_ξ(x_t))'), but it does not present this as a formal pseudocode block or algorithm listing. (A minimal sketch of this update appears below the table.)
Open Source Code | No | The paper does not include any explicit statement about releasing source code or provide a link to a code repository for the methodology described.
Open Datasets | Yes | In this section, we investigate the performance of gradient clipping on logistic regression on the w1a dataset (Platt, 1998). (A batch-size-1 sketch on w1a appears below the table.)
Dataset Splits | No | The paper mentions using the 'w1a dataset' and an 'artificial quadratic function' for experiments, but it does not specify any dataset splits (e.g., percentages or sample counts for training, validation, or testing).
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU or CPU models, memory specifications) used to run the experiments.
Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., specific libraries, frameworks, or programming language versions) used for the experiments.
Experiment Setup | Yes | In Figures (a) and (b) we see that as soon as the clipping threshold is smaller than or equal to the target gradient norm ϵ, the convergence speed is affected only by a constant. In Figure (c), we see that as the clipping threshold c decreases, the best tuned stepsize (tuned to reach ϵ = 10⁻² fastest) decreases. Logistic regression on w1a dataset (batch size = 1). (A sketch of such a threshold/stepsize sweep appears below the table.)
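
To make the 'Pseudocode' and 'Research Type' rows concrete, here is a minimal NumPy sketch of the update the paper states only as an equation, x_{t+1} = x_t − η·clip_c(∇f_ξ(x_t)), run on the quoted synthetic quadratic (L = 0.1, x ∈ ℝ^100, ξ ~ χ²(1) coordinate-wise). This is our own reconstruction, not the authors' code; the starting point, step count, and seed are assumptions.

```python
import numpy as np

def clip(g, c):
    """clip_c(g): rescale g so that its Euclidean norm is at most c."""
    norm = np.linalg.norm(g)
    return g if norm <= c else (c / norm) * g

def clipped_sgd_quadratic(stepsize, c, num_steps=10_000, dim=100, L=0.1, seed=0):
    """x_{t+1} = x_t - stepsize * clip_c(grad f_xi(x_t)) on
    f(x, xi) = (L/2)||x||^2 + <x, xi>, xi ~ chi^2(1) coordinate-wise (L = 0.1, dim = 100)."""
    rng = np.random.default_rng(seed)
    x = np.ones(dim)                              # starting point not given in the excerpt; assumed
    true_grad_norms = []
    for _ in range(num_steps):
        xi = rng.chisquare(df=1, size=dim)        # coordinate-wise chi-squared noise, 1 d.o.f.
        x = x - stepsize * clip(L * x + xi, c)    # stochastic gradient of f(x, xi) is L*x + xi
        # gradient of the expected objective is L*x + E[xi], and E[chi^2(1)] = 1
        true_grad_norms.append(np.linalg.norm(L * x + 1.0))
    return x, true_grad_norms
```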
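
The 'Experiment Setup' row describes tuning the stepsize for each clipping threshold c so that a target gradient norm ϵ = 10⁻² is reached fastest. A hedged sketch of such a sweep, reusing clipped_sgd_quadratic from the previous block (the threshold and stepsize grids below are ours, not the paper's):

```python
import numpy as np

TARGET_EPS = 1e-2                          # target gradient norm quoted in the excerpt
thresholds = [1.0, 0.1, 0.01]              # assumed grid of clipping thresholds c
stepsizes = np.logspace(-3, 0, 7)          # assumed grid of candidate stepsizes

for c in thresholds:
    best = None                            # (stepsize, first step at which the target is reached)
    for eta in stepsizes:
        _, norms = clipped_sgd_quadratic(stepsize=eta, c=c)
        hit = next((t for t, n in enumerate(norms) if n <= TARGET_EPS), None)
        if hit is not None and (best is None or hit < best[1]):
            best = (eta, hit)
    print(f"c = {c:g}:",
          f"best stepsize {best[0]:.2e} reaches eps in {best[1]} steps" if best
          else "target not reached on this grid")
```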
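
For the w1a experiment noted in the 'Open Datasets' row, the excerpt only states that logistic regression is run with batch size 1. A possible reconstruction, assuming a local LIBSVM-format copy of w1a readable with scikit-learn's load_svmlight_file; the stepsize, threshold, and step count are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.datasets import load_svmlight_file

def clipped_sgd_logreg(path="w1a", stepsize=0.1, c=1.0, num_steps=50_000, seed=0):
    """Batch-size-1 clipped SGD on logistic regression over w1a (LIBSVM format).
    Hyperparameters here are illustrative, not the paper's."""
    X, y = load_svmlight_file(path)               # labels are in {-1, +1}
    X = X.toarray()
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(num_steps):
        i = rng.integers(len(y))                  # draw one sample (batch size = 1)
        margin = y[i] * X[i].dot(w)
        g = -y[i] * X[i] / (1.0 + np.exp(margin)) # gradient of log(1 + exp(-y_i <w, x_i>))
        norm = np.linalg.norm(g)
        if norm > c:
            g *= c / norm                         # clip_c
        w -= stepsize * g
    return w
```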