Revisiting Gradient Clipping: Stochastic bias and tight convergence guarantees
Authors: Anastasia Koloskova, Hadrien Hendrikx, Sebastian U. Stich
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | illustrate these results with experiments. In this section, we investigate the performance of gradient clipping on logistic regression on the w1a dataset (Platt, 1998), and on the artificial quadratic function $f(x) = \mathbb{E}_{\xi \sim \chi^2(1)}\left[ f(x, \xi) := \frac{L}{2}\lVert x \rVert^2 + \langle x, \xi \rangle \right]$, where $x \in \mathbb{R}^{100}$, we choose $L = 0.1$, and $\chi^2(1)$ is a (coordinate-wise) chi-squared distribution with 1 degree of freedom. The goal is to highlight our theoretical results. |
| Researcher Affiliation | Academia | 1EPFL, Switzerland 2Inria Grenoble, France (work done in part while at EPFL) 3CISPA Helmholtz Center for Information Security, Germany. |
| Pseudocode | No | The paper describes the clipped gradient descent algorithm using mathematical equations (e.g., $x_{t+1} = x_t - \eta g_t$, with $g_t = \mathrm{clip}_c(\nabla f_\xi(x_t))$), but it does not present this as a formal pseudocode block or algorithm listing. |
| Open Source Code | No | The paper does not include any explicit statement about releasing source code or provide a link to a code repository for the methodology described. |
| Open Datasets | Yes | In this section, we investigate the performance of gradient clipping on logistic regression on the w1a dataset (Platt, 1998) |
| Dataset Splits | No | The paper mentions using the 'w1a dataset' and an 'artificial quadratic function' for experiments, but it does not specify any dataset splits (e.g., percentages or sample counts for training, validation, or testing). |
| Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU or CPU models, memory specifications) used to run the experiments. |
| Software Dependencies | No | The paper does not specify any software dependencies with version numbers (e.g., specific libraries, frameworks, or programming language versions) used for the experiments. |
| Experiment Setup | Yes | In Figures (a) and (b) we see that as soon as the clipping threshold is smaller than or equal to the target gradient norm $\epsilon$, the convergence speed is affected only by a constant. In Figure (c), we see that as the clipping threshold $c$ decreases, the best tuned stepsize (tuned to reach $\epsilon = 10^{-2}$ fastest) decreases. Logistic regression on w1a dataset (batch size = 1). |
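
To make the update rule quoted in the Pseudocode row concrete, here is a minimal sketch of clipped gradient descent in NumPy. It is not the authors' code; the function names (`clip`, `clipped_step`) and the toy values in the usage lines are illustrative choices.

```python
import numpy as np

# Minimal sketch (not the authors' code) of the clipped update rule quoted in
# the Pseudocode row: x_{t+1} = x_t - eta * clip_c(grad f_xi(x_t)).

def clip(g: np.ndarray, c: float) -> np.ndarray:
    """Clipping operator: clip_c(g) = min(1, c / ||g||) * g."""
    norm = np.linalg.norm(g)
    return g if norm <= c else (c / norm) * g

def clipped_step(x: np.ndarray, stochastic_grad: np.ndarray,
                 eta: float, c: float) -> np.ndarray:
    """One iteration of clipped stochastic gradient descent."""
    return x - eta * clip(stochastic_grad, c)

# Toy usage: a gradient of norm 5 is rescaled to norm c = 1 before the step.
x = np.ones(3)
g = np.array([3.0, 4.0, 0.0])
print(clipped_step(x, g, eta=0.1, c=1.0))   # -> [0.94 0.92 1.  ]
```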
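The synthetic experiment quoted in the Research Type row can likewise be sketched end to end. Only the dimension ($x \in \mathbb{R}^{100}$), $L = 0.1$, and the coordinate-wise $\chi^2(1)$ noise are taken from the quoted description; the step size `eta`, threshold `c`, horizon `T`, and initialization below are assumptions for illustration.

```python
import numpy as np

# Hypothetical end-to-end sketch of the quoted synthetic experiment: clipped SGD
# on f(x) = E_xi[(L/2)||x||^2 + <x, xi>] with coordinate-wise chi^2(1) noise,
# x in R^100 and L = 0.1. Step size eta, threshold c, horizon T, and the
# initialization are illustrative assumptions, not values reported in the paper.

def run_clipped_sgd(d: int = 100, L: float = 0.1, eta: float = 0.5,
                    c: float = 0.1, T: int = 20_000, seed: int = 0):
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d)                      # arbitrary starting point (assumption)
    grad_norms = []
    for _ in range(T):
        xi = rng.chisquare(df=1, size=d)            # coordinate-wise chi^2(1) sample
        g = L * x + xi                              # stochastic gradient of f(x, xi)
        norm = np.linalg.norm(g)
        if norm > c:                                # clip_c(g) = min(1, c / ||g||) * g
            g = (c / norm) * g
        x = x - eta * g                             # clipped SGD update
        # Full gradient of f is L*x + E[xi] = L*x + 1, since chi^2(1) has mean 1.
        grad_norms.append(np.linalg.norm(L * x + np.ones(d)))
    return x, grad_norms

x_final, norms = run_clipped_sgd()
print("final true gradient norm:", norms[-1])
```

Sweeping `c` and re-tuning `eta` in this loop would mirror the threshold/stepsize trade-off described in the Experiment Setup row, but the specific values above should not be read as the paper's configuration.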