Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness

Authors: Vien V. Mai, Mikael Johansson

ICML 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Numerical results confirm our theoretical developments. Our experiments on phase retrieval, absolute linear regression, and classification with neural networks reaffirm our theoretical findings that gradient clipping can: i) stabilize and guarantee convergence for problems with rapidly growing gradients; ii) retain and sometimes improve the best performance of their unclipped counterparts even on standard problems."
Researcher Affiliation | Academia | "Division of Decision and Control Systems, EECS, KTH Royal Institute of Technology, Stockholm, Sweden."
Pseudocode | No | The algorithm is described by the update equations (4a) and (4b) but is not presented in a formal pseudocode or algorithm block. (A hedged sketch of a generic clipped-gradient step is given after this table.)
Open Source Code | No | The paper does not provide a statement or link indicating that code for the described methodology is open-sourced.
Open Datasets | Yes | "For our last set of experiments, we consider the image classification task on the CIFAR10 dataset (Krizhevsky et al., 2009)"
Dataset Splits | No | The paper mentions the CIFAR10 dataset and the mini-batch size but does not specify training, validation, or test splits (e.g., percentages or sample counts).
Hardware Specification | No | "The computations were enabled by resources provided by the Swedish National Infrastructure for Computing (SNIC), partially funded by the Swedish Research Council through grant agreement no. 2018-05973." This statement is too general and does not identify specific hardware models or specifications.
Software Dependencies | No | The paper mentions PyTorch but does not give version numbers for any software dependencies.
Experiment Setup | Yes | "Following common practice, we use mini-batch size 128, momentum parameter β = 0.9, and weight-decay coefficient 5 × 10⁻⁴ in all experiments. For the stepsizes, we use constant values starting with α₀ and reduce them by a factor of 10 every 50 epochs." (A configuration sketch matching these stated settings appears after this table.)
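Because the report only notes that the method is given by equations (4a) and (4b) without reproducing them, the following is a minimal sketch of a generic norm-clipped stochastic gradient step in PyTorch. The function name `clipped_sgd_step`, the clipping threshold, and the stepsize are illustrative placeholders, not the authors' exact update rule.

```python
import torch

def clipped_sgd_step(params, stepsize, clip_level):
    """Generic norm-clipped SGD update (illustrative, not the paper's exact
    (4a)-(4b)): rescale the stochastic gradient so the step length is at
    most stepsize * clip_level."""
    with torch.no_grad():
        grads = [p.grad for p in params if p.grad is not None]
        # Global gradient norm across all parameters.
        total_norm = torch.norm(torch.stack([g.norm() for g in grads]))
        # Shrink the step whenever the gradient norm exceeds the clip level.
        scale = min(1.0, clip_level / (total_norm.item() + 1e-12))
        for p in params:
            if p.grad is not None:
                p.add_(p.grad, alpha=-stepsize * scale)

# Toy usage on a quadratic loss (illustrative only).
w = torch.randn(5, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()
clipped_sgd_step([w], stepsize=0.1, clip_level=1.0)
```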
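As a hedged illustration of the reported setup (mini-batch size 128, momentum 0.9, weight decay 5 × 10⁻⁴, stepsizes reduced by a factor of 10 every 50 epochs), the sketch below wires those values into a standard PyTorch CIFAR10 training loop. The initial stepsize `alpha0`, the ResNet-18 architecture, the 150-epoch budget, and the clipping threshold `max_norm=1.0` are assumptions not stated in the report, and `clip_grad_norm_` is a standard clipping utility rather than the paper's exact update (4a)-(4b).

```python
import torch
import torchvision
import torchvision.transforms as T

alpha0 = 0.1  # placeholder initial stepsize; not specified in the report

# CIFAR10 with the reported mini-batch size of 128.
train_set = torchvision.datasets.CIFAR10(
    root="./data", train=True, download=True, transform=T.ToTensor())
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True)

model = torchvision.models.resnet18(num_classes=10)  # placeholder architecture
# Reported hyperparameters: momentum 0.9, weight decay 5e-4.
optimizer = torch.optim.SGD(
    model.parameters(), lr=alpha0, momentum=0.9, weight_decay=5e-4)
# Reported schedule: divide the stepsize by 10 every 50 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)

loss_fn = torch.nn.CrossEntropyLoss()
for epoch in range(150):  # epoch budget assumed for illustration
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        # Clipping threshold is illustrative; the report does not state one.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
    scheduler.step()
```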