Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Regularized Gradient Clipping Provably Trains Wide and Deep Neural Networks

Authors: Matteo Tucat, Anirbit Mukherjee, Mingfei Sun, Procheta Sen, Omar Rivasplata

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We also present empirical evidence that this theoretically founded δ-GClip algorithm is competitive with the state-of-the-art deep learning heuristics on various neural architectures including modern transformer based architectures. The experiments cover text as well as image data. This section contains experimental evidence for δ-GClip.
Researcher Affiliation | Academia | Matteo Tucat (EMAIL), Department of Computer Science, The University of Manchester; Anirbit Mukherjee (EMAIL), Department of Computer Science, The University of Manchester; Mingfei Sun (EMAIL), Department of Computer Science, The University of Manchester; Procheta Sen (EMAIL), Department of Computer Science, University of Liverpool; Omar Rivasplata (EMAIL), Department of Computer Science, The University of Manchester
Pseudocode | No | The paper defines the algorithms GClip, δ-Regularized-GClip, and Stochastic δ-Regularized-GClip via mathematical definitions (Definitions 1, 2, and 5) with formulas, rather than structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code for all our experiments can be found in the GitHub repository: https://github.com/mingfeisun/delta-gclip
Open Datasets | Yes | The first set is on the standard ResNet-18 (He et al., 2016) being trained on the benchmark CIFAR-10 (Krizhevsky, 2009) dataset... The second set of experiments is training a VAE model on the Fashion-MNIST dataset...
Dataset Splits | Yes | The first set is on the standard ResNet-18 (He et al., 2016) being trained on the benchmark CIFAR-10 (Krizhevsky, 2009) dataset, which we recall is a 10-class image classification task with 50,000 training images and 10,000 test images. The second set of experiments is training a VAE model on the Fashion-MNIST dataset, with 60,000 training samples and 10,000 for testing.
Hardware Specification | Yes | We ran all experiments of this segment using a standard desktop with a GeForce RTX 2060 graphics card.
Software Dependencies | No | The paper mentions that it "used the standard PyTorch optimizers" but does not specify a version number for PyTorch or any other software.
Experiment Setup | Yes | The ResNet-18 was trained on the full training set using mini-batches of size 512. We tested all the following hyperparameter combinations: η ∈ {0.0001, 0.001, 0.01, 0.1, 1, 5}, γ ∈ {0.25, 1, 5, 10}, and δ ∈ {1e-3, 1e-8} for each optimizer. For Adam, only the learning rate (η) was modified; the rest were left at the PyTorch defaults (β1 = 0.9, β2 = 0.999, ε = 1e-8). In the case with scheduling, the η value quoted in the legend denotes the η value at epoch 0, i.e. before any reductions by the scheduling algorithm are done. Here, we start at larger η values and divide η by 10 at epochs 100 and 150, following the setup from Zhang et al. (2020a). Training uses the cross-entropy loss and ReLU-gated networks, with weight decay of 5e-4.
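Since the paper gives δ-Regularized-GClip only as a mathematical definition rather than pseudocode, the hyperparameters above (η, γ, δ) can be read against a sketch of the update rule. The snippet below is a minimal illustration, assuming the standard form η_t = η · max(min(1, γ/‖∇f(x_t)‖), δ); the function name and NumPy usage are illustrative, not taken from the authors' repository:

```python
import numpy as np

def delta_gclip_step(x, grad, eta=0.1, gamma=1.0, delta=1e-3):
    """One assumed δ-Regularized-GClip step: ordinary gradient clipping
    scales the step by min(1, gamma/||g||); the delta floor keeps the
    effective learning rate from vanishing for very large gradients."""
    grad_norm = np.linalg.norm(grad)
    if grad_norm == 0:
        return x  # zero gradient: no update
    scale = max(min(1.0, gamma / grad_norm), delta)
    return x - eta * scale * grad

# Small gradient (||g|| = 0.5 < gamma): scale is 1, plain gradient descent.
x = np.array([1.0, 1.0])
x_next = delta_gclip_step(x, np.array([0.3, 0.4]))
```

With the grid in the table, δ ∈ {1e-3, 1e-8} only matters once ‖∇f‖ exceeds γ/δ, i.e. the regime where plain GClip's step size would otherwise shrink toward zero.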