Parameter-free Clipped Gradient Descent Meets Polyak

Authors: Yuki Takezawa, Han Bao, Ryoma Sato, Kenta Niwa, Makoto Yamada

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We numerically validated our convergence results using a synthetic function and demonstrated the effectiveness of our proposed methods using LSTM, Nano-GPT, and T5.
Researcher Affiliation | Academia | Yuki Takezawa (1,2), Han Bao (1,2), Ryoma Sato (3), Kenta Niwa (4), Makoto Yamada (2); 1: Kyoto University, 2: OIST, 3: NII, 4: NTT Communication Science Laboratories
Pseudocode | Yes | Algorithm 1: Inexact Polyak Stepsize
Open Source Code | Yes | Our code is contained in the supplementary material.
Open Datasets | Yes | For LSTM, Nano-GPT, and T5, we used the Penn Treebank, Shakespeare, and C4 as training datasets, respectively.
Dataset Splits | No | For SGD and Clipped SGD, we tuned the stepsize and gradient clipping threshold on validation datasets.
Hardware Specification | Yes | We ran all experiments on an A100 GPU.
Software Dependencies | No | The paper references specific model implementations (LSTM: https://github.com/salesforce/awd-lstm-lm, Nano-GPT: https://github.com/karpathy/nanoGPT, T5: https://github.com/PiotrNawrot/nanoT5), which imply software dependencies such as PyTorch, but it does not list version numbers for these or other ancillary software components.
Experiment Setup | Yes | "In our experiments, we ran the clipped gradient descent with the following hyperparameters and tuned the hyperparameters by grid search." Table 2 (clipped gradient descent): learning rate {1, 1.0 × 10^-1, ..., 1.0 × 10^-8}; gradient clipping threshold {0.01, 0.1, 1, 5, 10, 15, 20, ∞}. Table 4 (LSTM): learning rate {100, 50, 10, 1, 0.1, 0.01}; gradient clipping threshold {0.5, 1, ..., 4.5, 5, ∞}; batch size 80.
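The Pseudocode row above points to the paper's Algorithm 1 (Inexact Polyak Stepsize), which is not reproduced here. As a point of reference only, the sketch below combines the two classical ingredients the method builds on: the standard Polyak stepsize (which assumes the optimal value f* is known, an assumption the paper's inexact variant is designed to relax) and gradient clipping, run on a synthetic quadratic. The objective, clipping threshold, and step count are illustrative choices, not the paper's.

```python
# Minimal sketch (NOT the paper's Algorithm 1): classical gradient descent with
# a Polyak stepsize capped by a gradient-clipping threshold, on a synthetic
# quadratic whose optimal value f* = 0 is known.
import numpy as np

def f(x):
    return 0.5 * np.dot(x, x)   # synthetic objective, minimized at x = 0 with f* = 0

def grad_f(x):
    return x

def clipped_polyak_gd(x0, f_star=0.0, clip_threshold=1.0, n_steps=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        g = grad_f(x)
        g_norm = np.linalg.norm(g)
        if g_norm == 0.0:
            break
        polyak_lr = (f(x) - f_star) / g_norm**2   # classical Polyak stepsize
        clip_lr = clip_threshold / g_norm         # stepsize implied by clipping
        x = x - min(polyak_lr, clip_lr) * g       # take the more conservative step
    return x

x_hat = clipped_polyak_gd(np.ones(10))
print(f(x_hat))  # close to f* = 0
```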
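The grid search quoted in the Experiment Setup row tunes two hyperparameters of clipped SGD, the learning rate and the clipping threshold, by validation loss (as noted in the Dataset Splits row). Below is a hedged sketch of such a sweep in PyTorch over the Table 2 grids, reading ∞ as "no clipping". The toy linear model, synthetic data, and step count are placeholders, not the paper's LSTM/Nano-GPT/T5 setups.

```python
# Hedged sketch of the Table 2 grid search for clipped SGD: sweep learning rate
# and clipping threshold, keep the configuration with the lowest validation loss.
import itertools
import torch

torch.manual_seed(0)
X = torch.randn(256, 8)
y = X @ torch.randn(8, 1) + 0.1 * torch.randn(256, 1)
X_train, y_train, X_val, y_val = X[:192], y[:192], X[192:], y[192:]

learning_rates = [10.0 ** (-k) for k in range(0, 9)]           # 1, 1e-1, ..., 1e-8
clip_thresholds = [0.01, 0.1, 1, 5, 10, 15, 20, float("inf")]  # inf = no clipping

best = (float("inf"), None)
for lr, clip in itertools.product(learning_rates, clip_thresholds):
    model = torch.nn.Linear(8, 1)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(200):
        opt.zero_grad()
        loss = torch.nn.functional.mse_loss(model(X_train), y_train)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=clip)
        opt.step()
    val_loss = torch.nn.functional.mse_loss(model(X_val), y_val).item()
    if val_loss < best[0]:
        best = (val_loss, (lr, clip))

print("best (val_loss, (lr, clip_threshold)):", best)
```

The Table 4 sweep for the LSTM follows the same pattern with its own learning-rate and clipping-threshold grids and a batch size of 80.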