Parameter-free Clipped Gradient Descent Meets Polyak
Authors: Yuki Takezawa, Han Bao, Ryoma Sato, Kenta Niwa, Makoto Yamada
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We numerically validated our convergence results using a synthetic function and demonstrated the effectiveness of our proposed methods using LSTM, Nano-GPT, and T5. |
| Researcher Affiliation | Academia | Yuki Takezawa (Kyoto University, OIST), Han Bao (Kyoto University, OIST), Ryoma Sato (NII), Kenta Niwa (NTT Communication Science Laboratories), Makoto Yamada (OIST) |
| Pseudocode | Yes | Algorithm 1 Inexact Polyak Stepsize (an illustrative Polyak-style clipped update is sketched below the table). |
| Open Source Code | Yes | Our code is contained in the supplementary material. |
| Open Datasets | Yes | For LSTM, Nano-GPT, and T5, we used the Penn Treebank, Shakespeare, and C4 as training datasets, respectively. |
| Dataset Splits | No | For SGD and Clipped SGD, we tuned the stepsize and gradient clipping threshold on validation datasets. |
| Hardware Specification | Yes | We ran all experiments on an A100 GPU. |
| Software Dependencies | No | The paper references specific model implementations (LSTM: https://github.com/salesforce/awd-lstm-lm, Nano-GPT: https://github.com/karpathy/nanoGPT, T5: https://github.com/PiotrNawrot/nanoT5), which imply software dependencies such as PyTorch, but it does not list version numbers for these or other ancillary software components. |
| Experiment Setup | Yes | In our experiments, we ran the clipped gradient descent with the following hyperparameters and tuned the hyperparameters by grid search. Table 2 (hyperparameter settings for clipped gradient descent): learning rate ∈ {1, 1.0 × 10^-1, ..., 1.0 × 10^-8}; gradient clipping threshold ∈ {0.01, 0.1, 1, 5, 10, 15, 20, ∞}. Table 4 (hyperparameter settings for LSTM): learning rate ∈ {100, 50, 10, 1, 0.1, 0.01}; gradient clipping threshold ∈ {0.5, 1, ..., 4.5, 5, ∞}; batch size 80. (A hedged sketch of this grid search appears below the table.) |
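The Pseudocode row cites Algorithm 1, Inexact Polyak Stepsize. The sketch below is an illustration only: it combines the classical Polyak stepsize with gradient-norm clipping for intuition and is not the authors' Algorithm 1; `polyak_clipped_step`, `f_star_estimate`, and `clip_threshold` are hypothetical names introduced here.

```python
import torch

def polyak_clipped_step(params, loss_fn, f_star_estimate=0.0, clip_threshold=1.0):
    """One clipped gradient-descent update with a classical Polyak stepsize.

    Illustration only: the paper's Algorithm 1 (Inexact Polyak Stepsize) is a
    parameter-free variant; this sketch assumes a known lower bound
    `f_star_estimate` on the optimal value and a manually chosen `clip_threshold`.
    """
    loss = loss_fn()
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads))
    # Classical Polyak stepsize: (f(x) - f*) / ||grad f(x)||^2
    polyak_eta = (loss - f_star_estimate) / (grad_norm ** 2 + 1e-12)
    # Clipping caps the effective stepsize at clip_threshold / ||grad f(x)||
    eta = torch.minimum(polyak_eta, clip_threshold / (grad_norm + 1e-12))
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.sub_(eta * g)
    return loss.item()
```

In use, `params` would be the model's parameter list and `loss_fn` a closure that recomputes the training loss; the point of the paper's parameter-free method is to remove the need to hand-tune the clipping threshold that this sketch still takes as input.

The Experiment Setup row describes grid searches over learning rates and clipping thresholds (Tables 2 and 4). Below is a minimal sketch of such a sweep, assuming a hypothetical `train_and_validate(lr, clip_threshold)` callable that trains clipped gradient descent and returns a validation loss; it is not taken from the authors' code, and the use of `float("inf")` for "no clipping" is an assumption about the last grid entry.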
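```python
import itertools

# Grids following Table 2 (clipped gradient descent); float("inf") stands for
# "no clipping" and is an assumption, not a value quoted from the paper.
LEARNING_RATES = [10.0 ** (-k) for k in range(9)]              # 1, 1e-1, ..., 1e-8
CLIP_THRESHOLDS = [0.01, 0.1, 1, 5, 10, 15, 20, float("inf")]

def grid_search(train_and_validate):
    """Return (best_val_loss, best_lr, best_threshold) over the full grid."""
    best = (float("inf"), None, None)
    for lr, c in itertools.product(LEARNING_RATES, CLIP_THRESHOLDS):
        val_loss = train_and_validate(lr=lr, clip_threshold=c)
        if val_loss < best[0]:
            best = (val_loss, lr, c)
    return best
```

The exhaustive product over both grids matches the tuning protocol stated in the evidence cell (grid search on a validation set); the LSTM sweep from Table 4 would reuse the same loop with its own grids.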