Symbolic Discovery of Optimization Algorithms
Authors: Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V Le
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. |
| Researcher Affiliation | Collaboration | Google, UCLA |
| Pseudocode | Yes | Program 1: Discovered optimizer Lion. β1 = 0.9 and β2 = 0.99 by default are derived from Program 4. It only tracks momentum and uses the sign operation to compute the update. The two gray lines compute the standard decoupled weight decay, where λ is the strength. <br>def train(weight, gradient, momentum, lr):<br>&nbsp;&nbsp;update = interp(gradient, momentum, β1)<br>&nbsp;&nbsp;update = sign(update)<br>&nbsp;&nbsp;momentum = interp(gradient, momentum, β2)<br>&nbsp;&nbsp;weight_decay = weight * λ<br>&nbsp;&nbsp;update = update + weight_decay<br>&nbsp;&nbsp;update = update * lr<br>&nbsp;&nbsp;return update, momentum<br>(A runnable sketch of this program follows the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | Our evaluation covers various benchmarks: ImageNet, ImageNet-ReaL [8], ImageNet-V2 [81], ImageNet-A [39], ImageNet-R [38], ImageNet-Sketch [92], ObjectNet [4], CIFAR-100 [50], and Oxford-IIIT Pet [69]. |
| Dataset Splits | Yes | For vision tasks, we train a ViT with three layers, 96 hidden units and three heads, on 10% of ImageNet for 30K steps with batch size 64. The image size is 64×64 and the patch size is 16. For language tasks, we train a Transformer with two layers, 128 hidden units and two heads on LM1B [13] for 20K steps with batch size 64, sequence length 32 and vocabulary size 3K. The evaluation time may vary for different programs, but typically an evaluation can be done on one TPU V2 chip within 20 min. The validation accuracy or perplexity is used as the fitness. |
| Hardware Specification | Yes | Evaluation on the proxies can be completed on one TPU V2 chip within 20 min. Each search experiment utilizes 100 TPU V2 chips and runs for 72 h. This results in a total cost of 3K TPU V2 days. |
| Software Dependencies | No | The paper mentions 'NumPy / JAX' but does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Table 12: Hyperparameters for all the experiments (columns: Model, Dropout, Stoch. Depth, Augmentations, Optimizer, β1, β2, lr, λ, ...). To ensure a fair comparison, we tune the peak learning rate lr and decoupled weight decay λ for both AdamW (Adafactor) and our Lion using a logarithmic scale. (A tuning sketch follows the table.) |
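
Program 1 quoted in the Pseudocode row maps directly onto a short update step. Below is a minimal NumPy sketch, not the authors' code: it assumes `interp(x, y, a)` denotes the linear interpolation `(1 - a) * x + a * y` (so that `a = β` gives Lion's exponential moving averages), and the name `lion_update` is our own.

```python
import numpy as np

def interp(x, y, a):
    """Assumed semantics: linear interpolation (1 - a) * x + a * y."""
    return (1.0 - a) * x + a * y

def lion_update(weight, gradient, momentum, lr,
                beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One Lion step following the structure of Program 1.

    Returns the scaled update (to be subtracted from the weights)
    and the new momentum. β1 = 0.9 and β2 = 0.99 follow the quoted
    defaults; λ is `weight_decay`, a value we picked for illustration.
    """
    update = interp(gradient, momentum, beta1)    # short-horizon EMA
    update = np.sign(update)                      # sign operation
    momentum = interp(gradient, momentum, beta2)  # long-horizon EMA, kept for next step
    update = update + weight * weight_decay       # decoupled weight decay
    update = update * lr
    return update, momentum

# Usage: subtract the returned update from the weights each step.
w = np.random.randn(4)
m = np.zeros_like(w)
g = np.random.randn(4)
upd, m = lion_update(w, g, m, lr=1e-4)
w = w - upd
```

Note how the sign operation gives every coordinate of the update the same magnitude, which is consistent with the caption's remark that Lion only tracks momentum and uses the sign to compute the update.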
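
The Experiment Setup row states that lr and λ were tuned on a logarithmic scale for both AdamW (Adafactor) and Lion. The sketch below illustrates such a log-spaced grid search; `proxy_score` and the grid bounds are hypothetical placeholders, not values from the paper.

```python
import itertools
import numpy as np

def proxy_score(lr, wd):
    """Hypothetical stand-in for a short proxy-task run returning a
    fitness value (the paper uses validation accuracy or perplexity).
    This fake score simply peaks at lr = 1e-4, λ = 0.1."""
    return -abs(np.log10(lr) + 4.0) - abs(np.log10(wd) + 1.0)

# Log-spaced candidate grids (illustrative ranges, not from the paper).
lrs = np.logspace(-5, -3, num=5)  # 1e-5 ... 1e-3
wds = np.logspace(-2, 0, num=5)   # 1e-2 ... 1e0

best_lr, best_wd = max(itertools.product(lrs, wds),
                       key=lambda p: proxy_score(*p))
print(f"best lr = {best_lr:.1e}, best λ = {best_wd:.1e}")
```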