Symbolic Discovery of Optimization Algorithms

Authors: Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V. Le

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT.
Researcher Affiliation | Collaboration | 1 Google, 2 UCLA
Pseudocode | Yes | Program 1: Discovered optimizer Lion. β1 = 0.9 and β2 = 0.99 by default are derived from Program 4. It only tracks momentum and uses the sign operation to compute the update. The two gray lines compute the standard decoupled weight decay, where λ is the strength. (A runnable sketch of this update step is given after the table.)
    def train(weight, gradient, momentum, lr):
        update = interp(gradient, momentum, β1)
        update = sign(update)
        momentum = interp(gradient, momentum, β2)
        weight_decay = weight * λ
        update = update + weight_decay
        update = update * lr
        return update, momentum
Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology.
Open Datasets | Yes | Our evaluation covers various benchmarks: ImageNet, ImageNet ReaL [8], ImageNet V2 [81], ImageNet A [39], ImageNet R [38], ImageNet Sketch [92], ObjectNet [4], CIFAR-100 [50], and Oxford-IIIT Pet [69].
Dataset Splits | Yes | For vision tasks, we train a ViT with three layers, 96 hidden units, and three heads on 10% ImageNet for 30K steps with batch size 64. The image size is 64×64 and the patch size is 16. For language tasks, we train a Transformer with two layers, 128 hidden units, and two heads on LM1B [13] for 20K steps with batch size 64, sequence length 32, and vocabulary size 3K. The evaluation time may vary for different programs, but typically an evaluation can be done on one TPU V2 chip within 20 min. The validation accuracy or perplexity is used as the fitness. (These proxy-task settings are restated as a configuration sketch after the table.)
Hardware Specification | Yes | Evaluation on the proxies can be completed on one TPU V2 chip within 20 min. Each search experiment utilizes 100 TPU V2 chips and runs for 72 h. This results in a total cost of 3K TPU V2 days.
Software Dependencies | No | The paper mentions 'NumPy / JAX' but does not specify version numbers for these or other software dependencies.
Experiment Setup | Yes | Table 12: Hyperparameters for all the experiments (columns: Model, Dropout, Stoch. Depth, Augmentations, Optimizer, β1, β2, lr, λ, ...). To ensure a fair comparison, we tune the peak learning rate lr and decoupled weight decay λ for both AdamW (Adafactor) and our Lion using a logarithmic scale. (A sketch of such a logarithmic-scale sweep follows the table.)
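
To make the quoted Program 1 concrete, here is a minimal runnable sketch of a single Lion update step in plain NumPy. It assumes interp(x, y, β) denotes the convex combination (1 − β)·x + β·y and that the caller applies the returned update as weight = weight − update; the function name lion_update and the wd argument are illustrative and are not taken from the paper.

    import numpy as np

    def lion_update(weight, gradient, momentum, lr, beta1=0.9, beta2=0.99, wd=0.0):
        # One Lion step following Program 1 (a sketch, not the authors' code).
        # interp(x, y, b) is assumed to mean (1 - b) * x + b * y.
        update = np.sign((1.0 - beta1) * gradient + beta1 * momentum)
        # The momentum is updated with a second, slower interpolation (beta2).
        new_momentum = (1.0 - beta2) * gradient + beta2 * momentum
        # Decoupled weight decay, then scale everything by the learning rate.
        update = (update + wd * weight) * lr
        return update, new_momentum

    # Usage: the caller subtracts the returned update from the weights.
    w, m = np.zeros(4), np.zeros(4)
    g = np.array([0.5, -1.0, 2.0, 0.0])
    u, m = lion_update(w, g, m, lr=1e-4, wd=0.1)
    w = w - u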
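
The proxy tasks quoted under Dataset Splits are small enough to restate as plain configuration. The dictionaries below are an illustrative summary of those settings; the field names are my own, and the values are copied from the quote rather than from the paper's code.

    # Illustrative restatement of the quoted proxy-task settings (field names hypothetical).
    VISION_PROXY = {
        "model": "ViT (3 layers, 96 hidden units, 3 heads)",
        "dataset": "10% ImageNet",
        "steps": 30_000,
        "batch_size": 64,
        "image_size": 64,
        "patch_size": 16,
        "fitness": "validation accuracy",
    }

    LANGUAGE_PROXY = {
        "model": "Transformer (2 layers, 128 hidden units, 2 heads)",
        "dataset": "LM1B",
        "steps": 20_000,
        "batch_size": 64,
        "sequence_length": 32,
        "vocab_size": 3_000,
        "fitness": "validation perplexity",
    }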
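
The Experiment Setup row notes that the peak learning rate lr and weight decay λ are tuned on a logarithmic scale for both AdamW (Adafactor) and Lion. The sketch below shows what such a log-spaced grid sweep could look like; the grid bounds and the dummy evaluate function are placeholders, not values from the paper.

    import itertools
    import numpy as np

    # Hypothetical log-spaced grids; the paper does not give these exact bounds.
    learning_rates = np.logspace(-5, -3, num=5)   # 1e-5 ... 1e-3
    weight_decays = np.logspace(-2, 0, num=5)     # 1e-2 ... 1e0

    def evaluate(lr, wd):
        # Placeholder objective standing in for a full training run; a real sweep
        # would train the model with (lr, wd) and return its validation accuracy.
        return -((np.log10(lr) + 4.0) ** 2 + (np.log10(wd) + 1.0) ** 2)

    # Pick the best (lr, wd) pair over the log-spaced grid.
    best_score, best_lr, best_wd = max(
        (evaluate(lr, wd), lr, wd)
        for lr, wd in itertools.product(learning_rates, weight_decays)
    )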