Symbolic Discovery of Optimization Algorithms
Authors: Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, Quoc V Le
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare Lion with widely used optimizers, such as Adam and Adafactor, for training a variety of models on different tasks. On image classification, Lion boosts the accuracy of ViT by up to 2% on ImageNet and saves up to 5x the pre-training compute on JFT. |
| Researcher Affiliation | Collaboration | Google, UCLA |
| Pseudocode | Yes | Program 1: Discovered optimizer Lion. β1 = 0.9 and β2 = 0.99 by default are derived from Program 4. It only tracks momentum and uses the sign operation to compute the update. The two gray lines compute the standard decoupled weight decay, where λ is the strength. <br>def train(weight, gradient, momentum, lr):<br>&nbsp;&nbsp;update = interp(gradient, momentum, β1)<br>&nbsp;&nbsp;update = sign(update)<br>&nbsp;&nbsp;momentum = interp(gradient, momentum, β2)<br>&nbsp;&nbsp;weight_decay = weight * λ<br>&nbsp;&nbsp;update = update + weight_decay<br>&nbsp;&nbsp;update = update * lr<br>&nbsp;&nbsp;return update, momentum<br>(A runnable sketch of this program follows the table.) |
| Open Source Code | No | The paper does not provide an explicit statement or link to open-source code for the described methodology. |
| Open Datasets | Yes | Our evaluation covers various benchmarks: ImageNet, ImageNet-ReaL [8], ImageNet-V2 [81], ImageNet-A [39], ImageNet-R [38], ImageNet-Sketch [92], ObjectNet [4], CIFAR-100 [50], and Oxford-IIIT Pet [69]. |
| Dataset Splits | Yes | For vision tasks, we train a ViT with three layers, 96 hidden units and three heads, on 10% of ImageNet for 30K steps with batch size 64. The image size is 64×64 and the patch size is 16. For language tasks, we train a Transformer with two layers, 128 hidden units and two heads on LM1B [13] for 20K steps with batch size 64, sequence length 32 and vocabulary size 3K. The evaluation time may vary for different programs, but typically an evaluation can be done on one TPU V2 chip within 20 min. The validation accuracy or perplexity is used as the fitness. |
| Hardware Specification | Yes | Evaluation on the proxies can be completed on one TPU V2 chip within 20 min. Each search experiment utilizes 100 TPU V2 chips and runs for 72 h. This results in a total cost of 3K TPU V2 days. |
| Software Dependencies | No | The paper mentions 'NumPy / JAX' but does not specify version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Table 12: Hyperparameters for all the experiments (columns: Model, Dropout, Stoch. Depth, Augmentations, Optimizer, β1, β2, lr, λ, ...). To ensure a fair comparison, we tune the peak learning rate lr and decoupled weight decay λ for both AdamW (Adafactor) and our Lion using a logarithmic scale. (A tuning sketch follows the table.) |
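
Program 1 quoted in the Pseudocode row maps directly onto a short update step. Below is a minimal NumPy sketch, not the authors' code: it assumes `interp(x, y, a)` denotes the linear interpolation `(1 - a) * x + a * y` (so that `a = β` gives Lion's exponential moving averages), and the name `lion_update` is our own.

```python
import numpy as np

def interp(x, y, a):
    """Assumed semantics: linear interpolation (1 - a) * x + a * y."""
    return (1.0 - a) * x + a * y

def lion_update(weight, gradient, momentum, lr,
                beta1=0.9, beta2=0.99, weight_decay=0.01):
    """One Lion step following the structure of Program 1.

    Returns the scaled update (to be subtracted from the weights)
    and the new momentum. β1 = 0.9 and β2 = 0.99 follow the quoted
    defaults; λ is `weight_decay`, a value we picked for illustration.
    """
    update = interp(gradient, momentum, beta1)    # short-horizon EMA
    update = np.sign(update)                      # sign operation
    momentum = interp(gradient, momentum, beta2)  # long-horizon EMA, kept for next step
    update = update + weight * weight_decay       # decoupled weight decay
    update = update * lr
    return update, momentum

# Usage: subtract the returned update from the weights each step.
w = np.random.randn(4)
m = np.zeros_like(w)
g = np.random.randn(4)
upd, m = lion_update(w, g, m, lr=1e-4)
w = w - upd
```

Note how the sign operation gives every coordinate of the update the same magnitude, which is consistent with the caption's remark that Lion only tracks momentum and uses the sign to compute the update.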
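
The Experiment Setup row states that lr and λ were tuned on a logarithmic scale for both AdamW (Adafactor) and Lion. The sketch below illustrates such a log-spaced grid search; `proxy_score` and the grid bounds are hypothetical placeholders, not values from the paper.

```python
import itertools
import numpy as np

def proxy_score(lr, wd):
    """Hypothetical stand-in for a short proxy-task run returning a
    fitness value (the paper uses validation accuracy or perplexity).
    This fake score simply peaks at lr = 1e-4, λ = 0.1."""
    return -abs(np.log10(lr) + 4.0) - abs(np.log10(wd) + 1.0)

# Log-spaced candidate grids (illustrative ranges, not from the paper).
lrs = np.logspace(-5, -3, num=5)  # 1e-5 ... 1e-3
wds = np.logspace(-2, 0, num=5)   # 1e-2 ... 1e0

best_lr, best_wd = max(itertools.product(lrs, wds),
                       key=lambda p: proxy_score(*p))
print(f"best lr = {best_lr:.1e}, best λ = {best_wd:.1e}")
```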