Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Celo: Training Versatile Learned Optimizers on a Compute Diet

Authors: Abhinav Moudgil, Boris Knyazev, Guillaume Lajoie, Eugene Belilovsky

TMLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We compare our proposed optimizer, Celo, with 15 state-of-the-art hand-crafted optimizers and 4 learned optimizers. Optimizers are evaluated with final loss (left) and speedup (right) criteria with respect to Adam on a diverse set of 17 tasks which are out-of-distribution for Celo and include image classification, language modeling, autoencoders, learned optimizer training, etc.
Researcher Affiliation | Collaboration | 1Mila Quebec AI Institute, 2Concordia University, 3Samsung SAIT AI Lab, Montreal, 4Université de Montréal
Pseudocode | Yes | Algorithm 1 Celo update
Open Source Code | Yes | Code is available at: https://github.com/amoudgl/celo
Open Datasets | Yes | We take a set of four meta-training tasks proposed by Metz et al. (2022b) for all our experiments which contains four image-classification datasets including MNIST (LeCun & Cortes, 1998), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky et al., 2009). ... Transformer LM. Three transformer decoder language modeling tasks (Radford et al., 2019) on LM1B (Chelba et al., 2013)... RNN LM. Two recurrent neural network (RNN) language modeling tasks on the LM1B32k (Brants et al., 2007; Chelba et al., 2013) and wikipedia32k (Merity et al., 2016) datasets.
Dataset Splits | Yes | All the learned optimizers are meta-trained with the same setup on a fixed compute budget i.e. given a fixed set of meta-training tasks, we study how far we can push the meta-generalization performance of learned optimizers. We take a set of four meta-training tasks proposed by Metz et al. (2022b) for all our experiments which contains four image-classification datasets including MNIST (LeCun & Cortes, 1998), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky et al., 2009).
Hardware Specification | Yes | With this setup, all our meta-training experiments finish in a day (<24 hours) on a single Nvidia RTX8000 GPU. ... We benchmark runtime on an NVIDIA V100 GPU with 10 seeds per optimizer.
Software Dependencies | No | We are also grateful to the developers of open-source libraries such as JAX (Bradbury et al., 2018), NumPy (Harris et al., 2020), google/learned_optimization, and google-research/rliable, which were instrumental in this research. For fair comparison, all the learned optimizers are meta-trained with exactly the same PES setup... We use Adam (Kingma & Ba, 2015) as the meta optimizer...
Experiment Setup | Yes | All the learned optimizers are meta-trained with the same setup on a fixed compute budget i.e. given a fixed set of meta-training tasks... All the learned optimizers are meta-trained with truncated PES (Vicol et al., 2021) with maximum 2K inner unroll length for 100K meta-iterations using mean loss of the inner loop as the meta-objective... For task augmentation (3), we sample τ uniformly on log scale between 0.001 and 1000 in each meta-iteration. ... maximum inner unroll length 2000 with unroll lengths sampled logarithmically between 100 and 2000, standard deviation 0.01, truncation length 50 and mean inner training loss as the meta-objective. We use Adam (Kingma & Ba, 2015) as the meta optimizer and meta-train for 100K iterations on a single Nvidia RTX8000 GPU. ... sweeping over 3 seeds and 5 learning rates [3e-5, 5e-5, 1e-4, 3e-4, 1e-3]... batch size 64 and ReLU activations.
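The log-scale sampling described in the setup excerpt (τ drawn uniformly on the log scale in [0.001, 1000], unroll lengths drawn logarithmically in [100, 2000]) can be sketched as follows. This is a minimal illustration of log-uniform sampling, not code from the Celo repository; the function name and use of NumPy are assumptions.

```python
import numpy as np

def sample_log_uniform(low, high, rng):
    """Draw a value uniformly on the log scale between low and high (hypothetical helper)."""
    return float(np.exp(rng.uniform(np.log(low), np.log(high))))

rng = np.random.default_rng(0)

# Task-augmentation scale tau, log-uniform in [1e-3, 1e3], redrawn each meta-iteration
tau = sample_log_uniform(1e-3, 1e3, rng)

# Inner unroll length, log-uniform in [100, 2000], rounded to an integer step count
unroll_len = int(round(sample_log_uniform(100, 2000, rng)))
```

Sampling on the log scale makes each order of magnitude equally likely, which is why it is the standard choice when a hyperparameter spans several decades, as τ does here.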