Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Celo: Training Versatile Learned Optimizers on a Compute Diet
Authors: Abhinav Moudgil, Boris Knyazev, Guillaume Lajoie, Eugene Belilovsky
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We compare our proposed optimizer, Celo, with 15 state-of-the-art hand-crafted optimizers and 4 learned optimizers. Optimizers are evaluated with final loss (left) and speedup (right) criteria with respect to Adam on a diverse set of 17 tasks which are out-of-distribution for Celo and include image classification, language modeling, autoencoders, learned optimizer training, etc. |
| Researcher Affiliation | Collaboration | Mila – Quebec AI Institute; Concordia University; Samsung SAIT AI Lab, Montreal; Université de Montréal |
| Pseudocode | Yes | Algorithm 1 Celo update |
| Open Source Code | Yes | Code is available at: https://github.com/amoudgl/celo |
| Open Datasets | Yes | We take a set of four meta-training tasks proposed by Metz et al. (2022b) for all our experiments which contains four image-classification datasets including MNIST (LeCun & Cortes, 1998), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky et al., 2009). ... Transformer LM. Three transformer decoder language modeling tasks (Radford et al., 2019) on LM1B (Chelba et al., 2013)... RNN LM. Two recurrent neural network (RNN) language modeling tasks on the LM1B32k (Brants et al., 2007; Chelba et al., 2013) and wikipedia32k (Merity et al., 2016) datasets. |
| Dataset Splits | Yes | All the learned optimizers are meta-trained with the same setup on a fixed compute budget, i.e. given a fixed set of meta-training tasks, we study how far we can push the meta-generalization performance of learned optimizers. We take a set of four meta-training tasks proposed by Metz et al. (2022b) for all our experiments which contains four image-classification datasets including MNIST (LeCun & Cortes, 1998), Fashion-MNIST (Xiao et al., 2017), SVHN (Netzer et al., 2011), and CIFAR-10 (Krizhevsky et al., 2009). |
| Hardware Specification | Yes | With this setup, all our meta-training experiments finish in a day (<24 hours) on a single NVIDIA RTX 8000 GPU. ... We benchmark runtime on an NVIDIA V100 GPU with 10 seeds per optimizer. |
| Software Dependencies | No | We are also grateful to the developers of open-source libraries such as JAX (Bradbury et al., 2018), NumPy (Harris et al., 2020), google/learned_optimization, and google-research/rliable, which were instrumental in this research. For fair comparison, all the learned optimizers are meta-trained with exactly the same PES setup... We use Adam (Kingma & Ba, 2015) as the meta optimizer... |
| Experiment Setup | Yes | All the learned optimizers are meta-trained with the same setup on a fixed compute budget, i.e. given a fixed set of meta-training tasks... All the learned optimizers are meta-trained with truncated PES (Vicol et al., 2021) with maximum 2K inner unroll length for 100K meta-iterations using mean loss of the inner loop as the meta-objective... For task augmentation (3), we sample τ uniformly on log scale between 0.001 and 1000 in each meta-iteration. ... maximum inner unroll length 2000 with unroll lengths sampled logarithmically between 100 and 2000, standard deviation 0.01, truncation length 50 and mean inner training loss as the meta-objective. We use Adam (Kingma & Ba, 2015) as the meta optimizer and meta-train for 100K iterations on a single NVIDIA RTX 8000 GPU. ... sweeping over 3 seeds and 5 learning rates [3e-5, 5e-5, 1e-4, 3e-4, 1e-3]... batch size 64 and ReLU activations. |
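The Experiment Setup row describes two log-uniform draws per meta-iteration: the task-augmentation scale τ between 0.001 and 1000, and the inner unroll length between 100 and 2000. A minimal sketch of that sampling, assuming a plain-Python helper (the function name `sample_log_uniform` is hypothetical, not from the paper's code):

```python
import math
import random

def sample_log_uniform(low, high, rng=random):
    """Draw a value uniformly on a log10 scale between low and high.

    Matches the sampling described in the setup: tau in [1e-3, 1e3]
    for task augmentation, and unroll length in [100, 2000].
    (Hypothetical helper; the paper's actual code may differ.)
    """
    log_low, log_high = math.log10(low), math.log10(high)
    return 10 ** rng.uniform(log_low, log_high)

# One meta-iteration's draws, as described in the quoted setup:
tau = sample_log_uniform(1e-3, 1e3)            # task-augmentation scale
unroll = int(sample_log_uniform(100, 2000))    # inner unroll length
assert 1e-3 <= tau <= 1e3
assert 100 <= unroll <= 2000
```

Log-uniform sampling spreads draws evenly across orders of magnitude, so small and large τ values are equally likely, unlike a uniform draw over [0.001, 1000], which would almost never produce τ < 1.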