Scalable Optimization in the Modular Norm

Authors: Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, Jeremy Bernstein

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments aimed to test the scalability of training with normed versions of Adam and SGD: whether one can tune the learning rate on a small model and expect the learning rate to remain close to optimal on models of much larger width and depth.
Researcher Affiliation | Collaboration | Tim Large (Columbia University), Yang Liu (Lawrence Livermore National Lab), Minyoung Huh (MIT CSAIL), Hyojin Bahng (MIT CSAIL), Phillip Isola (MIT CSAIL), Jeremy Bernstein (MIT CSAIL)
Pseudocode | Yes | In pseudo-code and actual Modula code this amounts to the following (a runnable sketch of this update appears after the table):
    delta_w = optim(w.grad())   # get update from base optimizer
    net.normalize(delta_w)      # normalize update in the modular norm
    w -= eta(step) * delta_w    # apply update with learning rate eta
Open Source Code | Yes | We have created a Python package called Modula that automatically normalizes weight updates in the modular norm of the architecture. The package is available via pip install modula with source code here.
Open Datasets | Yes | All experiments with ResMLP and ResNet [45] are done with the CIFAR-10 [46] image dataset with standard train and test splits. For the GPT [43] transformer experiments, we compared three different datasets: (a) the Shakespeare corpus, using character-level tokens [47]; (b) the TinyStories database [48], using sub-word level tokenization; (c) OpenWebText, using sub-word level tokenization [49].
Dataset Splits | Yes | All experiments with ResMLP and ResNet [45] are done with the CIFAR-10 [46] image dataset with standard train and test splits. (A data-loading sketch appears after the table.)
Hardware Specification | Yes | All experiments were run on NVIDIA GPUs using float32 precision. We used a combination of TITAN-RTX, RTX-3090, V100, Ada6000, and H100 devices.
Software Dependencies | No | The paper mentions PyTorch and its own Modula package, but does not specify version numbers for these or other ancillary software components needed for reproduction.
Experiment Setup | Yes | All SGD experiments were done with momentum β = 0.9, and all Adam experiments used β1 = 0.9 and β2 = 0.99. No weight decay was used in any experiment. Every experiment was done with a linear decay learning rate schedule. (An optimizer-configuration sketch appears after the table.)
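
The update rule quoted in the Pseudocode row can be written out as a concrete training step. The sketch below is an illustration only: it uses heavy-ball SGD as the base optimizer and a simple per-tensor norm as a stand-in for the architecture-derived modular norm that the Modula package actually computes. The names linear_decay and normalized_step, and the values of eta0 and total_steps, are hypothetical placeholders, not the authors' code.

    # Illustrative only: per-tensor normalization stands in for the modular norm;
    # the real package rescales each module's update according to the architecture.
    import torch

    def linear_decay(eta0, step, total_steps):
        # eta(step): linear decay schedule, as used in all of the paper's experiments
        return eta0 * max(0.0, 1.0 - step / total_steps)

    @torch.no_grad()
    def normalized_step(params, momentum_buffers, step, eta0=0.1,
                        total_steps=10_000, beta=0.9):
        for w, buf in zip(params, momentum_buffers):
            if w.grad is None:
                continue
            # delta_w = optim(w.grad()): heavy-ball momentum as the base optimizer
            buf.mul_(beta).add_(w.grad)
            delta_w = buf.clone()
            # net.normalize(delta_w): stand-in normalization by a per-tensor norm
            delta_w.div_(delta_w.norm() + 1e-12)
            # w -= eta(step) * delta_w: apply update with the scheduled learning rate
            w.sub_(linear_decay(eta0, step, total_steps) * delta_w)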
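
Since the CIFAR-10 experiments use the standard train and test splits, a typical way to obtain them is via torchvision, as sketched below. The paper does not publish its data pipeline, so the transform, normalization statistics, and batch sizes here are assumptions rather than the authors' settings.

    # Illustrative loading of the standard CIFAR-10 train/test splits.
    import torch
    from torchvision import datasets, transforms

    transform = transforms.Compose([
        transforms.ToTensor(),
        # Commonly used CIFAR-10 channel statistics (assumed, not from the paper)
        transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
    ])

    train_set = datasets.CIFAR10(root="data", train=True, download=True, transform=transform)
    test_set = datasets.CIFAR10(root="data", train=False, download=True, transform=transform)

    train_loader = torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
    test_loader = torch.utils.data.DataLoader(test_set, batch_size=256, shuffle=False)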
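
The Experiment Setup row fixes only the base-optimizer hyperparameters. A minimal sketch of that configuration, assuming PyTorch's built-in SGD and Adam and a LambdaLR scheduler for the linear decay, is given below; the base learning rate and total_steps are placeholders, and the modular-norm normalization from the first sketch would be applied on top of these updates.

    # Illustrative base-optimizer settings: SGD momentum 0.9, Adam betas (0.9, 0.99),
    # no weight decay, and a linear-decay learning rate schedule.
    import torch

    def make_optimizer(params, kind="adam", lr=3e-4, total_steps=10_000):
        if kind == "sgd":
            opt = torch.optim.SGD(params, lr=lr, momentum=0.9, weight_decay=0.0)
        else:
            opt = torch.optim.Adam(params, lr=lr, betas=(0.9, 0.99), weight_decay=0.0)
        # Linear decay from lr down to zero over the course of training
        sched = torch.optim.lr_scheduler.LambdaLR(
            opt, lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))
        return opt, sched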