Scalable Optimization in the Modular Norm
Authors: Tim Large, Yang Liu, Minyoung Huh, Hyojin Bahng, Phillip Isola, Jeremy Bernstein
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments aimed to test the scalability of training with normed versions of Adam and SGD: whether one can tune the learning rate on a small model, and expect the learning rate to remain close to optimal on models of much larger width and depth. |
| Researcher Affiliation | Collaboration | Tim Large (Columbia University), Yang Liu (Lawrence Livermore National Lab), Minyoung Huh (MIT CSAIL), Hyojin Bahng (MIT CSAIL), Phillip Isola (MIT CSAIL), Jeremy Bernstein (MIT CSAIL) |
| Pseudocode | Yes | In pseudo-code and actual Modula code this amounts to: `delta_w = optim(w.grad())` (get update from base optimizer); `net.normalize(delta_w)` (normalize update in the modular norm); `w -= eta(step) * delta_w` (apply update with learning rate eta). A runnable sketch of this loop follows the table. |
| Open Source Code | Yes | We have created a Python package called Modula that automatically normalizes weight updates in the modular norm of the architecture. The package is available via pip install modula with source code here. |
| Open Datasets | Yes | All experiments with ResMLP and ResNet [45] are done with the CIFAR-10 [46] image dataset with standard train and test splits. For the GPT [43] transformer experiments, we compared three different datasets: (a) The Shakespeare corpus, using character-level tokens [47]; (b) The TinyStories database [48] using sub-word level tokenization; (c) OpenWebText using sub-word level tokenization [49]. |
| Dataset Splits | Yes | All experiments with ResMLP and ResNet [45] are done with the CIFAR-10 [46] image dataset with standard train and test splits. |
| Hardware Specification | Yes | All experiments were run on NVIDIA GPUs using float32-precision. We used a combination of TITAN-RTX, RTX-3090, V100, Ada6000, and H100 devices. |
| Software Dependencies | No | The paper mentions PyTorch and its own Modula package, but does not specify version numbers for these or other ancillary software components needed for reproduction. |
| Experiment Setup | Yes | All SGD experiments were done with momentum β = 0.9, and all Adam experiments used β1 = 0.9 and β2 = 0.99. No weight decay was used in any experiment. Every experiment was done with a linear decay learning rate schedule. (See the optimizer configuration sketch after the table.) |
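The normalized-update rule quoted in the Pseudocode row can be sketched end to end. The code below is not the Modula package's implementation: it substitutes plain SGD-with-momentum for the base optimizer and a simple per-tensor RMS rescaling for `net.normalize` (which in the real package normalizes in the architecture's modular norm), and the toy model, sizes, and hyperparameters are illustrative assumptions.

```python
import torch

# Toy model standing in for a Modula-built network (illustrative only).
model = torch.nn.Sequential(
    torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10)
)
data, target = torch.randn(8, 32), torch.randint(0, 10, (8,))

params = list(model.parameters())
bufs = [torch.zeros_like(p) for p in params]  # momentum buffers
beta, total_steps = 0.9, 100

def eta(step, base_lr=0.1):
    # Linear-decay learning-rate schedule, as used in the paper's experiments.
    return base_lr * (1.0 - step / total_steps)

for step in range(total_steps):
    loss = torch.nn.functional.cross_entropy(model(data), target)
    model.zero_grad()
    loss.backward()

    with torch.no_grad():
        for p, buf in zip(params, bufs):
            # Base optimizer step: delta_w = optim(w.grad()), here SGD momentum.
            buf.mul_(beta).add_(p.grad)
            delta_w = buf.clone()

            # Stand-in for net.normalize(delta_w): rescale each tensor's update
            # to unit RMS norm (the real package uses the modular norm instead).
            delta_w /= delta_w.square().mean().sqrt().clamp_min(1e-12)

            # Apply the update with the scheduled learning rate:
            # w -= eta(step) * delta_w.
            p.sub_(eta(step) * delta_w)
```

The structural point illustrated here is the order of operations: the base optimizer produces a raw update, the update is renormalized as a function of the architecture, and only then is the learning rate applied.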
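For concreteness, the reported optimizer settings in the Experiment Setup row map directly onto standard PyTorch constructors, as sketched below. The learning rates, parameter tensor, and step count are placeholders rather than values from the paper, and in the actual experiments these base optimizers feed into Modula's normalization step rather than updating weights directly.

```python
import torch

# Placeholder parameters; the paper's experiments use ResMLP/ResNet and GPT models.
params = [torch.nn.Parameter(torch.randn(64, 64))]
total_steps = 10_000  # illustrative step count

# SGD runs: momentum 0.9, no weight decay.
sgd = torch.optim.SGD(params, lr=0.5, momentum=0.9, weight_decay=0.0)

# Adam runs: beta1 = 0.9, beta2 = 0.99, no weight decay.
adam = torch.optim.Adam(params, lr=3e-4, betas=(0.9, 0.99), weight_decay=0.0)

# Linear-decay learning-rate schedule, applied in every experiment.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    adam, lr_lambda=lambda step: 1.0 - step / total_steps
)
```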