Learning compositional functions via multiplicative weight updates
Authors: Jeremy Bernstein, Jiawei Zhao, Markus Meister, Ming-Yu Liu, Anima Anandkumar, Yisong Yue
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we benchmark Madam (Algorithm 1) with weights represented in 32-bit floating point. In the next section, we shall benchmark B-bit Madam. The results in this section show that, across various tasks including image classification, language modeling and image generation, Madam without learning rate tuning is competitive with a tuned SGD or Adam. In Figure 1, we show the results of a learning rate grid search undertaken for Madam, SGD and Adam. The optimal learning rate setting for each benchmark is shown with a red cross. Notice that for Madam the optimal learning rate is in all cases 0.01, whereas for SGD and Adam it varies. In Table 1, we compare the final results using tuned learning rates for Adam and SGD and using η = 0.01 for Madam. |
| Researcher Affiliation | Collaboration | Jeremy Bernstein (Caltech / NVIDIA), Ming-Yu Liu (NVIDIA), Jiawei Zhao (NVIDIA), Anima Anandkumar (Caltech / NVIDIA), Markus Meister (Caltech) |
| Pseudocode | Yes | Algorithm 1: Madam, a multiplicative adaptive moments based optimiser. |
| Open Source Code | Yes | The code for these experiments is to be found at https://github.com/jxbz/madam. |
| Open Datasets | Yes | CIFAR-10, CIFAR-100, ImageNet, and Wikitext-2; cited in the paper as CIFAR-10 [44], ImageNet [46], and Wikitext-2 [48]. |
| Dataset Splits | No | The paper mentions using a 'validation' step within Algorithm 1 for the 'clamp' function, but it does not explicitly provide details about training/validation/test dataset splits. It mentions 'epoch budgets lifted from the baseline implementations', which might imply standard splits were used, but no specific percentages or sample counts are given within the paper. |
| Hardware Specification | No | The paper discusses hardware concepts related to low-precision arithmetic (bfloat16, TPUs, NVIDIA GPUs) but does not specify the exact hardware (e.g., GPU models, CPU models, memory) used to run *their own experiments*. |
| Software Dependencies | No | The paper mentions PyTorch [37] but does not specify its version number or any other software dependencies with version numbers, which are necessary for full reproducibility. |
| Experiment Setup | Yes | Good default hyperparameters are: η = 0.01, η_max = 8η, σ_max = 3σ_init, β = 0.999. σ_init can be lifted from a standard initialisation. Numerical representation: initial weight scale σ_init; max weight σ_max. Optimisation parameters: typical perturbation η; max perturbation η_max; averaging constant β. Weight initialisation: initialise weights randomly on scale σ_init, for example: W ∼ Normal(0, σ_init). In all 12-bit runs, a learning rate of η = 0.01 combined with a base precision η₀ = 0.001 could be relied upon to achieve stable learning. |
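
The defaults quoted in the Experiment Setup row describe a multiplicative rule: each weight is scaled by an exponential factor driven by a normalised gradient, then clamped so its magnitude stays within a few times its initial scale. Below is a minimal PyTorch-style sketch of an optimiser of that form, built around the quoted hyperparameters (η = 0.01, β = 0.999, σ_max = 3σ_init). It is an illustration, not the authors' reference implementation, which lives in the linked repository; the exact normalisation and clamping details there may differ.

```python
import torch


class MadamSketch(torch.optim.Optimizer):
    """Illustrative multiplicative optimiser in the spirit of Madam (a sketch,
    not the reference implementation at https://github.com/jxbz/madam).

    Defaults mirror the quoted setup: typical perturbation (lr) 0.01,
    averaging constant beta = 0.999, and weight magnitudes clamped to
    3x their initial root-mean-square scale (sigma_max = 3 * sigma_init).
    """

    def __init__(self, params, lr=0.01, beta=0.999, scale_factor=3.0):
        defaults = dict(lr=lr, beta=beta, scale_factor=scale_factor)
        super().__init__(params, defaults)

    @torch.no_grad()
    def step(self, closure=None):
        loss = None
        if closure is not None:
            with torch.enable_grad():
                loss = closure()
        for group in self.param_groups:
            lr, beta, scale = group["lr"], group["beta"], group["scale_factor"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if len(state) == 0:
                    # Record the initial weight scale and start the running
                    # average of squared gradients.
                    state["exp_avg_sq"] = torch.zeros_like(p)
                    state["max_weight"] = scale * float(p.pow(2).mean().sqrt())
                exp_avg_sq = state["exp_avg_sq"]
                exp_avg_sq.mul_(beta).addcmul_(p.grad, p.grad, value=1 - beta)
                # Normalised gradient drives a relative (multiplicative) step.
                g_norm = p.grad / exp_avg_sq.sqrt().clamp(min=1e-12)
                p.mul_(torch.exp(-lr * torch.sign(p) * g_norm))
                # Keep weight magnitudes inside the allowed dynamic range.
                # Note: weights initialised at exactly zero never move under a
                # purely multiplicative update.
                if state["max_weight"] > 0:
                    p.clamp_(min=-state["max_weight"], max=state["max_weight"])
        return loss
```

Here the `lr` argument plays the role of the typical perturbation η, so the quoted default corresponds to `lr=0.01`.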
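
The Research Type row also quotes a learning-rate grid search over Madam, SGD and Adam (the paper's Figure 1). A sketch of such a sweep is below; `build_model` and `train_and_evaluate` are hypothetical stand-ins for the per-benchmark training and evaluation loops, the grid values are illustrative rather than the paper's exact grid, and `MadamSketch` refers to the class in the previous snippet.

```python
import torch


def sweep_learning_rates(optimiser_name, lr_grid, build_model, train_and_evaluate):
    """Return the best learning rate and its score for one optimiser.

    build_model() returns a fresh network; train_and_evaluate() trains it with
    the given optimiser and returns a validation score (higher is better here).
    Both are hypothetical placeholders for the paper's per-benchmark pipelines.
    """
    scores = {}
    for lr in lr_grid:
        model = build_model()
        if optimiser_name == "sgd":
            opt = torch.optim.SGD(model.parameters(), lr=lr)
        elif optimiser_name == "adam":
            opt = torch.optim.Adam(model.parameters(), lr=lr)
        else:  # "madam": e.g. MadamSketch above, or the authors' released optimiser
            opt = MadamSketch(model.parameters(), lr=lr)
        scores[lr] = train_and_evaluate(model, opt)
    best_lr = max(scores, key=scores.get)
    return best_lr, scores[best_lr]


# An illustrative logarithmic grid; the paper reports that Madam's optimum lands
# at 0.01 on every benchmark, while the optima for SGD and Adam vary.
lr_grid = [10.0 ** k for k in range(-4, 1)]
```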