Learning compositional functions via multiplicative weight updates

Authors: Jeremy Bernstein, Jiawei Zhao, Markus Meister, Ming-Yu Liu, Anima Anandkumar, Yisong Yue

NeurIPS 2020

Each entry below gives a reproducibility variable, its assessed result, and the LLM response supporting that assessment.
Research Type: Experimental
    "In this section, we benchmark Madam (Algorithm 1) with weights represented in 32-bit floating point. In the next section, we shall benchmark B-bit Madam. The results in this section show that, across various tasks including image classification, language modelling and image generation, Madam without learning rate tuning is competitive with a tuned SGD or Adam. In Figure 1, we show the results of a learning rate grid search undertaken for Madam, SGD and Adam. The optimal learning rate setting for each benchmark is shown with a red cross. Notice that for Madam the optimal learning rate is in all cases 0.01, whereas for SGD and Adam it varies. In Table 1, we compare the final results using tuned learning rates for Adam and SGD, and using η = 0.01 for Madam."
Researcher Affiliation: Collaboration
    Jeremy Bernstein (Caltech / NVIDIA), Ming-Yu Liu (NVIDIA), Jiawei Zhao (NVIDIA), Anima Anandkumar (Caltech / NVIDIA), Markus Meister (Caltech)
Pseudocode: Yes
    "Algorithm 1 (Madam), a multiplicative adaptive-moments-based optimiser."
Open Source Code: Yes
    "The code for these experiments is to be found at https://github.com/jxbz/madam."
Open Datasets: Yes
    Datasets used: CIFAR-10, CIFAR-100, ImageNet, Wikitext-2. Cited in the paper as CIFAR-10 [44], ImageNet [46], Wikitext-2 [48].
Dataset Splits: No
    The paper mentions a 'validation' step within Algorithm 1 for the 'clamp' function, but it does not explicitly provide training/validation/test split details. It mentions 'epoch budgets lifted from the baseline implementations', which may imply standard splits were used, but no specific percentages or sample counts are given in the paper.
Hardware Specification: No
    The paper discusses hardware concepts related to low-precision arithmetic (bfloat16, TPUs, NVIDIA GPUs) but does not specify the exact hardware (e.g., GPU or CPU models, memory) used to run its own experiments.
Software Dependencies: No
    The paper mentions PyTorch [37] but does not specify its version, nor any other software dependencies with version numbers, which are necessary for full reproducibility.
Experiment Setup: Yes
    "Good default hyperparameters are: η = 0.01, η_max = 8η, σ_max = 3σ_init, β = 0.999. σ_init can be lifted from a standard initialisation. Numerical representation: initial weight scale σ_init; max weight σ_max. Optimisation parameters: typical perturbation η; max perturbation η_max; averaging constant β. Weight initialisation: initialise weights randomly on scale σ_init, for example: W ~ NORMAL(0, σ_init). In all 12-bit runs, a learning rate of η = 0.01 combined with a base precision of 0.001 could be relied upon to achieve stable learning."
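To make the quoted setup concrete, here is a minimal NumPy sketch of a Madam-style multiplicative step using the stated defaults (learning rate 0.01, averaging constant β = 0.999, max weight scale 3× the initial scale, weights drawn from Normal(0, σ_init)). The function names and the exact normalisation/clamping details are assumptions for illustration, not the paper's implementation; Algorithm 1 in the paper is authoritative.

```python
import numpy as np

# Illustrative sketch only: madam_step and its normalisation/clamping
# details are assumptions, not the paper's code. The quoted defaults
# are eta = 0.01, beta = 0.999, sigma_max = 3 * sigma_init.

def init_weights(shape, sigma_init, rng):
    """W ~ NORMAL(0, sigma_init), as in the quoted weight initialisation."""
    return rng.normal(0.0, sigma_init, size=shape)

def madam_step(w, g, v, eta=0.01, beta=0.999, sigma_max=None):
    """One multiplicative adaptive-moments step (sketch).

    v is a running second-moment estimate of the gradient. Each weight
    is multiplied by exp(-eta * sign(w) * normalised gradient), so the
    update perturbs the weight's scale rather than adding to it.
    """
    v = beta * v + (1.0 - beta) * g ** 2         # averaging constant beta
    g_norm = g / (np.sqrt(v) + 1e-12)            # normalised gradient
    w = w * np.exp(-eta * np.sign(w) * g_norm)   # multiplicative update
    if sigma_max is not None:                    # clamp to max weight scale
        w = np.clip(w, -sigma_max, sigma_max)
    return w, v

rng = np.random.default_rng(0)
sigma_init = 0.05
w = init_weights((3,), sigma_init, rng)
v = np.zeros_like(w)
g = np.copy(w)  # a gradient aligned with each weight's sign shrinks it
w_new, v = madam_step(w, g, v, sigma_max=3 * sigma_init)
```

Because each step multiplies by a strictly positive factor, weights never change sign; a gradient aligned with a weight's sign shrinks its magnitude, and the clamp keeps every weight within the σ_max = 3σ_init envelope.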