Learning compositional functions via multiplicative weight updates

Authors: Jeremy Bernstein, Jiawei Zhao, Markus Meister, Ming-Yu Liu, Anima Anandkumar, Yisong Yue

NeurIPS 2020

Each entry below gives a reproducibility variable, its assessed result, and the LLM response supporting that assessment.
Research Type: Experimental
    "In this section, we benchmark Madam (Algorithm 1) with weights represented in 32-bit floating point. In the next section, we shall benchmark B-bit Madam. The results in this section show that, across various tasks including image classification, language modelling and image generation, Madam without learning rate tuning is competitive with a tuned SGD or Adam. In Figure 1, we show the results of a learning rate grid search undertaken for Madam, SGD and Adam. The optimal learning rate setting for each benchmark is shown with a red cross. Notice that for Madam the optimal learning rate is in all cases 0.01, whereas for SGD and Adam it varies. In Table 1, we compare the final results using tuned learning rates for Adam and SGD, and using η = 0.01 for Madam."
Researcher Affiliation: Collaboration
    Jeremy Bernstein (Caltech / NVIDIA), Ming-Yu Liu (NVIDIA), Jiawei Zhao (NVIDIA), Anima Anandkumar (Caltech / NVIDIA), Markus Meister (Caltech)
Pseudocode: Yes
    "Algorithm 1 (Madam), a multiplicative adaptive-moments-based optimiser."
Open Source Code: Yes
    "The code for these experiments is to be found at https://github.com/jxbz/madam."
Open Datasets: Yes
    Datasets used: CIFAR-10, CIFAR-100, ImageNet, Wikitext-2. Cited in the paper as CIFAR-10 [44], ImageNet [46], Wikitext-2 [48].
Dataset Splits: No
    The paper mentions a 'validation' step within Algorithm 1 for the 'clamp' function, but it does not explicitly provide training/validation/test split details. It mentions 'epoch budgets lifted from the baseline implementations', which may imply standard splits were used, but no specific percentages or sample counts are given in the paper.
Hardware Specification: No
    The paper discusses hardware concepts related to low-precision arithmetic (bfloat16, TPUs, NVIDIA GPUs) but does not specify the exact hardware (e.g., GPU or CPU models, memory) used to run its own experiments.
Software Dependencies: No
    The paper mentions PyTorch [37] but does not specify its version, nor any other software dependencies with version numbers, which are necessary for full reproducibility.
Experiment Setup: Yes
    "Good default hyperparameters are: η = 0.01, η_max = 8η, σ_max = 3σ_init, β = 0.999. σ_init can be lifted from a standard initialisation. Numerical representation: initial weight scale σ_init; max weight σ_max. Optimisation parameters: typical perturbation η; max perturbation η_max; averaging constant β. Weight initialisation: initialise weights randomly on scale σ_init, for example: W ~ NORMAL(0, σ_init). In all 12-bit runs, a learning rate of η = 0.01 combined with a base precision of 0.001 could be relied upon to achieve stable learning."
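To make the quoted setup concrete, here is a minimal NumPy sketch of a Madam-style multiplicative step using the stated defaults (learning rate 0.01, averaging constant β = 0.999, max weight scale 3× the initial scale, weights drawn from Normal(0, σ_init)). The function names and the exact normalisation/clamping details are assumptions for illustration, not the paper's implementation; Algorithm 1 in the paper is authoritative.

```python
import numpy as np

# Illustrative sketch only: madam_step and its normalisation/clamping
# details are assumptions, not the paper's code. The quoted defaults
# are eta = 0.01, beta = 0.999, sigma_max = 3 * sigma_init.

def init_weights(shape, sigma_init, rng):
    """W ~ NORMAL(0, sigma_init), as in the quoted weight initialisation."""
    return rng.normal(0.0, sigma_init, size=shape)

def madam_step(w, g, v, eta=0.01, beta=0.999, sigma_max=None):
    """One multiplicative adaptive-moments step (sketch).

    v is a running second-moment estimate of the gradient. Each weight
    is multiplied by exp(-eta * sign(w) * normalised gradient), so the
    update perturbs the weight's scale rather than adding to it.
    """
    v = beta * v + (1.0 - beta) * g ** 2         # averaging constant beta
    g_norm = g / (np.sqrt(v) + 1e-12)            # normalised gradient
    w = w * np.exp(-eta * np.sign(w) * g_norm)   # multiplicative update
    if sigma_max is not None:                    # clamp to max weight scale
        w = np.clip(w, -sigma_max, sigma_max)
    return w, v

rng = np.random.default_rng(0)
sigma_init = 0.05
w = init_weights((3,), sigma_init, rng)
v = np.zeros_like(w)
g = np.copy(w)  # a gradient aligned with each weight's sign shrinks it
w_new, v = madam_step(w, g, v, sigma_max=3 * sigma_init)
```

Because each step multiplies by a strictly positive factor, weights never change sign; a gradient aligned with a weight's sign shrinks its magnitude, and the clamp keeps every weight within the σ_max = 3σ_init envelope.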