Stochastic Anderson Mixing for Nonconvex Stochastic Optimization

Authors: Fuchao Wei, Chenglong Bao, Yang Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Finally, we apply the SAM method to train various neural networks including the vanilla CNN, ResNets, WideResNet, ResNeXt, DenseNet and LSTM. Experimental results on image classification and language model demonstrate the advantages of our method.
Researcher Affiliation | Academia | Fuchao Wei (1), Chenglong Bao (3,4), Yang Liu (1,2); 1: Department of Computer Science and Technology, Tsinghua University; 2: Institute for AI Industry Research, Tsinghua University; 3: Yau Mathematical Sciences Center, Tsinghua University; 4: Yanqi Lake Beijing Institute of Mathematical Sciences and Applications
Pseudocode | Yes | Algorithm 1: Stochastic Anderson Mixing (SAM). (A generic Anderson-mixing update is sketched in code after the table.)
Open Source Code | No | The paper does not include an unambiguous statement or a direct link to a source-code repository for the methodology described in this paper.
Open Datasets | Yes | The datasets were MNIST [32], CIFAR-10/CIFAR-100 [31] for image classification and Penn Treebank [35] for language model.
Dataset Splits | Yes | The training dataset was preprocessed by randomly selecting 12k images from the total 60k images to facilitate large mini-batch training. Neither weight decay nor dropout was used. [...] For CIFAR-10 and CIFAR-100, both datasets have 50K images for training and 10K images for test. [...] reported the perplexity on the validation set in Figure 3 and the perplexity on the test set in Table 3. (A subset-selection sketch for the 12k MNIST split follows the table.)
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments.
Software Dependencies | No | Footnote 3 mentions "Based on the official PyTorch implementation https://github.com/pytorch/examples/blob/master/mnist." This names PyTorch but does not specify a version number.
Experiment Setup | Yes | The learning rate was tuned and fixed for each optimizer. The history lengths for SdLBFGS, RAM and AdaSAM were set to 20; δ = 10^-6 for RAM and c1 = 10^-4 for AdaSAM. [...] We trained 160 epochs with a batch size of 128 and decayed the learning rate at the 80th and 120th epoch. For AdaSAM/RAM, αk and βk were decayed at the 80th and 120th epoch. (A PyTorch sketch of this schedule follows the table.)
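
The Pseudocode row points to Algorithm 1 (SAM). As a rough reference only, the NumPy sketch below implements a generic damped, regularized Anderson-mixing step driven by a stochastic gradient. The function name, the history matrices dX/dR, the damping parameter alpha, the mixing parameter beta, and the regularization delta are assumptions chosen to mirror the quantities mentioned in the table (history length, αk, βk, δ); it is not a transcription of the authors' exact Algorithm 1.

```python
# A minimal sketch of one stochastic Anderson-mixing-style step, written from
# the generic (regularized, damped) Anderson mixing update. Not the authors'
# exact Algorithm 1; all parameter names and defaults are illustrative.
import numpy as np

def anderson_mixing_step(x, grad, dX, dR, alpha=1.0, beta=0.1, delta=1e-6):
    """One Anderson-mixing-style step.

    x     : current iterate, shape (n,)
    grad  : stochastic gradient at x, shape (n,)
    dX    : recent iterate differences, shape (n, m) (m may be 0)
    dR    : matching residual differences, shape (n, m)
    alpha : damping on the extrapolation (history) term
    beta  : mixing parameter on the residual term
    delta : Tikhonov regularization for the least-squares coefficients,
            as in regularized Anderson mixing variants
    """
    r = -grad                      # residual of the gradient fixed-point map
    m = dX.shape[1]
    if m == 0:                     # no history yet: plain gradient-like step
        return x + beta * r
    # Regularized least squares: Gamma = argmin ||r - dR @ Gamma||^2 + delta*||Gamma||^2
    A = dR.T @ dR + delta * np.eye(m)
    gamma = np.linalg.solve(A, dR.T @ r)
    # Damped Anderson mixing update
    return x + beta * r - (alpha * dX + beta * dR) @ gamma
```

In use, dX and dR would be maintained by appending the latest differences x_k - x_{k-1} and r_k - r_{k-1} after each step and keeping at most a fixed number of columns, e.g. 20, matching the history length reported in the Experiment Setup row.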
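For the Dataset Splits row, the following torchvision sketch shows one plausible way to realize the reported MNIST preprocessing (randomly keeping 12k of the 60k training images). The data root, transform, and seed are illustrative assumptions, not details taken from the paper.

```python
# Sketch: random 12k-image subset of the 60k MNIST training images.
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

full_train = datasets.MNIST("data", train=True, download=True,
                            transform=transforms.ToTensor())
test_set = datasets.MNIST("data", train=False, download=True,
                          transform=transforms.ToTensor())

torch.manual_seed(0)                                    # assumed seed
subset_idx = torch.randperm(len(full_train))[:12_000].tolist()
train_set = Subset(full_train, subset_idx)              # 12k-image training subset
print(len(train_set), len(test_set))                    # 12000 10000
```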
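For the Experiment Setup row, the following PyTorch sketch makes the reported schedule concrete: 160 epochs, batch size 128, and learning-rate decay at the 80th and 120th epochs via MultiStepLR. The stand-in linear model, the random tensors, the base learning rate 0.1, and the decay factor 0.1 are assumptions; the table only states that the rate was tuned per optimizer and then decayed.

```python
# Sketch of the reported training schedule; model and data are stand-ins so it runs.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(32 * 32 * 3, 10)                       # placeholder model
data = TensorDataset(torch.randn(1024, 32 * 32 * 3),     # placeholder "images"
                     torch.randint(0, 10, (1024,)))      # placeholder labels
loader = DataLoader(data, batch_size=128, shuffle=True)  # batch size 128

# Base lr 0.1 and gamma 0.1 are assumptions; decay milestones follow the table.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[80, 120], gamma=0.1)

for epoch in range(160):                                 # 160 training epochs
    for inputs, targets in loader:
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()                                     # decay at epochs 80 and 120
```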