Stochastic Anderson Mixing for Nonconvex Stochastic Optimization
Authors: Fuchao Wei, Chenglong Bao, Yang Liu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Finally, we apply the SAM method to train various neural networks including the vanilla CNN, ResNets, WideResNet, ResNeXt, DenseNet and LSTM. Experimental results on image classification and language modeling demonstrate the advantages of our method. |
| Researcher Affiliation | Academia | Fuchao Wei (1), Chenglong Bao (3,4), Yang Liu (1,2); (1) Department of Computer Science and Technology, Tsinghua University; (2) Institute for AI Industry Research, Tsinghua University; (3) Yau Mathematical Sciences Center, Tsinghua University; (4) Yanqi Lake Beijing Institute of Mathematical Sciences and Applications |
| Pseudocode | Yes | Algorithm 1 Stochastic Anderson Mixing (SAM) (a generic Anderson mixing sketch follows the table) |
| Open Source Code | No | The paper does not include an unambiguous statement or a direct link to a source-code repository for the methodology described in this paper. |
| Open Datasets | Yes | The datasets were MNIST [32], CIFAR-10/CIFAR-100 [31] for image classification and Penn Treebank [35] for language modeling. |
| Dataset Splits | Yes | The training dataset was preprocessed by randomly selecting 12k images from the total 60k images to facilitate large mini-batch training. Neither weight decay nor dropout was used. [...] For CIFAR-10 and CIFAR-100, both datasets have 50K images for training and 10K images for test. [...] reported the perplexity on the validation set in Figure 3 and perplexity on the test set in Table 3 (a data-subset sketch follows the table) |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | Footnote 3 mentions "Based on the official PyTorch implementation https://github.com/pytorch/examples/blob/master/mnist." This mentions PyTorch but does not specify a version number. |
| Experiment Setup | Yes | The learning rate was tuned and fixed for each optimizer. The historical lengths for SdLBFGS, RAM and AdaSAM were set to 20. δ = 10⁻⁶ for RAM and c₁ = 10⁻⁴ for AdaSAM. [...] We trained 160 epochs with batch size of 128 and decayed the learning rate at the 80th and 120th epoch. For AdaSAM/RAM, αₖ and βₖ were decayed at the 80th and 120th epoch. (a schedule sketch follows the table) |
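
The Pseudocode row cites Algorithm 1 (SAM). For orientation, below is a minimal NumPy sketch of the classical Anderson mixing update that Algorithm 1 builds on; it omits SAM's damped projection and regularization steps, and the `alpha`/`beta` arguments only loosely mirror the paper's αₖ and βₖ. Function and variable names are ours, not the authors'.

```python
import numpy as np

def anderson_mixing_step(dX, dR, x_k, r_k, alpha=1.0, beta=1.0):
    """Generic Anderson mixing update (illustrative sketch only).

    dX, dR : lists holding the most recent iterate/residual differences
             (the history); each entry is a 1-D array shaped like x_k.
    x_k    : current iterate.
    r_k    : current residual; in stochastic optimization this is the
             negative stochastic gradient.
    alpha, beta : damping/mixing parameters (rough stand-ins for the
                  paper's alpha_k and beta_k).
    """
    if not dX:
        # Empty history: reduce to a plain damped (gradient) step.
        return x_k + beta * r_k
    X = np.stack(dX, axis=1)  # d x m matrix of iterate differences
    R = np.stack(dR, axis=1)  # d x m matrix of residual differences
    # Least-squares extrapolation coefficients: argmin_gamma ||r_k - R gamma||_2.
    gamma, *_ = np.linalg.lstsq(R, r_k, rcond=None)
    # Extrapolate through the history, then take a damped residual step.
    return x_k + beta * r_k - (alpha * X + beta * R) @ gamma
```

The caller maintains the history by appending x₍k+1₎ − xₖ and r₍k+1₎ − rₖ after each step and discarding the oldest columns once the window exceeds the chosen memory (the Experiment Setup row reports a historical length of 20).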
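
The Dataset Splits row mentions that the MNIST training set was subsampled to 12k of 60k images. A minimal torchvision sketch of such a random subset is shown below; the random seed and batch size are our choices, not values reported in the paper.

```python
import torch
from torchvision import datasets, transforms

# Full MNIST training set (60k images).
full_train = datasets.MNIST(root="./data", train=True, download=True,
                            transform=transforms.ToTensor())

# Randomly select 12k of the 60k training images, as described in the paper.
# The seed is our choice; the paper does not report one.
generator = torch.Generator().manual_seed(0)
indices = torch.randperm(len(full_train), generator=generator)[:12_000]
train_subset = torch.utils.data.Subset(full_train, indices.tolist())

# The batch size is illustrative; the paper's stated aim was to facilitate
# large mini-batch training on the reduced set.
loader = torch.utils.data.DataLoader(train_subset, batch_size=1024, shuffle=True)
```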
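
The Experiment Setup row describes 160 training epochs with batch size 128 and learning-rate decay at epochs 80 and 120. Since no code is released (Open Source Code: No), the sketch below uses plain SGD as a stand-in for SAM/AdaSAM and random tensors in place of a CIFAR DataLoader; the model, initial learning rate, and decay factor of 0.1 are assumptions for illustration only.

```python
import torch
from torch import nn, optim

# Stand-in model and data: any CIFAR-10 classifier and a DataLoader with
# batch_size=128 would take their place.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
images = torch.randn(128, 3, 32, 32)
labels = torch.randint(0, 10, (128,))
train_loader = [(images, labels)]  # replace with a real CIFAR-10 DataLoader

# SGD stands in for SAM/AdaSAM; the initial lr and the decay factor 0.1
# are illustrative, not values taken from the paper.
optimizer = optim.SGD(model.parameters(), lr=0.1)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[80, 120], gamma=0.1)
criterion = nn.CrossEntropyLoss()

for epoch in range(160):  # 160 epochs, as in the paper's CIFAR setup
    for x, y in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
    scheduler.step()  # decays the learning rate after epochs 80 and 120
```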