Momentum Centering and Asynchronous Update for Adaptive Gradient Methods

Authors: Juntang Zhuang, Yifan Ding, Tommy Tang, Nicha Dvornek, Sekhar C Tatikonda, James Duncan

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate ACProp in extensive empirical studies: ACProp outperforms both SGD and other adaptive optimizers in image classification with CNN, and outperforms well-tuned adaptive optimizers in the training of various GAN models, reinforcement learning and transformers. [...] We conducted experiments on the MNIST dataset using a 2-layer MLP. We plot the average value of v_t for uncentered-type and s_t for centered-type optimizers; as Fig. 6(a,b) shows, we observe s_t ≤ v_t and the centered-type (ACProp, AdaBelief) converges faster, validating our analysis for early phases. (A worked form of this relation is sketched after the table.)
Researcher Affiliation | Academia | Juntang Zhuang (1); Yifan Ding (2); Tommy Tang (3); Nicha Dvornek (1); Sekhar Tatikonda (1); James S. Duncan (1). (1) Yale University; (2) University of Central Florida; (3) University of Illinois at Urbana-Champaign
Pseudocode | Yes | Algorithm 1: AdaBelief — Initialize x_0, m_0 ← 0, s_0 ← 0, t ← 0; While x_t not converged: t ← t + 1; g_t ← ∇_x f_t(x_{t−1}); m_t ← β_1 m_{t−1} + (1 − β_1) g_t; s_t ← β_2 s_{t−1} + (1 − β_2)(g_t − m_t)^2; x_t ← x_{t−1} − α m_t / (√s_t + ε). Algorithm 2: ACProp — Initialize x_0, m_0 ← 0, s_0 ← 0, t ← 0; While x_t not converged: t ← t + 1; g_t ← ∇_x f_t(x_{t−1}); m_t ← β_1 m_{t−1} + (1 − β_1) g_t; x_t ← Π(x_{t−1} − α g_t / (√s_{t−1} + ε)); s_t ← β_2 s_{t−1} + (1 − β_2)(g_t − m_t)^2. (A runnable sketch of both update rules appears after the table.)
Open Source Code | Yes | We provide the implementation at https://github.com/juntang-zhuang/ACProp-Optimizer.
Open Datasets | Yes | We conducted experiments on the MNIST dataset using a 2-layer MLP. [...] We first conducted experiments on the CIFAR10 image classification task with a VGG-11 [31], ResNet34 [6] and DenseNet-121 [32]. [...] for ResNet18 on ImageNet [...] We evaluated different optimizers on reinforcement learning with a deep Q-network (DQN) [21] on the four-rooms task [33]. [...] We evaluated the performance of ACProp on neural machine translation tasks with a transformer model [20]. [...] We conducted experiments with Deep Convolutional GAN (DCGAN) [35], Spectral-Norm GAN (SNGAN) [36], Self-Attention GAN (SAGAN) [37] and Relativistic-GAN (RLGAN) [38]. We set β_1 = 0.5, and search for β_2 and ε with the same schedule as the previous section. We report the FID [39] on the CIFAR10 dataset in Table 4.
Dataset Splits | No | The paper uses standard public datasets and mentions training, but does not provide explicit details on how the datasets were split into training, validation, and test sets (e.g., percentages, specific split files, or reference to a standard split name for validation).
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments, such as particular GPU or CPU models, memory specifications, or cloud instance types.
Software Dependencies | No | The paper does not list specific software dependencies with version numbers (e.g., Python version, specific library versions like PyTorch or TensorFlow) that would be needed for replication.
Experiment Setup | Yes | We performed extensive hyperparameter tuning in order to better compare the performance of different optimizers: for SGD we set the momentum as 0.9, which is the default for many cases [6, 32], and search the learning rate between 0.1 and 10^-5 in the log-grid; for other adaptive optimizers, including AdaBelief, Adam, RAdam, AdamW and AdaShift, we search the learning rate between 0.01 and 10^-5 in the log-grid, and search ε between 10^-5 and 10^-10 in the log-grid. We use a weight decay of 5e-2 for AdamW, and use 5e-4 for other optimizers. [...] For all optimizers, we set the learning rate as 0.0002, and search for β_1 ∈ {0.9, 0.99, 0.999}, β_2 ∈ {0.98, 0.99, 0.999} and ε ∈ {10^-5, 10^-6, ..., 10^-16}. (A sketch of this log-grid enumeration appears after the table.)
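The s_t ≤ v_t observation quoted in the Research Type row follows from a standard moment decomposition, under the approximation (ours, not a claim from the paper) that the exponential moving averages track the population moments of the gradient:

    v_t \approx \mathbb{E}[g_t^2] = \operatorname{Var}[g_t] + (\mathbb{E}[g_t])^2,
    \qquad
    s_t \approx \mathbb{E}[(g_t - m_t)^2] \approx \operatorname{Var}[g_t],

so s_t ≤ v_t, with the gap largest in the early phase where the mean gradient E[g_t] is far from zero; the centered methods then take the larger effective step α/(√s_t + ε).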
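The asynchronous ordering in Algorithm 2 can be made concrete with a minimal PyTorch-style sketch of both update rules for a single parameter tensor. This is an illustrative reconstruction of the quoted pseudocode only: bias correction, weight decay, and the projection Π are omitted, and the function names are hypothetical rather than taken from the linked repository.

    import torch

    def adabelief_step(param, grad, m, s, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Synchronous (Algorithm 1): s_t is computed first and used immediately.
        m = beta1 * m + (1 - beta1) * grad               # first moment m_t
        s = beta2 * s + (1 - beta2) * (grad - m) ** 2    # centered second moment s_t
        param = param - lr * m / (s.sqrt() + eps)        # step preconditioned by s_t
        return param, m, s

    def acprop_step(param, grad, m, s, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # Asynchronous (Algorithm 2): the step is preconditioned by the previous
        # centered second moment s_{t-1}, which is what `s` still holds here.
        m = beta1 * m + (1 - beta1) * grad               # first moment m_t
        param = param - lr * grad / (s.sqrt() + eps)     # step preconditioned by s_{t-1}
        s = beta2 * s + (1 - beta2) * (grad - m) ** 2    # advance to s_t afterwards
        return param, m, s

With the pseudocode's initialization s_0 = 0, the very first step in either sketch is scaled by 1/ε; a practical implementation would add bias correction or a warm-up, which the sketch leaves out for brevity.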
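The log-grid search quoted in the Experiment Setup row can be enumerated as below; powers-of-ten spacing is an assumption, since the row only gives the endpoints and the phrase "log-grid".

    import itertools

    # Adaptive-optimizer grid from the quoted setup: learning rates from 1e-2
    # down to 1e-5, eps from 1e-5 down to 1e-10 (powers-of-ten spacing assumed).
    learning_rates = [10.0 ** -k for k in range(2, 6)]    # 1e-2, 1e-3, 1e-4, 1e-5
    epsilons       = [10.0 ** -k for k in range(5, 11)]   # 1e-5, ..., 1e-10

    grid = list(itertools.product(learning_rates, epsilons))
    print(f"{len(grid)} (lr, eps) configurations per adaptive optimizer")
    # Each configuration would then be trained once per optimizer
    # (AdaBelief, Adam, RAdam, AdamW, AdaShift) and compared on held-out accuracy.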