AdaShift: Decorrelation and Convergence of Adaptive Learning Rate Methods

Authors: Zhiming Zhou*, Qingru Zhang*, Guansong Lu, Hongwei Wang, Weinan Zhang, Yong Yu

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we empirically study the proposed method and compare it with Adam, AMSGrad and SGD on various tasks, in terms of training performance and generalization.
Researcher Affiliation | Academia | Zhiming Zhou, Qingru Zhang, Guansong Lu, Hongwei Wang, Weinan Zhang, Yong Yu (Shanghai Jiao Tong University)
Pseudocode | Yes | Algorithm 1 AdaShift: Temporal Shifting with Block-wise Spatial Operation; Algorithm 2 AdaShift: We use a first-in-first-out queue Q to denote the averaging window with the length of n. (A minimal sketch of this FIFO-queue update follows the table.)
Open Source Code | Yes | The anonymous code is provided at http://bit.ly/2NDXX6x.
Open Datasets | Yes | We further compare the proposed method with Adam, AMSGrad and SGD by using Logistic Regression and Multilayer Perceptron on MNIST... We test our algorithm with ResNet and DenseNet on the CIFAR-10 dataset... We further increase the complexity of the dataset, switching from CIFAR-10 to Tiny-ImageNet.
Dataset Splits | No | The paper does not explicitly state the train/validation/test splits for the datasets used (e.g., MNIST, CIFAR-10, Tiny-ImageNet), although standard splits are commonly implied for these benchmarks.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, or memory) used for its experiments; it only mentions a TensorFlow implementation in the context of the provided code, without naming any compute resources.
Software Dependencies | No | The paper mentions a TensorFlow implementation but does not specify its version or any other software dependencies with version numbers.
Experiment Setup | Yes | Here, we list the hyper-parameter settings of all the above experiments. Table 5 (hyper-parameter setting of logistic regression in Figure 2; used in the toy example after the sketch below):
    Optimizer     | learning rate | β1  | β2    | n
    SGD           | 0.1           | N/A | N/A   | N/A
    Adam          | 0.001         | 0   | 0.999 | N/A
    AMSGrad       | 0.001         | 0   | 0.999 | N/A
    non-AdaShift  | 0.001         | 0   | 0.999 | 1
    max-AdaShift  | 0.01          | 0   | 0.999 | 1