SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

Authors: Zaiyi Chen, Yi Xu, Enhong Chen, Tianbao Yang

ICML 2018

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on large-scale data sets demonstrate the efficiency of the proposed algorithms in comparison with several variants of ADAGRAD and stochastic gradient method.
Researcher Affiliation | Academia | University of Science and Technology of China, China; The University of Iowa, USA.
Pseudocode | Yes | Algorithm 1: ADAGRAD(w0, η, λ, ϵ, ϵ0); Algorithm 2: SADAGRAD(w0, θ, λ, ϵ, ϵ0); Algorithm 3: ADAGRAD-PROX(w0, η, λ, ϵ, ϵ0); Algorithm 4: SADAGRAD-PROX(w0, θ, λ, ϵ, ϵ0); Algorithm 5: rSADAGRAD(w0, θ, λ1, ϵ, ϵ0, τ). (A minimal restarted-ADAGRAD sketch appears after the table.)
Open Source Code | No | The paper does not provide any concrete access information (e.g., a link or an explicit statement of code release) for the source code.
Open Datasets | Yes | The experiments are performed on four data sets from the libsvm (Chang & Lin, 2011) website with different scales of instances and features, namely covtype, epsilon, rcv1, and news20. The statistics of these data sets are shown in Table 1. (A data-loading sketch appears after the table.)
Dataset Splits | No | The paper refers to 'training data examples' but does not specify explicit dataset splits (e.g., percentages or counts) for training, validation, or test sets.
Hardware Specification | No | The paper does not provide any specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions several algorithms and tools (e.g., ADAGRAD, ADAM, RASSG, libsvm) but does not provide version numbers for the software dependencies used in the implementation.
Experiment Setup | Yes | The step size of ADAM is tuned in 10^[-2:2], and other parameters are chosen as recommended in its paper. For SC-ADAGRAD, the parameters α and ξ1 in their paper are tuned in 10^[-4:2] and [0.1, 1], respectively. Based on the analysis in the previous sections, the step size parameter θ would influence the convergence speed of both ADAGRAD and SADAGRAD, so we tuned this parameter for both ADAGRAD and SADAGRAD on each data set. We run ADAGRAD for a number of iterations (i.e., 5,000) on each data set and set θ = sqrt( 2 (γ + max_i ||g_{1:5000,i}||_2) / Σ_{i=1}^{d} ||g_{1:5000,i}||_2 ). Besides, we set λ1 = 100λ for solving (9), λ1 = 100ζ for solving (8), and τ = 1 for rSADAGRAD and rSADAGRAD-PROX. (A sketch of the θ computation appears after the table.)
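The pseudocode row lists a plain ADAGRAD solver and its restarted, strongly adaptive variants. The exact stage lengths, averaging scheme, and use of λ in the paper's Algorithms 1-5 are not reproduced here; the following is a minimal NumPy sketch, assuming a diagonal ADAGRAD inner loop wrapped by an epoch-doubling restart rule with an illustrative stage step size eta_k = theta * sqrt(eps_k). It shows the restarting idea only, not the authors' algorithms.

import numpy as np

def adagrad_stage(w0, grad_fn, eta, eps0, num_iters):
    """One diagonal-ADAGRAD stage.

    grad_fn(w) is assumed to return a stochastic (sub)gradient at w;
    the averaged iterate is returned, as restart-based analyses typically use.
    """
    w = w0.copy()
    g_sq = np.full_like(w, eps0)        # per-coordinate accumulator; eps0 avoids division by zero
    w_sum = np.zeros_like(w)
    for _ in range(num_iters):
        g = grad_fn(w)
        g_sq += g * g
        w = w - eta * g / np.sqrt(g_sq)  # coordinate-wise adaptive step
        w_sum += w
    return w_sum / num_iters

def sadagrad_like(w0, grad_fn, theta, eps, eps0, num_stages, t1=100):
    """Restarted ADAGRAD in the spirit of SADAGRAD (illustrative assumptions):
    each stage restarts ADAGRAD from the previous stage's output, halves the
    target error eps_k, doubles the stage length, and sets eta_k = theta * sqrt(eps_k).
    """
    w, eps_k, t_k = np.asarray(w0, dtype=float).copy(), eps, t1
    for _ in range(num_stages):
        eta_k = theta * np.sqrt(eps_k)
        w = adagrad_stage(w, grad_fn, eta_k, eps0, t_k)
        eps_k /= 2.0
        t_k *= 2
    return w

# Toy usage: noisy gradient of the strongly convex objective 0.5 * ||w - 1||^2
rng = np.random.default_rng(0)
grad = lambda w: (w - 1.0) + 0.1 * rng.normal(size=w.shape)
w_hat = sadagrad_like(np.zeros(5), grad, theta=1.0, eps=1.0, eps0=1e-8, num_stages=4)
print(w_hat)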
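The four data sets (covtype, epsilon, rcv1, news20) are distributed in libsvm's sparse text format. The paper does not say how the data were loaded; one common option is scikit-learn's load_svmlight_file, sketched below with a hypothetical local file name.

from sklearn.datasets import load_svmlight_file

# "covtype.libsvm.binary" is a hypothetical local file name; download the
# data set from the libsvm website first and point the path at it.
X, y = load_svmlight_file("covtype.libsvm.binary")
print(X.shape, y.shape)   # sparse CSR feature matrix and dense label vector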
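A minimal sketch of how θ could be computed from the 5,000-iteration ADAGRAD pre-run, assuming g_history stacks the observed stochastic gradients row-wise and γ is a small positive constant; the helper compute_theta is hypothetical and only mirrors the expression quoted in the experiment-setup row, not the authors' code.

import numpy as np

def compute_theta(g_history, gamma):
    """g_history: array of shape (T, d) holding the T stochastic gradients
    observed during the ADAGRAD pre-run (here T = 5,000); gamma > 0 is small.
    Returns sqrt(2 * (gamma + max_i ||g_{1:T,i}||_2) / sum_{i=1}^d ||g_{1:T,i}||_2).
    """
    col_norms = np.linalg.norm(g_history, axis=0)   # ||g_{1:T,i}||_2 for each coordinate i
    return np.sqrt(2.0 * (gamma + col_norms.max()) / col_norms.sum())

# Stand-in for the gradients of a real 5,000-iteration pre-run
rng = np.random.default_rng(0)
theta = compute_theta(rng.normal(size=(5000, 20)), gamma=1e-8)
print(theta)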