SGD and Hogwild! Convergence Without the Bounded Gradients Assumption
Authors: Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Scheinberg, Martin Takac
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Numerical Experiments: For our numerical experiments, we consider the finite sum minimization problem in (2). We conducted experiments on a single core for Algorithm 2 on two popular datasets ijcnn1 (n = 91,701 training data) and covtype (n = 406,709 training data) from the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Since we are interested in the expected convergence rate with respect to the number of iterations, respectively the number of single position vector updates, we do not need a parallelized multi-core simulation to confirm our analysis. The impact of efficient resource scheduling over multiple cores leads to a performance improvement complementary to our analysis of (10) (which, as discussed, lends itself to an efficient parallelized implementation). We experimented with 10 runs and reported the average results. We choose the step size based on Theorem 5, i.e., η_t = 4/(µ(t+E)) and E = max{2τ, 16LD/µ}. For each fraction v ∈ {1, 3/4, 2/3, 1/2, 1/3, 1/4} we performed the following experiment: In Algorithm 2 we choose each filter matrix S^{ξ_t}_{u_t} to correspond with a random subset of size v|D_{ξ_t}| of the non-zero positions of D_{ξ_t} (i.e., the support of the gradient corresponding to ξ_t). In addition we use τ = 10. For the two datasets, Figures 1 and 3 plot the training loss for each fraction with τ = 10. The top plots have t, the number of coordinate updates, for the horizontal axis. The bottom plots have the number of epochs, each epoch counting n iterations, for the horizontal axis. The results show that each fraction exhibits a sublinear expected convergence rate of O(1/t); the smaller fractions exhibit larger deviations but do seem to converge faster to the minimum solution. In Figures 2 and 4, we show experiments with different values of τ ∈ {1, 10, 100} where we use the whole non-zero set of gradient positions (i.e., v = 1) for the update. Our analysis states that, for t = 50 epochs times n iterations per epoch, τ can be as large as √(t/L(t)) = 524 for ijcnn1 and 1058 for covtype. The experiments indeed show that τ ≤ 100 has little effect on the expected convergence rate. |
| Researcher Affiliation | Collaboration | 1Department of Industrial and Systems Engineering, Lehigh University, USA. 2IBM Thomas J. Watson Research Center, USA. 3Department of Electrical and Computer Engineering, University of Connecticut, USA. 4KAUST, KSA; Edinburgh, UK; MIPT, Russia. |
| Pseudocode | Yes | Algorithm 1 Stochastic Gradient Descent (SGD) Method. Initialize: w_0. Iterate: for t = 0, 1, 2, . . . do: Choose a step size (i.e., learning rate) η_t > 0. Generate a random variable ξ_t. Compute a stochastic gradient ∇f(w_t; ξ_t). Update the new iterate w_{t+1} = w_t − η_t ∇f(w_t; ξ_t). end for (See the SGD sketch following the table.) |
| Open Source Code | No | The paper does not provide a statement about releasing the source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We conducted experiments on a single core for Algorithm 2 on two popular datasets ijcnn1 (n = 91,701 training data) and covtype (n = 406,709 training data) from the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). (See the data-loading sketch following the table.) |
| Dataset Splits | No | The paper mentions 'ijcnn1 (n = 91,701 training data)' and 'covtype (n = 406,709 training data)' and plots 'training loss' but does not specify any dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper states 'We conducted experiments on a single core' but provides no specific hardware details such as exact GPU/CPU models, processor types, or memory amounts. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | We choose the step size based on Theorem 5, i.e., η_t = 4/(µ(t+E)) and E = max{2τ, 16LD/µ}. For each fraction v ∈ {1, 3/4, 2/3, 1/2, 1/3, 1/4} we performed the following experiment: In Algorithm 2 we choose each filter matrix S^{ξ_t}_{u_t} to correspond with a random subset of size v|D_{ξ_t}| of the non-zero positions of D_{ξ_t} (i.e., the support of the gradient corresponding to ξ_t). In addition we use τ = 10. (See the step-size and filtered-update sketch following the table.) |
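
The Pseudocode row quotes Algorithm 1 (plain SGD). Below is a minimal Python sketch of that loop, included for reference; the names `grad_f`, `eta`, and `sample_xi` are placeholders for the paper's stochastic gradient ∇f(w; ξ), step-size schedule, and sampling of ξ_t, not code released by the authors.

```python
import numpy as np

def sgd(grad_f, eta, sample_xi, w0, num_iters):
    """Algorithm 1 (SGD): w_{t+1} = w_t - eta_t * grad f(w_t; xi_t).

    grad_f(w, xi) -- stochastic gradient of f at w for sample xi (placeholder)
    eta(t)        -- step-size schedule eta_t > 0 (placeholder)
    sample_xi()   -- draws the random variable xi_t (placeholder)
    """
    w = np.array(w0, dtype=float)
    for t in range(num_iters):
        xi = sample_xi()        # generate a random variable xi_t
        g = grad_f(w, xi)       # compute a stochastic gradient
        w = w - eta(t) * g      # update the new iterate
    return w
```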
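
The Open Datasets row points to the LIBSVM collection. Assuming the ijcnn1 and covtype files have been downloaded locally (the filenames below are assumptions and depend on which LIBSVM variant is fetched), they can be read with scikit-learn's `load_svmlight_file`:

```python
from sklearn.datasets import load_svmlight_file

# Files downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
# into the working directory; the exact filenames here are assumptions.
X_ijcnn1, y_ijcnn1 = load_svmlight_file("ijcnn1")                     # n = 91,701 training examples
X_covtype, y_covtype = load_svmlight_file("covtype.libsvm.binary")    # n = 406,709 training examples
print(X_ijcnn1.shape, X_covtype.shape)
```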
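
The Experiment Setup row describes the Theorem 5 step size η_t = 4/(µ(t+E)) with E = max{2τ, 16LD/µ}, and a filter that updates only a random fraction v of the non-zero gradient coordinates. The sketch below illustrates both pieces under assumed constants µ, L, D, τ and fraction v; it is not the authors' implementation of Algorithm 2.

```python
import numpy as np

def step_size(t, mu, L, D, tau):
    """Theorem 5 step size: eta_t = 4 / (mu * (t + E)), E = max(2*tau, 16*L*D/mu)."""
    E = max(2 * tau, 16 * L * D / mu)
    return 4.0 / (mu * (t + E))

def filtered_update(w, grad, v, eta, rng):
    """Update only a random subset of size v*|support(grad)| of the non-zero
    gradient positions, playing the role of the filter matrix S in Algorithm 2."""
    support = np.flatnonzero(grad)                        # non-zero positions of D_xi
    k = max(1, int(round(v * support.size)))              # keep a fraction v of them
    chosen = rng.choice(support, size=k, replace=False)   # random filter
    w_new = w.copy()
    w_new[chosen] -= eta * grad[chosen]                   # coordinate-wise SGD step
    return w_new
```

With v = 1 the filter keeps every non-zero coordinate, which corresponds to the τ ∈ {1, 10, 100} runs reported for Figures 2 and 4.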