SGD and Hogwild! Convergence Without the Bounded Gradients Assumption
Authors: Lam Nguyen, Phuong Ha Nguyen, Marten van Dijk, Peter Richtarik, Katya Scheinberg, Martin Takac
ICML 2018
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4. Numerical Experiments: For our numerical experiments, we consider the finite sum minimization problem in (2). We conducted experiments on a single core for Algorithm 2 on two popular datasets ijcnn1 (n = 91,701 training data) and covtype (n = 406,709 training data) from the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). Since we are interested in the expected convergence rate with respect to the number of iterations, respectively the number of single position vector updates, we do not need a parallelized multi-core simulation to confirm our analysis. The impact of efficient resource scheduling over multiple cores leads to a performance improvement complementary to our analysis of (10) (which, as discussed, lends itself to an efficient parallelized implementation). We experimented with 10 runs and reported the average results. We choose the step size based on Theorem 5, i.e., η_t = 4/(µ(t+E)) and E = max{2τ, 16LD/µ}. For each fraction v ∈ {1, 3/4, 2/3, 1/2, 1/3, 1/4} we performed the following experiment: In Algorithm 2 we choose each filter matrix S^{ξ_t}_{u_t} to correspond with a random subset of size v|D_{ξ_t}| of the non-zero positions of D_{ξ_t} (i.e., the support of the gradient corresponding to ξ_t). In addition we use τ = 10. For the two datasets, Figures 1 and 3 plot the training loss for each fraction with τ = 10. The top plots have t, the number of coordinate updates, for the horizontal axis. The bottom plots have the number of epochs, each epoch counting n iterations, for the horizontal axis. The results show that each fraction exhibits a sublinear expected convergence rate of O(1/t); the smaller fractions exhibit larger deviations but do seem to converge faster to the minimum solution. In Figures 2 and 4, we show experiments with different values of τ ∈ {1, 10, 100} where we use the whole non-zero set of gradient positions (i.e., v = 1) for the update. Our analysis states that, for t = 50 epochs times n iterations per epoch, τ can be as large as √(t/L(t)) = 524 for ijcnn1 and 1058 for covtype. The experiments indeed show that τ ≤ 100 has little effect on the expected convergence rate. |
| Researcher Affiliation | Collaboration | 1Department of Industrial and Systems Engineering, Lehigh University, USA. 2IBM Thomas J. Watson Research Center, USA. 3Department of Electrical and Computer Engineering, University of Connecticut, USA. 4KAUST, KSA; Edinburgh, UK; MIPT, Russia. |
| Pseudocode | Yes | Algorithm 1 Stochastic Gradient Descent (SGD) Method. Initialize: w_0. Iterate: for t = 0, 1, 2, . . . do: Choose a step size (i.e., learning rate) η_t > 0. Generate a random variable ξ_t. Compute a stochastic gradient ∇f(w_t; ξ_t). Update the new iterate w_{t+1} = w_t − η_t ∇f(w_t; ξ_t). end for (See the SGD sketch following the table.) |
| Open Source Code | No | The paper does not provide a statement about releasing the source code for the methodology or a link to a code repository. |
| Open Datasets | Yes | We conducted experiments on a single core for Algorithm 2 on two popular datasets ijcnn1 (n = 91,701 training data) and covtype (n = 406,709 training data) from the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). (See the data-loading sketch following the table.) |
| Dataset Splits | No | The paper mentions 'ijcnn1 (n = 91,701 training data)' and 'covtype (n = 406,709 training data)' and plots 'training loss' but does not specify any dataset splits for training, validation, or testing. |
| Hardware Specification | No | The paper states 'We conducted experiments on a single core' but provides no specific hardware details such as exact GPU/CPU models, processor types, or memory amounts. |
| Software Dependencies | No | The paper does not provide specific ancillary software details, such as library or solver names with version numbers, needed to replicate the experiment. |
| Experiment Setup | Yes | We choose the step size based on Theorem 5, i.e., η_t = 4/(µ(t+E)) and E = max{2τ, 16LD/µ}. For each fraction v ∈ {1, 3/4, 2/3, 1/2, 1/3, 1/4} we performed the following experiment: In Algorithm 2 we choose each filter matrix S^{ξ_t}_{u_t} to correspond with a random subset of size v|D_{ξ_t}| of the non-zero positions of D_{ξ_t} (i.e., the support of the gradient corresponding to ξ_t). In addition we use τ = 10. (See the step-size and filtered-update sketch following the table.) |
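
The Pseudocode row quotes Algorithm 1 (plain SGD). Below is a minimal Python sketch of that loop, included for reference; the names `grad_f`, `eta`, and `sample_xi` are placeholders for the paper's stochastic gradient ∇f(w; ξ), step-size schedule, and sampling of ξ_t, not code released by the authors.

```python
import numpy as np

def sgd(grad_f, eta, sample_xi, w0, num_iters):
    """Algorithm 1 (SGD): w_{t+1} = w_t - eta_t * grad f(w_t; xi_t).

    grad_f(w, xi) -- stochastic gradient of f at w for sample xi (placeholder)
    eta(t)        -- step-size schedule eta_t > 0 (placeholder)
    sample_xi()   -- draws the random variable xi_t (placeholder)
    """
    w = np.array(w0, dtype=float)
    for t in range(num_iters):
        xi = sample_xi()        # generate a random variable xi_t
        g = grad_f(w, xi)       # compute a stochastic gradient
        w = w - eta(t) * g      # update the new iterate
    return w
```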
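
The Open Datasets row points to the LIBSVM collection. Assuming the ijcnn1 and covtype files have been downloaded locally (the filenames below are assumptions and depend on which LIBSVM variant is fetched), they can be read with scikit-learn's `load_svmlight_file`:

```python
from sklearn.datasets import load_svmlight_file

# Files downloaded from http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
# into the working directory; the exact filenames here are assumptions.
X_ijcnn1, y_ijcnn1 = load_svmlight_file("ijcnn1")                     # n = 91,701 training examples
X_covtype, y_covtype = load_svmlight_file("covtype.libsvm.binary")    # n = 406,709 training examples
print(X_ijcnn1.shape, X_covtype.shape)
```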
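
The Experiment Setup row describes the Theorem 5 step size η_t = 4/(µ(t+E)) with E = max{2τ, 16LD/µ}, and a filter that updates only a random fraction v of the non-zero gradient coordinates. The sketch below illustrates both pieces under assumed constants µ, L, D, τ and fraction v; it is not the authors' implementation of Algorithm 2.

```python
import numpy as np

def step_size(t, mu, L, D, tau):
    """Theorem 5 step size: eta_t = 4 / (mu * (t + E)), E = max(2*tau, 16*L*D/mu)."""
    E = max(2 * tau, 16 * L * D / mu)
    return 4.0 / (mu * (t + E))

def filtered_update(w, grad, v, eta, rng):
    """Update only a random subset of size v*|support(grad)| of the non-zero
    gradient positions, playing the role of the filter matrix S in Algorithm 2."""
    support = np.flatnonzero(grad)                        # non-zero positions of D_xi
    k = max(1, int(round(v * support.size)))              # keep a fraction v of them
    chosen = rng.choice(support, size=k, replace=False)   # random filter
    w_new = w.copy()
    w_new[chosen] -= eta * grad[chosen]                   # coordinate-wise SGD step
    return w_new
```

With v = 1 the filter keeps every non-zero coordinate, which corresponds to the τ ∈ {1, 10, 100} runs reported for Figures 2 and 4.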