Distributed Stochastic Optimization via Adaptive SGD

Authors: Ashok Cutkosky, Róbert Busa-Fekete

NeurIPS 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We implement our algorithm in the Spark distributed framework and exhibit dramatic performance gains on large-scale logistic regression problems." and "To verify our theoretical results, we carried out experiments on large-scale (order 100 million datapoints) public datasets..."
Researcher Affiliation | Collaboration | Ashok Cutkosky, Stanford University, USA (cutkosky@google.com) and Róbert Busa-Fekete, Yahoo! Research, New York, USA (busafekete@oath.com); now at Google
Pseudocode | Yes | "Algorithm 1 SVRG OL (SVRG with Online Learning)"
Open Source Code | No | No explicit statement providing access to the source code for the methodology described in this paper was found.
Open Datasets | Yes | "To verify our theoretical results, we carried out experiments on large-scale (order 100 million datapoints) public datasets, such as KDD10 and KDD12" (footnote: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html); see the loading sketch after the table.
Dataset Splits | Yes | "The main statistics of the datasets are shown in Table 2." and "We measure the number of communication rounds, the total training error, the error on a held-out test set, the Area Under the Curve (AUC), and total runtime in minutes."
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or cloud instance types) used for running the experiments were provided; the paper only mentions the Spark distributed framework.
Software Dependencies | Yes | "We tested two well-known scalable logistic regression implementations: Spark ML 2.2.0 and Vowpal Wabbit 7.10.0 (VW)"; see the PySpark sketch after the table.
Experiment Setup | Yes | "Our theoretical analysis asks for exponentially increasing serial phase lengths T_k and a batch size of N̂ = T^2. In practice we use slightly different settings. We have a constant serial phase length T_k = T_0 for all k, and an increasing batch size N̂_k = kC for some constant C. We usually set C = T_0." and "We initially divide the training data into C approximately 100M chunks, and we use min(1000, C) executors." See the schedule sketch after the table.
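
The datasets in the Open Datasets row (KDD10, KDD12) are distributed in LIBSVM format at the URL above. As a minimal sketch, assuming scikit-learn and a downloaded, decompressed copy at the hypothetical local path ./kdd12, the data can be read into a sparse matrix as follows; at the full ~100 million datapoint scale the paper instead processes the data in Spark:

    # Minimal sketch: read a LIBSVM-format file such as KDD12 into a sparse matrix.
    # "./kdd12" is a hypothetical local path to a downloaded, decompressed copy.
    from sklearn.datasets import load_svmlight_file

    X, y = load_svmlight_file("./kdd12")  # X: scipy.sparse CSR matrix, y: labels
    print(X.shape, y.shape)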
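
The Software Dependencies row names Spark ML 2.2.0 as one of the two logistic regression baselines. The following is a minimal PySpark sketch of such a baseline, not the authors' configuration; the input path, split, iteration count, and regularization strength are all hypothetical:

    # Minimal PySpark sketch of a Spark ML logistic regression baseline.
    # Path, split, maxIter, and regParam are hypothetical, not the paper's settings.
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("lr-baseline").getOrCreate()
    data = spark.read.format("libsvm").load("hdfs:///data/kdd12")  # expects 0/1 labels
    train, test = data.randomSplit([0.9, 0.1], seed=0)

    model = LogisticRegression(maxIter=100, regParam=1e-6).fit(train)
    auc = BinaryClassificationEvaluator(metricName="areaUnderROC").evaluate(model.transform(test))
    print("held-out AUC:", auc)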
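
The Experiment Setup row describes the schedule used in practice: a constant serial phase length T_k = T_0 and an anchor batch size N̂_k = kC that grows across phases, with C = T_0. The sketch below is a generic SVRG-style skeleton that only illustrates this schedule; it is not the paper's Algorithm 1 (SVRG OL), which tunes step sizes with an online learning subroutine and computes the batch gradients in parallel. The fixed step size eta and the user-supplied grad function are assumptions for illustration:

    import numpy as np

    def phase_scheduled_svrg(grad, w0, n, T0, num_phases, eta=0.1, seed=0):
        """Generic SVRG-style loop following the quoted phase/batch schedule.

        grad(w, idx) -> average gradient over the examples indexed by idx (user-supplied).
        Phase k draws an anchor batch of size N_k = k*C with C = T0, then runs a
        serial phase of constant length T0. The fixed step size eta stands in for
        the online-learning step-size tuning of the paper's SVRG OL.
        """
        rng = np.random.default_rng(seed)
        w, C = w0.copy(), T0
        for k in range(1, num_phases + 1):
            batch = rng.choice(n, size=min(k * C, n), replace=False)
            w_anchor = w.copy()
            mu = grad(w_anchor, batch)            # variance-reduction anchor gradient
            for _ in range(T0):                   # constant-length serial phase
                i = rng.integers(n, size=1)
                w -= eta * (grad(w, i) - grad(w_anchor, i) + mu)
        return w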