Fast and Accurate Stochastic Gradient Estimation

Authors: Beidi Chen, Yingchen Xu, Anshumali Shrivastava

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We demonstrate the effectiveness of our proposal with experiments on linear models as well as the non-linear BERT, which is a recent popular deep learning based language representation model." and, from Section 3 (Experiments): "Linear regression is a basic and commonly used supervised machine learning algorithm for prediction. Deep learning models recently become popular for their state-of-the-art performance on Natural Language Processing (NLP) and also Computer Vision tasks. Therefore, we chose both linear regression and deep learning models as the target experiment tasks to examine the effectiveness of our algorithm."
Researcher Affiliation | Academia | Beidi Chen, Rice University, Houston, Texas (beidi.chen@rice.edu); Yingchen Xu, Rice University, Houston, Texas (yx26@rice.edu); Anshumali Shrivastava, Rice University, Houston, Texas (anshumali@rice.edu)
Pseudocode | Yes | Algorithm 1 (the hash-table assignment algorithm) and Algorithm 2 (the LSH-Sampled Stochastic Gradient Descent (LGD) algorithm); a minimal sketch of the combined procedure is given after this table.
Open Source Code | No | The paper does not provide an explicit statement about releasing the source code for the proposed method, nor a link to a code repository.
Open Datasets | Yes | "Dataset: We used three large regression [datasets], Year Prediction MSD [18], Slice [18], UJIIndoorLoc [27], and two NLP benchmarks, MRPC [13], RTE [28]."
Dataset Splits | No | The paper refers to training and testing data for the datasets used (Figure 4) and to parameters such as "3 epochs with batch size 32", but it does not specify explicit validation-set sizes or split percentages.
Hardware Specification | No | "We do not explore the time-wise convergence comparison between LGD and SGD in current tasks because BERT is implemented in Tensorflow [1] and Pytorch [21] on GPU. We currently only have the CPU implementation of LSH." This mentions a GPU and a CPU but gives no specific hardware models or other details.
Software Dependencies | No | "We do not explore the time-wise convergence comparison between LGD and SGD in current tasks because BERT is implemented in Tensorflow [1] and Pytorch [21] on GPU." Software packages are named, but no version numbers are given.
Experiment Setup | Yes | "For each task, we ran fine-tunings for 3 epochs with batch size 32 and used Adam optimizer with initial learning rates 2e. As for LSH parameter, we chose K = 7, L = 10." and "We used fixed values K = 5 and L = 100 for all the datasets. l is the number of hash tables that have been searched before landing in a non-empty bucket in a query. In our experiments l is almost always as low as 1. L only affects preprocessing but not sampling. Our hash function was simhash (or signed random projections) and we used sparse random projections with sparsity 1/30 for speed. We tried a sweep of initial step size from 1e-5 to 1e-1 and choose the one that will lead to convergence with LGD and SGD." A sketch of this hash-function configuration also follows the table.
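
To make the pseudocode entry concrete, below is a minimal sketch of the LGD idea for least-squares regression: pre-process the examples into SimHash hash tables (Algorithm 1), then at each iteration query the tables with the current parameters so that examples whose transformed representation aligns with the query, a proxy for a larger gradient contribution, are more likely to be retrieved, and take an SGD step on the retrieved example (Algorithm 2). Function names, the asymmetric transform, and all constants here are illustrative assumptions, and the paper's inverse-probability reweighting of the sampled gradient is omitted.

```python
import numpy as np

K, L = 5, 10   # bits per hash code and number of tables (the paper uses K = 5, L = 100)

def simhash_code(planes, v):
    """K-bit signed-random-projection (SimHash) code of v, as a hashable key."""
    return (planes @ v > 0).tobytes()

def build_tables(data, planes_list):
    """Pre-processing: insert every (transformed) example into each of the L tables."""
    tables = [{} for _ in planes_list]
    for i, v in enumerate(data):
        for table, planes in zip(tables, planes_list):
            table.setdefault(simhash_code(planes, v), []).append(i)
    return tables

def lgd(X, y, steps=2000, lr=1e-2, seed=0):
    """Illustrative LSH-sampled gradient descent for the squared loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Asymmetric transform (assumption): storing [x_i, y_i] and querying with
    # [theta, -1] makes the query-data inner product the residual theta^T x_i - y_i.
    data = np.hstack([X, y[:, None]])
    planes_list = [rng.standard_normal((K, d + 1)) for _ in range(L)]
    tables = build_tables(data, planes_list)
    theta = np.zeros(d)
    for _ in range(steps):
        q = np.append(theta, -1.0)
        i = None
        # Probe tables until a non-empty bucket is hit (the paper reports ~1 probe).
        for table, planes in zip(tables, planes_list):
            bucket = table.get(simhash_code(planes, q), [])
            if bucket:
                i = int(bucket[rng.integers(len(bucket))])
                break
        if i is None:
            i = int(rng.integers(n))          # fall back to a uniform random sample
        grad = (X[i] @ theta - y[i]) * X[i]   # per-example squared-loss gradient
        theta -= lr * grad
    return theta

# Tiny synthetic usage example.
rng = np.random.default_rng(1)
X = rng.standard_normal((1000, 10))
y = X @ rng.standard_normal(10) + 0.01 * rng.standard_normal(1000)
theta_hat = lgd(X, y)
print("train MSE:", float(np.mean((X @ theta_hat - y) ** 2)))
```

The small synthetic run at the end only checks that the sketch executes; it makes no claim about reproducing the paper's reported convergence behavior.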
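
The hash-function setting quoted in the Experiment Setup row, SimHash (signed random projections) with sparse projections of sparsity 1/30 and K = 5, L = 100 for the regression datasets, can be sketched as follows; the exact sparse-projection convention (which coordinates are nonzero and with what values) is an assumption, since the paper does not spell it out.

```python
import numpy as np

def sparse_srp_planes(dim, K=5, L=100, sparsity=1/30, seed=0):
    """Draw L sets of K sparse projection directions: each coordinate is nonzero
    with probability `sparsity` and carries a random sign when it is nonzero."""
    rng = np.random.default_rng(seed)
    nonzero = rng.random((L, K, dim)) < sparsity
    signs = rng.choice([-1.0, 1.0], size=(L, K, dim))
    return nonzero * signs                      # dense array of shape (L, K, dim)

def simhash_codes(planes, v):
    """K-bit SimHash code of v for each of the L tables (sign pattern of projections)."""
    return planes @ v > 0                       # boolean array of shape (L, K)

# Example with a 90-dimensional feature vector (YearPredictionMSD has 90 features).
planes = sparse_srp_planes(dim=90)
v = np.random.default_rng(1).standard_normal(90)
print(simhash_codes(planes, v).shape)           # -> (100, 5)
```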