Nesterov Accelerated Shuffling Gradient Method for Convex Optimization

Authors: Trang H. Tran, Katya Scheinberg, Lam M. Nguyen

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Numerical simulations demonstrate the efficiency of our algorithm.
Researcher Affiliation | Collaboration | (1) School of Operations Research and Information Engineering, Cornell University, Ithaca, NY, USA; (2) IBM Research, Thomas J. Watson Research Center, Yorktown Heights, NY, USA.
Pseudocode | Yes (see the sketch below) | Algorithm 2: Nesterov Accelerated Shuffling Gradient (NASG) Method.
Open Source Code | Yes | Our code can be found at the repository https://github.com/htt-trangtran/nasg.
Open Datasets | Yes (see the loading sketch below) | We have conducted the experiments on three classification datasets, w8a (49,749 samples), ijcnn1 (91,701 samples) and covtype (406,709 samples) from LIBSVM (Chang & Lin, 2011). ... We test our algorithm using linear neural networks on three well-known image classification datasets: the MNIST dataset (LeCun et al., 1998) and the Fashion-MNIST dataset (Xiao et al., 2017), both with 60,000 samples, and finally the CIFAR-10 dataset (Krizhevsky & Hinton, 2009) with 50,000 images.
Dataset Splits | Yes | At the tuning stage, we test each method for 20 epochs. We run every algorithm with a constant learning rate, where the learning rates follow a grid search, and select the ones that perform best according to their results.
Hardware Specification | No | The paper does not explicitly describe the specific hardware (e.g., CPU or GPU models, or memory) used to run its experiments.
Software Dependencies | Yes | All the algorithms are implemented in Python using the PyTorch package (Paszke et al., 2019).
Experiment Setup | Yes (see the tuning sketch below) | The minibatch size is 256. ... We tune each algorithm using a constant learning rate and report the best final results. ... For SGD and NASG the searching grid is {1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001}. ... For SGD-M, ... Note that this momentum update is implemented in PyTorch with the default value β = 0.9. ... For Adam, we fixed two hyper-parameters β1 := 0.9, β2 := 0.999 as in the original paper. Since the default learning rate for Adam is 0.001, we let our searching grid be {0.005, 0.001, 0.0005}.
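To accompany the Pseudocode row, the following is a minimal sketch of an epoch-wise Nesterov-accelerated shuffling scheme in the spirit of NASG: a reshuffled single pass over the components per epoch, with the momentum term updated once at the end of each epoch. The inner step size lr/n and the (t - 1)/(t + 2) momentum weight are assumptions for illustration, not the exact constants of Algorithm 2 in the paper; consult the authors' repository for the real implementation.

import numpy as np

def nasg_sketch(grad_fn, n_samples, w0, lr=0.1, n_epochs=20, seed=0):
    """Hedged sketch of a Nesterov-accelerated shuffling gradient loop.

    `grad_fn(w, i)` is a hypothetical callable returning the gradient of the
    i-th component function at w.
    """
    rng = np.random.default_rng(seed)
    w_prev = w0.copy()      # iterate produced by the previous epoch
    y = w0.copy()           # momentum-extrapolated anchor point
    for t in range(1, n_epochs + 1):
        w = y.copy()
        perm = rng.permutation(n_samples)        # reshuffle once per epoch
        for i in perm:                           # one pass over the data
            w = w - (lr / n_samples) * grad_fn(w, i)
        beta = (t - 1) / (t + 2)                 # assumed Nesterov-style weight
        y = w + beta * (w - w_prev)              # momentum update at epoch end
        w_prev = w
    return w_prev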
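For the Open Datasets row, a short loading sketch under common tooling: the LIBSVM files are read with scikit-learn and the image datasets are fetched via torchvision. The local paths are placeholders; only the dataset names come from the text above.

from sklearn.datasets import load_svmlight_file
from torchvision import datasets, transforms

# LIBSVM-format classification data (download the files from the LIBSVM page first).
X_w8a, y_w8a = load_svmlight_file("data/w8a")

# Image datasets used with the linear networks (downloaded on first use).
mnist = datasets.MNIST(root="data", train=True, download=True, transform=transforms.ToTensor())
fashion = datasets.FashionMNIST(root="data", train=True, download=True, transform=transforms.ToTensor())
cifar10 = datasets.CIFAR10(root="data", train=True, download=True, transform=transforms.ToTensor())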
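Finally, for the Experiment Setup row, a sketch of the constant-learning-rate grid search described there. Only the grids, the 20-epoch tuning budget, and the fixed momentum/Adam parameters come from the quoted text; `build_model` and `train_for_20_epochs` are hypothetical helpers, the SGD-M grid is elided in the excerpt, and NASG itself is not a built-in PyTorch optimizer.

import torch

# Learning-rate grids quoted in the setup row above (SGD-M's grid is elided there).
grids = {
    "SGD":  [1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001],
    "NASG": [1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001],
    "Adam": [0.005, 0.001, 0.0005],
}

def make_optimizer(name, params, lr):
    if name == "Adam":
        # beta1, beta2 fixed as in the original Adam paper, per the setup row.
        return torch.optim.Adam(params, lr=lr, betas=(0.9, 0.999))
    if name == "SGD-M":
        # PyTorch heavy-ball momentum with the default value 0.9.
        return torch.optim.SGD(params, lr=lr, momentum=0.9)
    if name == "SGD":
        return torch.optim.SGD(params, lr=lr)
    # NASG is not a built-in optimizer; use the authors' repository linked above.
    raise NotImplementedError(name)

def tune(name, build_model, train_for_20_epochs):
    """Try each learning rate in the grid for 20 epochs and keep the best loss."""
    best_lr, best_loss = None, float("inf")
    for lr in grids[name]:
        model = build_model()                        # hypothetical model factory
        opt = make_optimizer(name, model.parameters(), lr)
        loss = train_for_20_epochs(model, opt)       # hypothetical training loop
        if loss < best_loss:
            best_lr, best_loss = lr, loss
    return best_lr, best_loss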