Stochastic Variance Reduction for Nonconvex Optimization

Authors: Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, Alex Smola

ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We present our empirical results in this section. In particular, we study multiclass classification using neural networks. This is a typical nonconvex problem encountered in machine learning. Experimental Setup. We train neural networks with one fully-connected hidden layer of 100 nodes and 10 softmax output nodes. We use ℓ2-regularization for training. We use CIFAR-10, MNIST, and STL-10 datasets for our experiments. Figure 1 shows the results.
Researcher Affiliation | Academia | Machine Learning Department, School of Computer Science, Carnegie Mellon University; Laboratory for Information & Decision Systems, Massachusetts Institute of Technology
Pseudocode | Yes | Algorithm 1 SVRG and Algorithm 2 GD-SVRG (a minimal sketch of the SVRG update appears after this table)
Open Source Code | No | The paper does not contain any explicit statement about releasing the source code or a link to a code repository for the described methodology.
Open Datasets | Yes | We use CIFAR-10, MNIST, and STL-10 datasets for our experiments. These datasets are standard in the neural networks literature. The features in the datasets are normalized to the interval [0, 1]. All the datasets come with a predefined split into training and test datasets.
Dataset Splits | No | All the datasets come with a predefined split into training and test datasets.
Hardware Specification | No | No specific hardware details (like GPU/CPU models, memory, or cloud instances) used for running the experiments are mentioned.
Software Dependencies | No | No specific software dependencies with version numbers are mentioned in the paper.
Experiment Setup | Yes | We train neural networks with one fully-connected hidden layer of 100 nodes and 10 softmax output nodes. We use ℓ2-regularization for training. The ℓ2 regularization is 1e-3 for CIFAR-10 and MNIST, and 1e-2 for STL-10. The step size is critical for SGD; we set it using the popular t-inverse schedule η_t = η_0(1 + η'_0⌊t/n⌋)^(-1), where η_0 and η'_0 are chosen so that SGD gives the best performance on the training loss. In our experiments, we also use η'_0 = 0; this results in a fixed step size for SGD. For SVRG, we use a fixed step size as suggested by our analysis. Again, the step size is chosen so that SVRG gives the best performance on the training loss. Initialization & mini-batching. Initialization is critical to training of neural networks. We use the normalized initialization in (Glorot & Bengio, 2010)... We use mini-batches of size 10 in our experiments... we use an epoch size m = n/10 in our experiments. (Illustrative sketches of the step-size schedule, training configuration, and initialization follow this table.)
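
For readers cross-checking the pseudocode claim, below is a minimal Python sketch of the mini-batch SVRG update described by Algorithm 1. This is not the authors' code; the `grad_fn` interface, variable names, and defaults are illustrative assumptions.

```python
import numpy as np

def svrg(grad_fn, x0, n, step_size, num_epochs, epoch_len, batch_size=10, seed=0):
    """Sketch of mini-batch SVRG for f(x) = (1/n) * sum_i f_i(x).

    grad_fn(x, idx) is assumed to return the average gradient of f_i(x)
    over the index array `idx` (an illustrative interface, not the paper's).
    """
    rng = np.random.default_rng(seed)
    snapshot = np.asarray(x0, dtype=float).copy()
    for _ in range(num_epochs):
        # Full gradient at the snapshot point, computed once per outer epoch.
        full_grad = grad_fn(snapshot, np.arange(n))
        x = snapshot.copy()
        for _ in range(epoch_len):
            idx = rng.integers(0, n, size=batch_size)
            # Variance-reduced stochastic gradient estimate.
            v = grad_fn(x, idx) - grad_fn(snapshot, idx) + full_grad
            x = x - step_size * v
        snapshot = x  # the analysis also allows returning a uniformly random iterate
    return snapshot
```

With the choices reported in the paper, `epoch_len` would be n/10, `batch_size` would be 10, and `step_size` would be a fixed value tuned on the training loss.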
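
The experiment-setup row can likewise be summarized as a small configuration sketch. The dictionary layout and the SGD step-size helper below are assumptions for illustration; the numeric values come from the quoted setup, and the tuned η_0 is not reported, so it is left as a parameter.

```python
def t_inverse_step_size(t, n, eta0, eta0_prime=0.0):
    """t-inverse schedule for SGD: eta_t = eta0 * (1 + eta0_prime * floor(t/n))**-1.

    With eta0_prime = 0 (as also used in the paper) this reduces to a fixed step size.
    """
    return eta0 / (1.0 + eta0_prime * (t // n))

# Illustrative summary of the reported setup (field names are not from the paper).
experiment_config = {
    "hidden_layers": [100],          # one fully-connected hidden layer
    "output": "softmax",             # 10 softmax output nodes
    "l2_regularization": {"cifar10": 1e-3, "mnist": 1e-3, "stl10": 1e-2},
    "minibatch_size": 10,
    "epoch_length": "n / 10",        # inner-loop length m = n/10
    "sgd_step_size": "t-inverse schedule, tuned on training loss",
    "svrg_step_size": "fixed, tuned on training loss",
}
```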
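
The "normalized initialization in (Glorot & Bengio, 2010)" mentioned in the setup is the standard uniform scheme sketched below; this reproduces the published formula rather than any code from the paper, and the function name is illustrative.

```python
import numpy as np

def glorot_normalized_init(fan_in, fan_out, rng=None):
    """Normalized initialization of Glorot & Bengio (2010):
    W ~ Uniform(-sqrt(6/(fan_in + fan_out)), +sqrt(6/(fan_in + fan_out))).
    """
    rng = rng or np.random.default_rng(0)
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```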