Non-convex Finite-Sum Optimization Via SCSG Methods
Authors: Lihua Lei, Cheng Ju, Jianbo Chen, Michael I. Jordan
NeurIPS 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical experiments demonstrate that SCSG outperforms stochastic gradient methods on training multi-layer neural networks in terms of both training and validation loss. |
| Researcher Affiliation | Academia | Lihua Lei UC Berkeley lihua.lei@berkeley.edu Cheng Ju UC Berkeley cju@berkeley.edu Jianbo Chen UC Berkeley jianbochen@berkeley.edu Michael I. Jordan UC Berkeley jordan@stat.berkeley.edu |
| Pseudocode | Yes | Algorithm 1: (Mini-Batch) Stochastically Controlled Stochastic Gradient (SCSG) method for smooth non-convex finite-sum objectives. A hedged sketch of this update appears after the table. |
| Open Source Code | Yes | Our code is available at https://github.com/Jianbo-Lab/SCSG. |
| Open Datasets | Yes | We evaluate SCSG and mini-batch SGD on the MNIST dataset |
| Dataset Splits | No | The paper mentions "training and validation loss" in Figure 1, implying the use of a validation set. However, it only explicitly states the sizes of the training set (50,000 examples) and test set (10,000 examples), and does not specify how the validation set was created or how large it is. |
| Hardware Specification | Yes | All experiments were carried out on an Amazon p2.xlarge node with an NVIDIA GK210 GPU |
| Software Dependencies | Yes | Algorithms implemented in TensorFlow 1.0. |
| Experiment Setup | Yes | We initialized parameters by TensorFlow's default Xavier uniform initializer. In all experiments below, we show the results corresponding to the best-tuned stepsizes. We consider three algorithms: (1) SGD with a fixed batch size B ∈ {512, 1024}; (2) SCSG with a fixed batch size B ∈ {512, 1024} and a fixed mini-batch size b = 32; (3) SCSG with time-varying batch sizes B_j = ⌈j^{3/2} ∧ n⌉ and b_j = ⌈B_j/32⌉. To be clear, given T epochs, the IFO complexity of the three algorithms is TB, 2TB and 2 Σ_{j=1}^{T} B_j, respectively. We run each algorithm with 20 passes of data. A sketch of the batch-size schedule and IFO accounting also appears after the table. |
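The "Pseudocode" row refers to Algorithm 1 (mini-batch SCSG): an outer loop that estimates the gradient on a large batch, followed by an inner loop of geometrically distributed length that applies SVRG-style variance-reduced updates on small mini-batches. The following is a minimal NumPy sketch of that structure; the `grad_fn(x, idx)` oracle, the toy least-squares objective, and the constant stepsize are illustrative assumptions, not the authors' TensorFlow implementation (that code is at https://github.com/Jianbo-Lab/SCSG).

```python
import numpy as np

def scsg(grad_fn, n, x0, num_epochs, batch_size_fn, mini_batch_fn, eta, seed=0):
    """Hedged sketch of mini-batch SCSG.

    grad_fn(x, idx) is assumed to return the average gradient of f_i over
    the indices in idx.
    """
    rng = np.random.default_rng(seed)
    x_tilde = np.asarray(x0, dtype=float).copy()
    for j in range(1, num_epochs + 1):
        B_j, b_j = batch_size_fn(j), mini_batch_fn(j)
        # Outer step: gradient over a uniformly sampled batch I_j with |I_j| = B_j.
        I_j = rng.choice(n, size=B_j, replace=False)
        g_j = grad_fn(x_tilde, I_j)
        # Inner-loop length: NumPy's geometric with success prob b_j/(B_j+b_j)
        # has mean (B_j + b_j)/b_j, matching the paper's Geom(B_j/(B_j+b_j))
        # (mean B_j/b_j) up to the support convention.
        N_j = rng.geometric(b_j / (B_j + b_j))
        x, x_anchor = x_tilde.copy(), x_tilde.copy()
        for _ in range(N_j):
            idx = rng.choice(n, size=b_j, replace=False)
            # SVRG-style variance-reduced gradient estimate.
            nu = grad_fn(x, idx) - grad_fn(x_anchor, idx) + g_j
            x = x - eta * nu
        x_tilde = x
    return x_tilde


if __name__ == "__main__":
    # Toy least-squares usage: f_i(x) = 0.5 * (a_i^T x - y_i)^2.
    rng = np.random.default_rng(1)
    n, d = 1000, 20
    A, y = rng.normal(size=(n, d)), rng.normal(size=n)
    grad = lambda x, idx: A[idx].T @ (A[idx] @ x - y[idx]) / len(idx)
    x_hat = scsg(grad, n, np.zeros(d), num_epochs=20,
                 batch_size_fn=lambda j: 512, mini_batch_fn=lambda j: 32,
                 eta=0.05)
```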
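The "Experiment Setup" row quotes the time-varying schedule B_j = ⌈j^{3/2} ∧ n⌉, b_j = ⌈B_j/32⌉ and the IFO counts TB, 2TB and 2 Σ_{j=1}^{T} B_j. The short sketch below reproduces that accounting, assuming the MNIST training-set size n = 50,000 from the Dataset Splits row, T = 20 passes, and the fixed batch size B = 512 from the quoted setup.

```python
import math

# Assumed from the rows above: MNIST training size, number of passes, fixed batch size.
n, T, B = 50_000, 20, 512

# Time-varying schedule quoted in the setup: B_j = ceil(j^{3/2} ∧ n), b_j = ceil(B_j / 32).
B_j = [min(math.ceil(j ** 1.5), n) for j in range(1, T + 1)]
b_j = [math.ceil(Bj / 32) for Bj in B_j]

ifo_sgd        = T * B          # (1) SGD with fixed batch size B
ifo_scsg_fixed = 2 * T * B      # (2) SCSG with fixed B and b = 32
ifo_scsg_vary  = 2 * sum(B_j)   # (3) SCSG with time-varying B_j, b_j

print(ifo_sgd, ifo_scsg_fixed, ifo_scsg_vary)
```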