On Graduated Optimization for Stochastic Non-Convex Problems

Authors: Elad Hazan, Kfir Yehuda Levy, Shai Shalev-Shwartz

ICML 2016

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments support the theoretical guarantees, substantiating an accelerated convergence in training the NN. Moreover, we demonstrate a non-convex phenomenon that exists in natural data and is captured by the σ-nice property. (Section 7, Experiments.) As a test case, we train a NN with a single hidden layer of 30 units over the MNIST data set.
Researcher Affiliation | Academia | Elad Hazan (EHAZAN@CS.PRINCETON.EDU), Princeton University; Kfir Y. Levy (KFIRYL@TX.TECHNION.AC.IL), Technion - Israel Institute of Technology; Shai Shalev-Shwartz (SHAIS@CS.HUJI.AC.IL), The Hebrew University of Jerusalem, Israel
Pseudocode | Yes | Figure 1: smoothed gradient oracle given gradient feedback. Figure 2: smoothed gradient oracle given value feedback. Algorithm 1: GradOptG. Algorithm 2: Suffix-SGD. Algorithm 3: GradOptV. (A schematic sketch of the smoothed gradient oracle and the graduated-optimization loop follows the table.)
Open Source Code | No | The paper does not provide any explicit statement about making the source code available or include links to a code repository.
Open Datasets | Yes | As a test case, we train a NN with a single hidden layer of 30 units over the MNIST data set. We adopt the experimental setup of (Dauphin et al., 2014) and train over a down-scaled version of the data, i.e., the original 28×28 images of MNIST were down-sampled to the size of 10×10.
Dataset Splits | No | The paper mentions using the MNIST dataset for training and evaluation but does not explicitly provide specific training/validation/test dataset splits, percentages, or absolute sample counts for each split.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud computing instance types used for running experiments. It only mentions general training parameters like "using a batch size of 100".
Software Dependencies | No | The paper mentions using a "ReLU activation function" and minimizing "square loss", but it does not specify any software dependencies with version numbers (e.g., specific programming languages, libraries, or frameworks like Python, PyTorch, TensorFlow, etc., with their versions).
Experiment Setup | Yes | We train a NN with a single hidden layer of 30 units over the MNIST data set. We adopt the experimental setup of (Dauphin et al., 2014) and train over a down-scaled version of the data, i.e., the original 28×28 images of MNIST were down-sampled to the size of 10×10. We use a ReLU activation function, and minimize the square loss. We started by running MSGD (Minibatch Stochastic Gradient Descent) on the problem, using a batch size of 100 and a step size rule of η_t = η_0(1 + γt)^(-3/4), where η_0 = 0.01, γ = 10^(-4). This choice of step size rule was the most effective among a grid of rules that we examined. (A minimal training-script sketch follows the table.)
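
The Pseudocode row refers to smoothed gradient oracles and the graduated-optimization algorithms (GradOptG with gradient feedback, Suffix-SGD as the inner solver, GradOptV with value feedback). Below is a minimal Python/NumPy sketch of the gradient-feedback case only, not the authors' code: perturbing the query point uniformly inside a ball of radius δ yields a stochastic gradient of the δ-smoothed objective, and an outer loop halves δ between stages while warm-starting from the previous solution. The 1/t inner step size, the fixed number of inner iterations, and the omission of Suffix-SGD's suffix averaging and the shrinking search balls are simplifications of ours.

import numpy as np

def sample_unit_ball(dim, rng):
    """Draw a point uniformly from the unit Euclidean ball."""
    u = rng.standard_normal(dim)
    u /= np.linalg.norm(u)
    r = rng.random() ** (1.0 / dim)  # radius density proportional to r^(dim-1)
    return r * u

def smoothed_grad_oracle(stoch_grad, x, delta, rng):
    """Smoothed gradient oracle given gradient feedback (cf. Figure 1):
    return a stochastic gradient of f at a point perturbed uniformly
    inside a ball of radius delta around x, i.e. an unbiased estimate
    of the gradient of the delta-smoothed objective."""
    u = sample_unit_ball(x.shape[0], rng)
    return stoch_grad(x + delta * u)

def graduated_optimization(stoch_grad, x0, delta0, rounds, inner_steps, rng):
    """Schematic graduated-optimization loop in the spirit of Algorithm 1:
    minimize ever-less-smoothed versions of the objective, halving the
    smoothing radius each round and warm-starting from the previous point.
    The inner solver here is plain SGD rather than Suffix-SGD."""
    x, delta = x0.copy(), delta0
    for _ in range(rounds):
        for t in range(1, inner_steps + 1):
            g = smoothed_grad_oracle(stoch_grad, x, delta, rng)
            x -= (1.0 / t) * g            # simple 1/t step size for the sketch
        delta /= 2.0                       # sharpen the smoothing
    return x

# Toy usage on a noisy gradient of f(x) = sum_i (x_i^2 + sin(5 x_i)) (illustrative only):
rng = np.random.default_rng(0)
noisy_grad = lambda x: 2 * x + 5 * np.cos(5 * x) + 0.1 * rng.standard_normal(x.shape)
x_hat = graduated_optimization(noisy_grad, x0=np.ones(5), delta0=1.0,
                               rounds=6, inner_steps=200, rng=rng)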
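
The Experiment Setup row can likewise be mirrored as a short training script. The sketch below assumes PyTorch/torchvision, which the paper does not name; the number of epochs, the interpolation used when down-sampling to 10×10, and the one-hot encoding of labels for the square loss are assumptions of ours. Only the MSGD baseline described in that row is shown; the paper's graduated variant would wrap this loop with the smoothing schedule sketched above.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Down-scale the original 28x28 MNIST images to 10x10 and flatten to 100-dim vectors.
transform = transforms.Compose([
    transforms.Resize((10, 10)),               # interpolation method is an assumption
    transforms.ToTensor(),
    transforms.Lambda(lambda img: img.view(-1)),
])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
loader = DataLoader(train_set, batch_size=100, shuffle=True)

# Single hidden layer of 30 ReLU units; square loss against one-hot targets.
model = nn.Sequential(nn.Linear(100, 30), nn.ReLU(), nn.Linear(30, 10))

eta0, gamma = 0.01, 1e-4
t = 0
for epoch in range(10):                        # number of epochs is an assumption
    for images, labels in loader:
        t += 1
        eta_t = eta0 * (1.0 + gamma * t) ** (-0.75)   # eta_t = eta0 * (1 + gamma*t)^(-3/4)
        targets = F.one_hot(labels, num_classes=10).float()
        loss = F.mse_loss(model(images), targets)     # square loss
        model.zero_grad()
        loss.backward()
        with torch.no_grad():                  # plain minibatch SGD update (MSGD)
            for p in model.parameters():
                p -= eta_t * p.grad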