On Graduated Optimization for Stochastic Non-Convex Problems
Authors: Elad Hazan, Kfir Yehuda Levy, Shai Shalev-Shwartz
ICML 2016 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments support the theoretical guarantees, substantiating an accelerated convergence in training the NN. Moreover, we demonstrate a non-convex phenomena that exists in natural data, and is captured by the σ-nice property. Section 7. Experiments. As a test case, we train a NN with a single hidden layer of 30 units over the MNIST data set. |
| Researcher Affiliation | Academia | Elad Hazan EHAZAN@CS.PRINCETON.EDU Princeton University; Kfir Y. Levy KFIRYL@TX.TECHNION.AC.IL Technion Israel Institute of Technology; Shai Shalev-Shwartz SHAIS@CS.HUJI.AC.IL The Hebrew University of Jerusalem, Israel |
| Pseudocode | Yes | Figure 1. Smoothed gradient oracle given gradient feedback. Figure 2. Smoothed gradient oracle given value feedback. Algorithm 1 Grad Opt G. Algorithm 2 Suffix-SGD. Algorithm 3 Grad Opt V. (A minimal sketch of the gradient-feedback smoothing oracle appears below the table.) |
| Open Source Code | No | The paper does not provide any explicit statement about making the source code available or include links to a code repository. |
| Open Datasets | Yes | As a test case, we train a NN with a single hidden layer of 30 units over the MNIST data set. We adopt the experimental setup of (Dauphin et al., 2014) and train over a down-scaled version of the data, i.e., the original 28×28 images of MNIST were down-sampled to the size of 10×10. |
| Dataset Splits | No | The paper mentions using the MNIST dataset for training and evaluation but does not explicitly provide specific training/validation/test dataset splits, percentages, or absolute sample counts for each split. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models, memory, or cloud computing instance types used for running experiments. It only mentions general training parameters like 'using a batch size of 100'. |
| Software Dependencies | No | The paper mentions using a 'ReLU activation function' and minimizing 'square loss', but it does not specify any software dependencies with version numbers (e.g., specific programming languages, libraries, or frameworks like Python, PyTorch, TensorFlow, etc., with their versions). |
| Experiment Setup | Yes | We train a NN with a single hidden layer of 30 units over the MNIST data set. We adopt the experimental setup of (Dauphin et al., 2014) and train over a down-scaled version of the data, i.e., the original 28×28 images of MNIST were down-sampled to the size of 10×10. We use a ReLU activation function, and minimize the square loss. We started by running MSGD (Minibatch Stochastic Gradient Descent) on the problem, using a batch size of 100, and a step size rule of η_t = η_0(1 + γt)^(-3/4), where η_0 = 0.01, γ = 10^(-4). This choice of step size rule was the most effective among a grid of rules that we examined. (A training-setup sketch appears below the table.) |
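
The smoothed gradient oracle cited in the Pseudocode row admits a compact summary: the paper smooths the objective by averaging over a random perturbation drawn from the unit Euclidean ball, f̂_δ(x) = E_u[f(x + δu)], and the gradient-feedback oracle returns the gradient of f at a single perturbed point, which is an unbiased estimate of ∇f̂_δ(x). The following is a minimal NumPy sketch of that idea, not the paper's own code; the function names and the single-sample example are our assumptions.

```python
import numpy as np

def sample_unit_ball(dim, rng):
    """Draw a point uniformly at random from the Euclidean unit ball in R^dim."""
    u = rng.normal(size=dim)
    u /= np.linalg.norm(u)            # uniform direction on the unit sphere
    r = rng.uniform() ** (1.0 / dim)  # radius with density proportional to r^(dim-1)
    return r * u

def smoothed_gradient_oracle(grad_f, x, delta, rng):
    """Return grad_f at a point perturbed inside a ball of radius delta.

    This is an unbiased estimate of the gradient of the delta-smoothed objective
    f_delta(x) = E_{u ~ Unif(unit ball)}[ f(x + delta * u) ]
    when gradient feedback on f is available (cf. Figure 1 of the paper).
    """
    u = sample_unit_ball(x.shape[0], rng)
    return grad_f(x + delta * u)

# Example: one smoothed-gradient step on f(x) = ||x||^2 / 2, whose gradient is x.
rng = np.random.default_rng(0)
x = np.ones(5)
g = smoothed_gradient_oracle(lambda z: z, x, delta=0.1, rng=rng)
x = x - 0.1 * g
```

In the paper's graduated scheme, such estimates feed an SGD subroutine while the smoothing radius δ is progressively reduced across epochs; the sketch above only illustrates the oracle itself.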
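
The Experiment Setup row maps onto a short minibatch-SGD baseline. Below is a hedged sketch assuming PyTorch and torchvision; the epoch count, loop structure, and variable names are our assumptions, while the 10×10 down-sampling, the single hidden layer of 30 ReLU units, the square loss, the batch size of 100, and the step-size rule η_t = η_0(1 + γt)^(-3/4) with η_0 = 0.01, γ = 10^(-4) follow the paper's description.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Down-scale the 28x28 MNIST images to 10x10, as in the paper's setup.
transform = transforms.Compose([transforms.Resize((10, 10)), transforms.ToTensor()])
train_set = datasets.MNIST("data", train=True, download=True, transform=transform)
loader = DataLoader(train_set, batch_size=100, shuffle=True)

# Single hidden layer of 30 ReLU units; square loss against one-hot targets.
model = nn.Sequential(nn.Flatten(), nn.Linear(100, 30), nn.ReLU(), nn.Linear(30, 10))
loss_fn = nn.MSELoss()

eta0, gamma = 0.01, 1e-4
t = 0
for epoch in range(10):  # number of epochs is not stated in the paper; 10 is a placeholder
    for x, y in loader:
        lr = eta0 * (1 + gamma * t) ** (-0.75)  # eta_t = eta_0 * (1 + gamma * t)^(-3/4)
        targets = nn.functional.one_hot(y, num_classes=10).float()
        loss = loss_fn(model(x), targets)
        model.zero_grad()
        loss.backward()
        with torch.no_grad():
            for p in model.parameters():
                p -= lr * p.grad  # plain minibatch SGD update
        t += 1
```

This only reproduces the MSGD baseline described in the row; the paper's contribution is the graduated-optimization algorithm it is compared against.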