Understanding Short-Horizon Bias in Stochastic Meta-Optimization
Authors: Yuhuai Wu, Mengye Ren, Renjie Liao, Roger Grosse.
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. |
| Researcher Affiliation | Academia | Yuhuai Wu, Mengye Ren, Renjie Liao & Roger B. Grosse, University of Toronto and Vector Institute, {ywu, mren, rjliao, rgrosse}@cs.toronto.edu |
| Pseudocode | Yes | Algorithm 1: Stochastic Meta-Descent |
| Open Source Code | Yes | Code available at https://github.com/renmengye/meta-optim-public |
| Open Datasets | Yes | on a multi-layered perceptron (MLP) on MNIST (LeCun et al., 1998). For CIFAR-10 experiments, we used a CNN network adapted from Caffe (Jia et al., 2014). |
| Dataset Splits | No | The paper mentions using standard datasets like MNIST and CIFAR-10 and tracking training loss and test error, but does not explicitly provide specific percentages or counts for training, validation, or test splits. It implies standard splits but doesn't state them. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions the use of Adam optimizer and Caffe, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We chose a 1000-dimensional quadratic cost function... For the optimized schedules, we minimized the expected loss at time T = 250 using Adam (Kingma & Ba, 2015), with a learning rate 0.003 and 500 steps. We set an upper bound for the learning rate which prevented the loss component for any dimension from becoming larger than its initial value... We used a parametric learning rate decay schedule known as inverse time decay (Welling & Teh, 2011): α_t = α_0 / (1 + t/K)^β, where α_0 is the initial learning rate, t is the number of training steps, β is the learning rate decay exponent, and K is the time constant. We jointly optimized α_0 and β. We fixed µ = 0.9, K = 5000 for simplicity... The network had two layers of 100 hidden units, with ReLU activations. Weights were initialized with a zero-mean Gaussian with standard deviation 0.1. We used a warm start from a network trained for 50 steps of SGD with momentum, using α = 0.1, µ = 0.9... For SMD optimization, we trained all hyperparameters in log space using the Adam optimizer, with 5k meta steps... Meta-optimization was done with 100 steps of Adam for every 10 steps of regular training. We adapted the learning rate α and momentum µ. After 25k steps, adaptation was stopped, and we trained for another 25k steps with an exponentially decaying learning rate such that it reached 1e-4 on the last time step. (Minimal code sketches of the decay schedule and of the gradient-based hyperparameter adaptation follow this table.) |
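
The inverse time decay schedule quoted in the Experiment Setup row is straightforward to express in code. Below is a minimal sketch: `inverse_time_decay` is a hypothetical helper name, and the `alpha0` and `beta` values in the usage example are illustrative placeholders rather than the paper's optimized values; only K = 5000 matches the constant quoted above.

```python
import numpy as np

def inverse_time_decay(alpha0, beta, K, t):
    """Inverse time decay schedule: alpha_t = alpha0 / (1 + t/K)**beta."""
    return alpha0 / (1.0 + t / K) ** beta

# Illustrative usage: alpha0 and beta are placeholders; K = 5000 is the
# value the paper reports fixing for its MNIST experiments.
steps = np.arange(25_000)
lrs = inverse_time_decay(alpha0=0.1, beta=1.0, K=5000, t=steps)
print(lrs[0], lrs[-1])  # starts at 0.1, decays roughly 6x by step 25k
```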
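
The Pseudocode and Experiment Setup rows refer to Stochastic Meta-Descent (SMD): hyperparameters such as the learning rate and momentum are trained in log space by back-propagating a meta-objective through unrolled optimization steps. The sketch below illustrates that idea on a toy noisy quadratic using PyTorch autograd. It is not the authors' released implementation (see the GitHub link above), and the dimensionality, horizon `T`, noise scale, and meta learning rate are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Toy noisy quadratic: f(theta) = 0.5 * sum(h * theta**2). The curvatures h,
# the starting point, and all constants below are illustrative, not the
# paper's configuration.
h = torch.logspace(-2, 0, steps=100)   # per-dimension curvatures
theta0 = torch.ones(100)

def unrolled_loss(log_alpha, log_mu, T=20):
    """Run T steps of SGD with momentum and return the final loss.

    Hyperparameters live in log space (as in the paper's SMD setup), so the
    meta-gradient can flow through the unrolled updates via autograd.
    """
    alpha = log_alpha.exp()
    mu = log_mu.exp().clamp(max=0.999)   # keep momentum below 1
    theta, v = theta0.clone(), torch.zeros_like(theta0)
    for _ in range(T):
        grad = h * theta + 0.01 * torch.randn_like(theta)  # noisy gradient
        v = mu * v - alpha * grad
        theta = theta + v
    return 0.5 * (h * theta ** 2).sum()

# Meta-parameters (log learning rate, log momentum) adapted with Adam.
log_alpha = torch.tensor(-2.0, requires_grad=True)
log_mu = torch.tensor(-0.1, requires_grad=True)
meta_opt = torch.optim.Adam([log_alpha, log_mu], lr=0.01)

for meta_step in range(500):
    meta_opt.zero_grad()
    loss = unrolled_loss(log_alpha, log_mu)
    loss.backward()                # differentiate through the unrolled run
    meta_opt.step()

print(f"adapted alpha = {log_alpha.exp().item():.4f}, "
      f"mu = {log_mu.exp().item():.4f}")
```

Because the meta-objective here looks only `T` steps ahead, this is the kind of greedy, short-horizon objective in which the paper diagnoses a bias toward overly small learning rates and overly aggressive decay; the paper's offline setup, as quoted above, uses a longer horizon (T = 250) and an upper bound on the learning rate.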