Understanding Short-Horizon Bias in Stochastic Meta-Optimization
Authors: Yuhuai Wu, Mengye Ren, Renjie Liao, Roger Grosse.
ICLR 2018 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We then run meta-optimization experiments (both offline and online) on standard benchmark datasets, showing that meta-optimization chooses too small a learning rate by multiple orders of magnitude, even when run with a moderately long time horizon (100 steps) typical of work in the area. |
| Researcher Affiliation | Academia | Yuhuai Wu, Mengye Ren, Renjie Liao & Roger B. Grosse, University of Toronto and Vector Institute, {ywu, mren, rjliao, rgrosse}@cs.toronto.edu |
| Pseudocode | Yes | Algorithm 1: Stochastic Meta-Descent |
| Open Source Code | Yes | Code available at https://github.com/renmengye/meta-optim-public |
| Open Datasets | Yes | on a multi-layered perceptron (MLP) on MNIST (LeCun et al., 1998). For CIFAR-10 experiments, we used a CNN network adapted from Caffe (Jia et al., 2014). |
| Dataset Splits | No | The paper mentions using standard datasets like MNIST and CIFAR-10 and tracking training loss and test error, but does not explicitly provide specific percentages or counts for training, validation, or test splits. It implies standard splits but doesn't state them. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or cloud instance types) used for running the experiments. |
| Software Dependencies | No | The paper mentions the use of Adam optimizer and Caffe, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We chose a 1000-dimensional quadratic cost function... For the optimized schedules, we minimized the expected loss at time T = 250 using Adam (Kingma & Ba, 2015), with a learning rate 0.003 and 500 steps. We set an upper bound for the learning rate which prevented the loss component for any dimension from becoming larger than its initial value... We used a parametric learning rate decay schedule known as inverse time decay (Welling & Teh, 2011): α_t = α_0 / (1 + t/K)^β, where α_0 is the initial learning rate, t is the number of training steps, β is the learning rate decay exponent, and K is the time constant. We jointly optimized α_0 and β. We fixed µ = 0.9, K = 5000 for simplicity... The network had two layers of 100 hidden units, with ReLU activations. Weights were initialized with a zero-mean Gaussian with standard deviation 0.1. We used a warm start from a network trained for 50 steps of SGD with momentum, using α = 0.1, µ = 0.9... For SMD optimization, we trained all hyperparameters in log space using the Adam optimizer, with 5k meta steps... Meta-optimization was done with 100 steps of Adam for every 10 steps of regular training. We adapted the learning rate α and momentum µ. After 25k steps, adaptation was stopped, and we trained for another 25k steps with an exponentially decaying learning rate such that it reached 1e-4 on the last time step. (Minimal code sketches of the decay schedule and of the gradient-based hyperparameter adaptation follow this table.) |
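
The inverse time decay schedule quoted in the Experiment Setup row is straightforward to express in code. Below is a minimal sketch: `inverse_time_decay` is a hypothetical helper name, and the `alpha0` and `beta` values in the usage example are illustrative placeholders rather than the paper's optimized values; only K = 5000 matches the constant quoted above.

```python
import numpy as np

def inverse_time_decay(alpha0, beta, K, t):
    """Inverse time decay schedule: alpha_t = alpha0 / (1 + t/K)**beta."""
    return alpha0 / (1.0 + t / K) ** beta

# Illustrative usage: alpha0 and beta are placeholders; K = 5000 is the
# value the paper reports fixing for its MNIST experiments.
steps = np.arange(25_000)
lrs = inverse_time_decay(alpha0=0.1, beta=1.0, K=5000, t=steps)
print(lrs[0], lrs[-1])  # starts at 0.1, decays roughly 6x by step 25k
```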
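
The Pseudocode and Experiment Setup rows refer to Stochastic Meta-Descent (SMD): hyperparameters such as the learning rate and momentum are trained in log space by back-propagating a meta-objective through unrolled optimization steps. The sketch below illustrates that idea on a toy noisy quadratic using PyTorch autograd. It is not the authors' released implementation (see the GitHub link above), and the dimensionality, horizon `T`, noise scale, and meta learning rate are all illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Toy noisy quadratic: f(theta) = 0.5 * sum(h * theta**2). The curvatures h,
# the starting point, and all constants below are illustrative, not the
# paper's configuration.
h = torch.logspace(-2, 0, steps=100)   # per-dimension curvatures
theta0 = torch.ones(100)

def unrolled_loss(log_alpha, log_mu, T=20):
    """Run T steps of SGD with momentum and return the final loss.

    Hyperparameters live in log space (as in the paper's SMD setup), so the
    meta-gradient can flow through the unrolled updates via autograd.
    """
    alpha = log_alpha.exp()
    mu = log_mu.exp().clamp(max=0.999)   # keep momentum below 1
    theta, v = theta0.clone(), torch.zeros_like(theta0)
    for _ in range(T):
        grad = h * theta + 0.01 * torch.randn_like(theta)  # noisy gradient
        v = mu * v - alpha * grad
        theta = theta + v
    return 0.5 * (h * theta ** 2).sum()

# Meta-parameters (log learning rate, log momentum) adapted with Adam.
log_alpha = torch.tensor(-2.0, requires_grad=True)
log_mu = torch.tensor(-0.1, requires_grad=True)
meta_opt = torch.optim.Adam([log_alpha, log_mu], lr=0.01)

for meta_step in range(500):
    meta_opt.zero_grad()
    loss = unrolled_loss(log_alpha, log_mu)
    loss.backward()                # differentiate through the unrolled run
    meta_opt.step()

print(f"adapted alpha = {log_alpha.exp().item():.4f}, "
      f"mu = {log_mu.exp().item():.4f}")
```

Because the meta-objective here looks only `T` steps ahead, this is the kind of greedy, short-horizon objective in which the paper diagnoses a bias toward overly small learning rates and overly aggressive decay; the paper's offline setup, as quoted above, uses a longer horizon (T = 250) and an upper bound on the learning rate.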