Meta-Learning Update Rules for Unsupervised Representation Learning

Authors: Luke Metz, Niru Maheswaranathan, Brian Cheung, Jascha Sohl-Dickstein

ICLR 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | 5 EXPERIMENTAL RESULTS
Researcher Affiliation | Collaboration | Luke Metz, Google Brain (lmetz@google.com); Niru Maheswaranathan, Google Brain (nirum@google.com); Brian Cheung, University of California, Berkeley (bcheung@berkeley.edu); Jascha Sohl-Dickstein, Google Brain (jaschasd@google.com)
Pseudocode | Yes | Algorithm 1: Distributed Training Algorithm
Open Source Code | Yes | Additionally, code and meta-trained parameters θ for our meta-learned Unsupervised Update are available at https://github.com/tensorflow/models/tree/master/research/learning_unsupervised_learning
Open Datasets | Yes | We construct a set of training tasks consisting of CIFAR10 (Krizhevsky and Hinton, 2009) and multi-class classification from subsets of classes from Imagenet (Russakovsky et al., 2015)... For evaluation, we use MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), IMDB (Maas et al., 2011)...
Dataset Splits | Yes | Our train set consists of Mini Imagenet, Alphabet, and Mini CIFAR. Our test sets are Mini Imagenet Test, Tiny Fashion MNIST, Tiny MNIST, and IMDB. ... In order to encourage the learning of features that generalize well, we estimate the linear regression weights on one minibatch {xa, ya} of K data points and evaluate the classification performance on a second minibatch {xb, yb}, also with K data points. (See the ridge-regression sketch after this table.)
Hardware Specification | No | Due to the small base models and the sequential nature of our compute workloads, we use multi-core CPUs as opposed to GPUs.
Software Dependencies | No | We implement the above models in distributed TensorFlow (Abadi et al., 2016).
Experiment Setup | Yes | We sample the number of layers uniformly between 2-5 and the number of units per layer logarithmically between 64 and 512. ... Training takes 8 days and consists of 200 thousand updates to θ with minibatch size 256. ... We use a learning rate schedule of 3e-4 for the first 100k steps, then 1e-4 for the next 50k steps, then 2e-5 for the remainder of meta-training. We use gradient clipping of norm 5 on minibatches of size 256. We compute our meta-objective by averaging 5 evaluations of the linear regression. We use a ridge penalty of 0.1 for all of this work. (See the setup sketches after this table.)
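
As a reading aid for the Dataset Splits and Experiment Setup rows above, the following is a minimal NumPy sketch of the few-shot linear-regression meta-objective the quoted text describes: ridge-regression weights are fit on features from one minibatch and classification performance is then measured on a second minibatch, using the ridge penalty of 0.1 from the setup. Function and variable names are illustrative, and the returned accuracy is a simplification of the paper's evaluation, not the released implementation.

import numpy as np

def linear_regression_meta_objective(feats_a, labels_a, feats_b, labels_b,
                                     ridge_penalty=0.1):
    # feats_*: [K, D] base-model features; labels_*: [K] integer class labels.
    num_classes = int(max(labels_a.max(), labels_b.max())) + 1
    targets_a = np.eye(num_classes)[labels_a]                # one-hot targets, [K, C]
    dim = feats_a.shape[1]
    # Closed-form ridge regression fit on minibatch A: W = (X^T X + lambda I)^-1 X^T Y
    gram = feats_a.T @ feats_a + ridge_penalty * np.eye(dim)
    weights = np.linalg.solve(gram, feats_a.T @ targets_a)   # [D, C]
    # Evaluate on the held-out minibatch B to reward features that generalize.
    preds = (feats_b @ weights).argmax(axis=1)
    return float((preds == labels_b).mean())

Per the setup row, the paper averages 5 such evaluations to form the meta-objective; backpropagating meta-gradients through it would require a differentiable surrogate rather than the argmax accuracy used in this sketch.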
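
Two further details from the Experiment Setup row can be written out compactly: the base-architecture sampling (depth uniform over 2-5 layers, width log-uniform between 64 and 512 units) and the piecewise-constant meta-training learning-rate schedule. The code below is an illustrative reconstruction of the quoted description under those assumptions, not the authors' code.

import numpy as np

def sample_base_architecture(rng=np.random):
    # Depth drawn uniformly from 2-5 layers; units per layer drawn
    # log-uniformly between 64 and 512, as quoted above.
    num_layers = rng.randint(2, 6)
    return [int(np.exp(rng.uniform(np.log(64), np.log(512))))
            for _ in range(num_layers)]

def meta_learning_rate(step):
    # 3e-4 for the first 100k meta-training steps, 1e-4 for the next 50k,
    # then 2e-5 for the remainder of meta-training.
    if step < 100_000:
        return 3e-4
    if step < 150_000:
        return 1e-4
    return 2e-5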