Meta-Learning Update Rules for Unsupervised Representation Learning
Authors: Luke Metz, Niru Maheswaranathan, Brian Cheung, Jascha Sohl-Dickstein
ICLR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 5 EXPERIMENTAL RESULTS |
| Researcher Affiliation | Collaboration | Luke Metz (Google Brain, lmetz@google.com); Niru Maheswaranathan (Google Brain, nirum@google.com); Brian Cheung (University of California, Berkeley, bcheung@berkeley.edu); Jascha Sohl-Dickstein (Google Brain, jaschasd@google.com) |
| Pseudocode | Yes | Algorithm 1: Distributed Training Algorithm |
| Open Source Code | Yes | Additionally, code and meta-trained parameters θ for our meta-learned Unsupervised Update is available at https://github.com/tensorflow/models/tree/master/research/learning_unsupervised_learning |
| Open Datasets | Yes | We construct a set of training tasks consisting of CIFAR10 (Krizhevsky and Hinton, 2009) and multi-class classification from subsets of classes from Imagenet (Russakovsky et al., 2015)... For evaluation, we use MNIST (Le Cun et al., 1998), Fashion MNIST (Xiao et al., 2017), IMDB (Maas et al., 2011)... |
| Dataset Splits | Yes | Our train set consists of Mini Imagenet, Alphabet, and Mini CIFAR. Our test sets are Mini Imagenet Test, Tiny Fashion MNIST, Tiny MNIST and IMDB. ... In order to encourage the learning of features that generalize well, we estimate the linear regression weights on one minibatch {xa, ya} of K data points, and evaluate the classification performance on a second minibatch {xb, yb} also with K datapoints |
| Hardware Specification | No | Due to the small base models and the sequential nature of our compute workloads, we use multi-core CPUs as opposed to GPUs. |
| Software Dependencies | No | We implement the above models in distributed TensorFlow (Abadi et al., 2016). |
| Experiment Setup | Yes | We sample the number of layers uniformly between 2-5 and the number of units per layer logarithmically between 64 and 512. ... Training takes 8 days, and consists of 200 thousand updates to θ with minibatch size 256. ... We use a learning rate schedule of 3e-4 for the first 100k steps, then 1e-4 for the next 50k steps, then 2e-5 for the remainder of meta-training. We use gradient clipping of norm 5 on minibatches of size 256. We compute our meta-objective by averaging 5 evaluations of the linear regression (a minimal sketch of this evaluation follows the table). We use a ridge penalty of 0.1 throughout this work. |
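
The Dataset Splits and Experiment Setup rows describe how the meta-objective is evaluated: a ridge-penalized linear regression is fit to features on one minibatch and its classification performance is scored on a second, held-out minibatch. The sketch below, using NumPy and a hypothetical `ridge_meta_objective` helper, illustrates that closed-form fit-and-evaluate step under those assumptions; it is not the authors' implementation, which runs in distributed TensorFlow.

```python
import numpy as np

def ridge_meta_objective(feats_a, labels_a, feats_b, labels_b,
                         ridge_penalty=0.1, n_classes=10):
    """Sketch: fit ridge regression to one-hot labels on minibatch A,
    then report classification accuracy on a held-out minibatch B."""
    y_a = np.eye(n_classes)[labels_a]                 # K x C one-hot targets
    d = feats_a.shape[1]
    # Closed-form ridge solution: W = (X^T X + lambda I)^{-1} X^T Y
    gram = feats_a.T @ feats_a + ridge_penalty * np.eye(d)
    weights = np.linalg.solve(gram, feats_a.T @ y_a)  # d x C
    # Score on the second minibatch; accuracy stands in for the
    # "classification performance" quoted above.
    preds = (feats_b @ weights).argmax(axis=1)
    return (preds == labels_b).mean()

# Example with random stand-in features (K = 256 points, 128-dim embeddings).
rng = np.random.default_rng(0)
feats_a, feats_b = rng.normal(size=(256, 128)), rng.normal(size=(256, 128))
labels_a, labels_b = rng.integers(0, 10, 256), rng.integers(0, 10, 256)
print(ridge_meta_objective(feats_a, labels_a, feats_b, labels_b))
```

Per the Experiment Setup row, the paper averages 5 such evaluations to form its meta-objective; the accuracy returned here is only an illustrative proxy for the classification-performance measure used during meta-training.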