Meta-Learning Representations for Continual Learning

Authors: Khurram Javed, Martha White

NeurIPS 2019

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we investigate the question: can we learn a representation for continual learning that promotes future learning and reduces interference? We investigate this question by meta-learning the representations offline on a meta-training dataset. At meta-test time, we initialize the continual learner with this representation and measure prediction error as the agent learns the PLN online on a new set of CLP problems (See Figure 1). We evaluate on a simulated regression problem and a sequential classification problem using real data.
Researcher Affiliation | Academia | Khurram Javed, Martha White, Department of Computing Science, University of Alberta, T6G 1P8; kjaved@ualberta.ca, whitem@ualberta.ca
Pseudocode | Yes | Algorithm 1: Meta-Training: MAML-Rep; Algorithm 2: Meta-Training: OML
Open Source Code | Yes | Code accompanying the paper is available at https://github.com/khurramjaved96/mrcl
Open Datasets | Yes | Omniglot is a dataset of over 1623 characters from 50 different alphabets (Lake et al., 2015).
Dataset Splits | Yes | For each of the methods, we separately tune the learning rate on five validation trajectories and report results for the best performing parameter.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., exact GPU/CPU models, memory amounts) used for running its experiments.
Software Dependencies | No | The paper mentions using Adam (Kingma and Ba, 2014) for optimizing the OML objective, but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup | Yes | We use SGD on the MSE loss with a mini-batch size of 8 for online updates, and Adam (Kingma and Ba, 2014) for optimizing the OML objective. At evaluation time, we use the same learning rate as used during the inner updates in the meta-training phase for OML. For our baselines, we do a grid search over learning rates and report the results for the best performing parameter. We use six layers for the RLN and two layers for the PLN. Each hidden layer has a width of 300.
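
The experiment-setup row pins down the architecture (a six-layer RLN feeding a two-layer PLN, hidden width 300) and the online update rule (SGD on the MSE loss with mini-batches of 8, applied to the PLN while the meta-learned representation stays fixed, as described in the research-type row). Below is a minimal PyTorch sketch of that configuration; the input/output sizes, learning rate, and all names are illustrative assumptions, not values from the paper or its released code.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the described network: a six-layer representation
# learning network (RLN) feeding a two-layer prediction learning network (PLN),
# hidden width 300. Input/output sizes and the learning rate are assumed, not
# taken from the paper; the classification experiments would use a different
# head and loss.

def mlp(sizes):
    """Stack Linear layers of the given sizes with ReLU between them."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

input_dim, output_dim, width = 10, 1, 300   # assumed dimensions
rln = mlp([input_dim] + [width] * 6)        # six-layer RLN, frozen at meta-test time
pln = mlp([width, width, output_dim])       # two-layer PLN, learned online

# Online (meta-test) updates: SGD on the MSE loss with mini-batches of 8,
# applied to the PLN only, on top of the fixed meta-learned representation.
inner_lr = 1e-3                             # placeholder learning rate
online_opt = torch.optim.SGD(pln.parameters(), lr=inner_lr)
loss_fn = nn.MSELoss()

def online_update(x_batch, y_batch):
    """One online step on a mini-batch (e.g., 8 samples) from the current task."""
    with torch.no_grad():
        features = rln(x_batch)             # representation is not updated online
    loss = loss_fn(pln(features), y_batch)
    online_opt.zero_grad()
    loss.backward()
    online_opt.step()
    return loss.item()
```

Per the setup row, at evaluation time the placeholder learning rate would be replaced by the same value used for the inner updates during meta-training.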
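
The pseudocode row lists two meta-training procedures (MAML-Rep and OML), and the software-dependencies row notes that Adam is used to optimize the OML objective. Purely as a hypothetical illustration of how such a meta-step could look, the sketch below runs differentiable inner SGD on the PLN over a sampled trajectory and then takes an Adam step on a meta-loss computed through those updates. The batch construction, function names, and the choice to meta-update both RLN and PLN parameters are my assumptions and may differ from the authors' implementation at https://github.com/khurramjaved96/mrcl.

```python
import torch

# Hypothetical sketch of one OML-style meta-training step, reusing the rln/pln
# modules and sizes from the previous sketch. The inner loop applies SGD to the
# PLN functionally so the meta-gradient can flow through the updates; the outer
# loop takes an Adam step on a meta-loss evaluated after those updates.

def pln_forward(params, h):
    """Apply the two-layer PLN with an explicit parameter list [w1, b1, w2, b2]."""
    w1, b1, w2, b2 = params
    return torch.relu(h @ w1.t() + b1) @ w2.t() + b2

def oml_meta_step(rln, pln_params, meta_opt, traj, rand_batch, inner_lr):
    """Differentiable inner SGD on the PLN over `traj`, then one Adam meta-update."""
    fast = list(pln_params)
    for x, y in traj:                                   # inner loop over the trajectory
        inner_loss = ((pln_forward(fast, rln(x)) - y) ** 2).mean()
        grads = torch.autograd.grad(inner_loss, fast, create_graph=True)
        fast = [p - inner_lr * g for p, g in zip(fast, grads)]

    # Meta-loss on the trajectory plus a random batch, using the updated PLN.
    x_meta = torch.cat([x for x, _ in traj] + [rand_batch[0]])
    y_meta = torch.cat([y for _, y in traj] + [rand_batch[1]])
    meta_loss = ((pln_forward(fast, rln(x_meta)) - y_meta) ** 2).mean()

    meta_opt.zero_grad()
    meta_loss.backward()                                # gradients reach RLN and initial PLN
    meta_opt.step()
    return meta_loss.item()

# Illustrative usage with arbitrary data:
meta_opt = torch.optim.Adam(list(rln.parameters()) + list(pln.parameters()), lr=1e-4)
traj = [(torch.randn(8, input_dim), torch.randn(8, 1)) for _ in range(5)]
rand_batch = (torch.randn(8, input_dim), torch.randn(8, 1))
oml_meta_step(rln, list(pln.parameters()), meta_opt, traj, rand_batch, inner_lr=1e-2)
```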