Memory-Based Meta-Learning on Non-Stationary Distributions

Authors: Tim Genewein, Gregoire Deletang, Anian Ruoss, Li Kevin Wenliang, Elliot Catt, Vincent Dutordoir, Jordi Grau-Moya, Laurent Orseau, Marcus Hutter, Joel Veness

ICML 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We investigate these questions empirically in Section 7. The general approach for our experiments is to train various memory-based neural models according to the MBML training setup described in Section 3. We explore multiple neural architectures to get a better sense as to how architectural features influence the quality of the meta-learned Bayesian approximation. After training, we evaluate models either on data drawn from the same meta-distribution as during training (on-distribution experiments) or from a different distribution (off-distribution experiments). We quantify prediction performance by the expected cumulative regret (called redundancy in information theory) with respect to the ground-truth piecewise data generating source µ, quantifying the expected excess log loss of the neural predictor. (A hedged sketch of this redundancy computation is given after the table.)
Researcher Affiliation | Collaboration | 1DeepMind, 2University of Cambridge. Correspondence to: Tim Genewein <timgen@deepmind.com>, Gregoire Deletang <gdelt@deepmind.com>, Anian Ruoss <anianr@deepmind.com>.
Pseudocode | Yes | Algorithm 1: TPSd(o); Algorithm 2: LIN-PRIOR-SAMPLE(n)
Open Source Code | Yes | Source code available at: https://github.com/deepmind/nonstationary_mbml
Open Datasets | No | The paper describes a process for *generating* data from specific priors (e.g., PTW prior, LIN prior, Regular Periodic, Random Uniform) rather than using a pre-existing public dataset. While the data generation *process* is detailed and code is provided, there is no explicit link, DOI, or citation to a publicly accessible pre-generated dataset. (An illustrative sketch of such piecewise-stationary data generation follows the table.)
Dataset Splits | No | The paper describes how data is generated for training and evaluation. It discusses "on-distribution experiments" and "off-distribution experiments" but does not specify exact train/validation/test splits (e.g., percentages or absolute counts) of a fixed dataset. Data is generated on the fly for these purposes.
Hardware Specification | No | We ran each distribution-architecture-hyperparameter triplet on a single GPU on our internal cluster.
Software Dependencies | No | During training, parameters are updated via mini-batch stochastic gradient descent using ADAM. The paper mentions the ADAM optimizer but does not provide specific version numbers for any software components or libraries (e.g., Python, PyTorch, TensorFlow versions). (The corresponding log-loss training objective is sketched after the table.)
Experiment Setup | Yes | We conducted an initial ablation study to determine architecture hyperparameters (see Appendix A). The experimental results shown in Section 7 use the hyperparameter-set that led to the lowest expected cumulative redundancy in the ablations (we provide the exact values in Appendix A). Appendix A details parameters such as "hidden sizes: 64, 128 and 256", "number of dense layers", "stack size (1, 8 or total sequence length, e.g., 256) and the stack cell width (1, 2 and 8 dimensions)", "embedding size of 64 and 8 heads", and "number of layers: 2, 4, 8 and 16" for different network architectures.
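
For context on the evaluation metric quoted in the Research Type row: the expected cumulative regret (redundancy) of a predictor π relative to the ground-truth source µ is the expected excess log loss, E_µ[ Σ_{t=1}^{n} ( log µ(x_t | x_{<t}) − log π(x_t | x_{<t}) ) ]. The snippet below is a minimal, hedged sketch of estimating this quantity by Monte Carlo for binary sequences; the names (cumulative_redundancy, sample_sequence, etc.) are illustrative placeholders, not the paper's actual API.

```python
import numpy as np

def cumulative_redundancy(mu_log_probs, model_log_probs):
    """Per-sequence cumulative regret (redundancy) of a predictor.

    Args:
      mu_log_probs: float array [T] of log mu(x_t | x_<t) under the true source.
      model_log_probs: float array [T] of log pi(x_t | x_<t) under the predictor.

    Returns:
      float array [T]: cumulative excess log loss up to each step t.
    """
    return np.cumsum(mu_log_probs - model_log_probs)

def expected_cumulative_redundancy(sample_sequence, n_sequences=1000):
    """Monte Carlo estimate of the expected cumulative redundancy.

    `sample_sequence` is a hypothetical callback that draws one sequence from
    the (piecewise-stationary) source and returns the two per-step log-prob
    arrays; all sequences are assumed to have the same length.
    """
    curves = [cumulative_redundancy(*sample_sequence()) for _ in range(n_sequences)]
    return np.stack(curves).mean(axis=0)
```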
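
On the Open Datasets and Dataset Splits rows: because data is generated on the fly from change-point priors, no pre-packaged dataset is involved. The snippet below is a minimal sketch, assuming a simple regular-periodic prior over change points with uniformly drawn Bernoulli biases; it is an illustration only and does not reproduce the paper's PTW or LIN prior samplers (those are in the released repository). The function name and default values (seq_len=256, period=32) are assumptions.

```python
import numpy as np

def sample_regular_periodic_sequence(seq_len=256, period=32, rng=None):
    """Sample one piecewise-stationary Bernoulli sequence.

    A new Bernoulli bias is drawn uniformly from [0, 1] every `period` steps
    (a regular-periodic change-point pattern); observations within a segment
    are i.i.d. draws under that segment's bias.
    """
    rng = rng if rng is not None else np.random.default_rng()
    xs = np.empty(seq_len, dtype=np.int64)
    for start in range(0, seq_len, period):
        bias = rng.uniform()                       # segment parameter ~ Uniform(0, 1)
        end = min(start + period, seq_len)
        xs[start:end] = rng.binomial(1, bias, size=end - start)
    return xs

# Example: one freshly generated mini-batch of training sequences.
batch = np.stack([sample_regular_periodic_sequence() for _ in range(8)])
```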
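
On the Software Dependencies and Experiment Setup rows: the quoted training procedure (mini-batch stochastic gradient descent with ADAM) amounts to minimizing the standard sequential log loss over sequences drawn from the meta-distribution. Stated as a formula (a paraphrase of the MBML objective the paper describes, not a verbatim reproduction):

\[
\mathcal{L}(\theta) \;=\; \mathbb{E}_{\mu \sim p(\mu),\; x_{1:n} \sim \mu}\!\left[ -\sum_{t=1}^{n} \log \pi_\theta\!\left(x_t \mid x_{<t}\right) \right],
\]

with the network parameters θ updated on mini-batch gradient estimates of this loss using ADAM.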