Memory-Based Meta-Learning on Non-Stationary Distributions
Authors: Tim Genewein, Gregoire Deletang, Anian Ruoss, Li Kevin Wenliang, Elliot Catt, Vincent Dutordoir, Jordi Grau-Moya, Laurent Orseau, Marcus Hutter, Joel Veness
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate these questions empirically in Section 7. The general approach for our experiments is to train various memory-based neural models according to the MBML training setup described in Section 3. We explore multiple neural architectures to get a better sense as to how architectural features influence the quality of the meta-learned Bayesian approximation. After training, we evaluate models either on data drawn from the same meta-distribution as during training (on-distribution experiments) or from a different distribution (off-distribution experiments). We quantify prediction performance by the expected cumulative regret (called redundancy in information theory) with respect to the ground-truth piecewise data generating source µ, quantifying the expected excess log loss of the neural predictor. *(A sketch of this regret computation appears below the table.)* |
| Researcher Affiliation | Collaboration | 1 DeepMind, 2 University of Cambridge. Correspondence to: Tim Genewein <timgen@deepmind.com>, Gregoire Deletang <gdelt@deepmind.com>, Anian Ruoss <anianr@deepmind.com>. |
| Pseudocode | Yes | Algorithm 1 TPSd(o) Algorithm 2 LIN-PRIOR-SAMPLE(n) |
| Open Source Code | Yes | Source code available at: https://github.com/deepmind/nonstationary_mbml. |
| Open Datasets | No | The paper describes a process for *generating* data from specific priors (e.g., PTW prior, LIN prior, Regular Periodic, Random Uniform) rather than using a pre-existing public dataset. While the data generation *process* is detailed and code is provided, there is no explicit link, DOI, or citation to a publicly accessible pre-generated dataset. |
| Dataset Splits | No | The paper describes how data is generated for training and evaluation. It discusses "on-distribution experiments" and "off-distribution experiments" but does not specify exact train/validation/test splits (e.g., percentages or absolute counts) of a fixed dataset. Data is generated on the fly for these purposes. |
| Hardware Specification | No | We ran each distribution-architecture-hyperparameter triplet on a single GPU on our internal cluster. |
| Software Dependencies | No | During training, parameters are updated via mini-batch stochastic gradient descent using ADAM. The paper mentions the ADAM optimizer but does not provide specific version numbers for any software components or libraries (e.g., Python, PyTorch, TensorFlow versions). *(A minimal training-loop sketch appears below the table.)* |
| Experiment Setup | Yes | We conducted an initial ablation study to determine architecture hyperparameters (see Appendix A). The experimental results shown in Section 7 use the hyperparameter-set that led to the lowest expected cumulative redundancy in the ablations (we provide the exact values in Appendix A). Appendix A details parameters such as "hidden sizes: 64, 128 and 256", "number of dense layers", "stack size (1, 8 or total sequence length, e.g., 256) and the stack cell width (1, 2 and 8 dimensions)", "embedding size of 64 and 8 heads", and "number of layers: 2, 4, 8 and 16" for different network architectures. *(A hedged sketch of this ablation grid appears below the table.)* |
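
The Research Type row above quotes the paper's evaluation metric: expected cumulative regret, i.e. redundancy, the expected excess log loss of the neural predictor relative to the ground-truth piecewise source µ. The sketch below shows one way such a quantity can be computed from per-symbol log-probabilities; the function name, array layout, and use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def expected_cumulative_regret(model_log_probs, source_log_probs):
    """Expected cumulative regret (redundancy) as a function of sequence length.

    model_log_probs:  [batch, time] log-probabilities the neural predictor
                      assigned to the observed symbols.
    source_log_probs: [batch, time] log-probabilities of the same symbols
                      under the ground-truth piecewise generating source.
    """
    # Per-step excess log loss of the model over the true source.
    per_step = source_log_probs - model_log_probs        # [batch, time]
    # Accumulate over time, then average over the sampled sequences.
    return np.cumsum(per_step, axis=-1).mean(axis=0)     # [time]
```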
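
The Software Dependencies row states only that parameters are updated by mini-batch stochastic gradient descent with the ADAM optimizer. The sketch below is a minimal training loop in that spirit, assuming a PyTorch-style setup; the framework choice, the LSTM predictor, the placeholder data sampler, and all hyperparameter values are assumptions for illustration, not the authors' configuration.

```python
import torch
from torch import nn

ALPHABET, HIDDEN, BATCH, TIME = 2, 128, 32, 256   # illustrative sizes

model = nn.LSTM(input_size=ALPHABET, hidden_size=HIDDEN, batch_first=True)
readout = nn.Linear(HIDDEN, ALPHABET)
opt = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()), lr=1e-4)

def sample_batch():
    """Placeholder for drawing sequences from the meta-distribution
    (e.g. a PTW or LIN prior over piecewise-stationary sources)."""
    y = torch.randint(0, ALPHABET, (BATCH, TIME))
    x = nn.functional.one_hot(y, ALPHABET).float()
    return x, y

for step in range(1000):                           # training budget (illustrative)
    x, y = sample_batch()
    hidden, _ = model(x[:, :-1])                   # condition on past symbols only
    logits = readout(hidden)                       # [batch, time-1, alphabet]
    loss = nn.functional.cross_entropy(            # per-symbol log loss
        logits.reshape(-1, ALPHABET), y[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```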
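
The Experiment Setup row lists the hyperparameter values swept in the Appendix A ablation. The dictionary below is one hedged reading of that grid, assembled only from the quoted phrases; the grouping and key names are my own, and the selected (winning) settings are not reproduced here.

```python
# Inferred ablation grid; see Appendix A of the paper for the exact sweep
# and the hyperparameters ultimately selected.
ablation_grid = {
    "hidden_sizes": [64, 128, 256],
    # "number of dense layers" is also swept, but its values are not quoted above.
    "stack_size": [1, 8, 256],              # 256 = total sequence length
    "stack_cell_width": [1, 2, 8],
    "transformer_embedding_size": 64,
    "transformer_num_heads": 8,
    "transformer_num_layers": [2, 4, 8, 16],
}
```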