Memory-Based Meta-Learning on Non-Stationary Distributions
Authors: Tim Genewein, Gregoire Deletang, Anian Ruoss, Li Kevin Wenliang, Elliot Catt, Vincent Dutordoir, Jordi Grau-Moya, Laurent Orseau, Marcus Hutter, Joel Veness
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We investigate these questions empirically in Section 7. The general approach for our experiments is to train various memory-based neural models according to the MBML training setup described in Section 3. We explore multiple neural architectures to get a better sense as to how architectural features influence the quality of the meta-learned Bayesian approximation. After training, we evaluate models either on data drawn from the same meta-distribution as during training (on-distribution experiments) or from a different distribution (off-distribution experiments). We quantify prediction performance by the expected cumulative regret (called redundancy in information theory) with respect to the ground-truth piecewise data generating source µ, quantifying the expected excess log loss of the neural predictor. *(A sketch of this regret computation appears below the table.)* |
| Researcher Affiliation | Collaboration | 1 DeepMind, 2 University of Cambridge. Correspondence to: Tim Genewein <timgen@deepmind.com>, Gregoire Deletang <gdelt@deepmind.com>, Anian Ruoss <anianr@deepmind.com>. |
| Pseudocode | Yes | Algorithm 1 TPSd(o) Algorithm 2 LIN-PRIOR-SAMPLE(n) |
| Open Source Code | Yes | Source code available at: https://github.com/deepmind/nonstationary_mbml. |
| Open Datasets | No | The paper describes a process for *generating* data from specific priors (e.g., PTW prior, LIN prior, Regular Periodic, Random Uniform) rather than using a pre-existing public dataset. While the data generation *process* is detailed and code is provided, there is no explicit link, DOI, or citation to a publicly accessible pre-generated dataset. |
| Dataset Splits | No | The paper describes how data is generated for training and evaluation. It discusses "on-distribution experiments" and "off-distribution experiments" but does not specify exact train/validation/test splits (e.g., percentages or absolute counts) of a fixed dataset. Data is generated on the fly for these purposes. |
| Hardware Specification | No | We ran each distribution-architecture-hyperparameter triplet on a single GPU on our internal cluster. |
| Software Dependencies | No | During training, parameters are updated via mini-batch stochastic gradient descent using ADAM. The paper mentions the ADAM optimizer but does not provide specific version numbers for any software components or libraries (e.g., Python, PyTorch, TensorFlow versions). *(A minimal training-loop sketch appears below the table.)* |
| Experiment Setup | Yes | We conducted an initial ablation study to determine architecture hyperparameters (see Appendix A). The experimental results shown in Section 7 use the hyperparameter-set that led to the lowest expected cumulative redundancy in the ablations (we provide the exact values in Appendix A). Appendix A details parameters such as "hidden sizes: 64, 128 and 256", "number of dense layers", "stack size (1, 8 or total sequence length, e.g., 256) and the stack cell width (1, 2 and 8 dimensions)", "embedding size of 64 and 8 heads", and "number of layers: 2, 4, 8 and 16" for different network architectures. *(A hedged sketch of this ablation grid appears below the table.)* |
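
The Research Type row above quotes the paper's evaluation metric: expected cumulative regret, i.e. redundancy, the expected excess log loss of the neural predictor relative to the ground-truth piecewise source µ. The sketch below shows one way such a quantity can be computed from per-symbol log-probabilities; the function name, array layout, and use of NumPy are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def expected_cumulative_regret(model_log_probs, source_log_probs):
    """Expected cumulative regret (redundancy) as a function of sequence length.

    model_log_probs:  [batch, time] log-probabilities the neural predictor
                      assigned to the observed symbols.
    source_log_probs: [batch, time] log-probabilities of the same symbols
                      under the ground-truth piecewise generating source.
    """
    # Per-step excess log loss of the model over the true source.
    per_step = source_log_probs - model_log_probs        # [batch, time]
    # Accumulate over time, then average over the sampled sequences.
    return np.cumsum(per_step, axis=-1).mean(axis=0)     # [time]
```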
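
The Software Dependencies row states only that parameters are updated by mini-batch stochastic gradient descent with the ADAM optimizer. The sketch below is a minimal training loop in that spirit, assuming a PyTorch-style setup; the framework choice, the LSTM predictor, the placeholder data sampler, and all hyperparameter values are assumptions for illustration, not the authors' configuration.

```python
import torch
from torch import nn

ALPHABET, HIDDEN, BATCH, TIME = 2, 128, 32, 256   # illustrative sizes

model = nn.LSTM(input_size=ALPHABET, hidden_size=HIDDEN, batch_first=True)
readout = nn.Linear(HIDDEN, ALPHABET)
opt = torch.optim.Adam(list(model.parameters()) + list(readout.parameters()), lr=1e-4)

def sample_batch():
    """Placeholder for drawing sequences from the meta-distribution
    (e.g. a PTW or LIN prior over piecewise-stationary sources)."""
    y = torch.randint(0, ALPHABET, (BATCH, TIME))
    x = nn.functional.one_hot(y, ALPHABET).float()
    return x, y

for step in range(1000):                           # training budget (illustrative)
    x, y = sample_batch()
    hidden, _ = model(x[:, :-1])                   # condition on past symbols only
    logits = readout(hidden)                       # [batch, time-1, alphabet]
    loss = nn.functional.cross_entropy(            # per-symbol log loss
        logits.reshape(-1, ALPHABET), y[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```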
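
The Experiment Setup row lists the hyperparameter values swept in the Appendix A ablation. The dictionary below is one hedged reading of that grid, assembled only from the quoted phrases; the grouping and key names are my own, and the selected (winning) settings are not reproduced here.

```python
# Inferred ablation grid; see Appendix A of the paper for the exact sweep
# and the hyperparameters ultimately selected.
ablation_grid = {
    "hidden_sizes": [64, 128, 256],
    # "number of dense layers" is also swept, but its values are not quoted above.
    "stack_size": [1, 8, 256],              # 256 = total sequence length
    "stack_cell_width": [1, 2, 8],
    "transformer_embedding_size": 64,
    "transformer_num_heads": 8,
    "transformer_num_layers": [2, 4, 8, 16],
}
```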