Meta-trained agents implement Bayes-optimal agents
Authors: Vladimir Mikulik, Grégoire Delétang, Tom McGrath, Tim Genewein, Miljan Martic, Shane Legg, Pedro Ortega
NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically investigate this claim on a number of prediction and bandit tasks. [...] Thus, our main contribution is the investigation of the computational structure of RNN-based meta-learned solutions. Specifically, we compare the computations of meta-learned agents against the computations of Bayes-optimal agents in terms of their behaviour and internal representations on a set of prediction and reinforcement learning tasks with known optimal solutions. |
| Researcher Affiliation | Industry | Vladimir Mikulik, Grégoire Delétang, Tom McGrath, Tim Genewein, Miljan Martic, Shane Legg, Pedro A. Ortega; DeepMind, London, UK |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link for the open-sourcing of its code. |
| Open Datasets | No | The paper defines custom tasks using specific probability distributions (e.g., Bernoulli, categorical, exponential, Gaussian) from which data is generated, rather than using a pre-existing, publicly available dataset with a specific name and source. |
| Dataset Splits | No | The paper does not specify distinct training, validation, and test splits in a way that would allow direct reproduction of data partitioning for a fixed dataset. It mentions evaluating 'across many checkpoints of a training run' but does not define a dedicated validation set. |
| Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types, memory specifications) used for running the experiments. |
| Software Dependencies | No | The paper mentions software components like BPTT, Adam, and Impala but does not provide specific version numbers for any of them. |
| Experiment Setup | Yes | We selected N = 32 for prediction tasks and N = 256 for bandit tasks. Networks were trained with BPTT [23, 24] and Adam [38]. In prediction tasks the loss function is the log-loss of the prediction. In bandit tasks the agents were trained to maximise the return (i.e., the discounted cumulative reward) using the Impala [39] policy gradient algorithm. |
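
The Experiment Setup row describes meta-training an RNN with BPTT and Adam under a log-loss objective for prediction tasks. The following is a minimal illustrative sketch of that setup, not the authors' code: the task distribution (Bernoulli parameter drawn per episode), hidden size, learning rate, episode length, and batch size are assumptions chosen for the example.

```python
# Minimal sketch of meta-training an RNN predictor on Bernoulli prediction
# tasks with log-loss, BPTT, and Adam. Hyperparameters are illustrative
# assumptions, not values reported in the paper.
import torch
import torch.nn as nn

hidden_size, episode_len, batch_size, n_steps = 32, 100, 64, 1000

rnn = nn.LSTM(input_size=1, hidden_size=hidden_size, batch_first=True)
readout = nn.Linear(hidden_size, 1)
opt = torch.optim.Adam(list(rnn.parameters()) + list(readout.parameters()), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()  # log-loss on binary observations

for step in range(n_steps):
    # Sample a fresh task per episode: Bernoulli parameter from a uniform prior.
    p = torch.rand(batch_size, 1, 1)
    obs = torch.bernoulli(p.expand(batch_size, episode_len, 1))
    # Predict each observation from the history of previous observations.
    inputs = torch.cat([torch.zeros(batch_size, 1, 1), obs[:, :-1]], dim=1)
    h, _ = rnn(inputs)           # backprop through time over the full episode
    logits = readout(h)
    loss = loss_fn(logits, obs)  # log-loss of the prediction
    opt.zero_grad()
    loss.backward()
    opt.step()
```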