Universal Reinforcement Learning Algorithms: Survey and Experiments
Authors: John Aslanides, Jan Leike, Marcus Hutter
IJCAI 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We present a short and accessible survey of these URL algorithms under a unified notation and framework, along with results of some experiments that qualitatively illustrate some properties of the resulting policies, and their relative performance on partially-observable gridworld environments. We also present an open-source reference implementation of the algorithms which we hope will facilitate further understanding of, and experimentation with, these ideas. ... The contribution of this paper is three-fold: we present a survey of these URL algorithms, and unify their presentation under a consistent vocabulary. We illuminate these agents with an empirical investigation into their behavior and properties. |
| Researcher Affiliation | Collaboration | John Aslanides, Jan Leike, Marcus Hutter; Australian National University; Future of Humanity Institute, University of Oxford; {john.aslanides, marcus.hutter}@anu.edu.au, leike@google.com |
| Pseudocode | Yes | Algorithm 1: Bayesian URL agent [Hutter, 2005]; Algorithm 2: BayesExp [Lattimore, 2013]; Algorithm 3: Thompson Sampling [Leike et al., 2016]; Algorithm 4: MDL Agent [Lattimore and Hutter, 2011]. |
| Open Source Code | Yes | Our third contribution is to present a portable and extensible open-source software framework [1] for experimenting with, and demonstrating, URL algorithms. ... [1] The framework is named AIXIJS; the source code can be found at http://github.com/aslanides/aixijs. |
| Open Datasets | No | We run our experiments on a class of partially-observable gridworlds. ... One might consider constructing M by enumerating all N × N gridworlds of the class described above, but this is infeasible as the size of such an enumeration explodes combinatorially... Instead, we choose a discrete parametrization D that enumerates an interesting subset of these gridworlds. ... We use this to create the first of our model classes, M_loc. The second, M_Dirichlet, uses a factorized distribution rather than an explicit mixture to avoid this issue. |
| Dataset Splits | No | The paper describes the simulated gridworld environments and models used, but does not specify train/validation/test dataset splits. The experiments involve agent-environment interaction in a simulated setting rather than static dataset processing. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU, GPU models, or memory) used for running the experiments. |
| Software Dependencies | No | The paper names its framework, 'AIXIJS', but does not list any software dependencies with version numbers (e.g., language runtime or library versions). |
| Experiment Setup | Yes | Except where otherwise stated, the following experiments were run on a 10 × 10 gridworld with a single dispenser with θ = 0.75. We average training score over 50 simulations for each agent configuration, and we use κ = 600 MCTS samples and a planning horizon of m = 6. In all cases, discounting is geometric with γ = 0.99. In all cases, the agents are initialized with a uniform prior. |
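
The Experiment Setup row above lists a handful of numeric settings. As a minimal sketch, they can be collected in one place and used to illustrate what geometric discounting with γ = 0.99 and a uniform prior look like in practice. This is not taken from AIXIJS (which is a JavaScript framework): the class name `ExperimentSetup`, its field names, and the helper functions are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSetup:
    """Default settings quoted in the Experiment Setup row.

    Field names are hypothetical; only the values come from the paper.
    """
    grid_size: int = 10          # 10 x 10 gridworld
    dispenser_theta: float = 0.75  # dispenser parameter theta
    num_simulations: int = 50    # runs averaged per agent configuration
    mcts_samples: int = 600      # kappa
    planning_horizon: int = 6    # m
    gamma: float = 0.99          # geometric discount factor


def geometric_discounts(gamma: float, horizon: int) -> list[float]:
    """Return the weights gamma**k for k = 0 .. horizon - 1."""
    return [gamma ** k for k in range(horizon)]


def discounted_return(rewards: list[float], gamma: float) -> float:
    """Sum of rewards weighted by geometric discounting."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))


def uniform_prior(num_models: int) -> list[float]:
    """Assign equal prior weight to each model in a finite model class."""
    return [1.0 / num_models] * num_models


setup = ExperimentSetup()
weights = geometric_discounts(setup.gamma, setup.planning_horizon)
prior = uniform_prior(4)  # 4 is an arbitrary illustrative class size
```

With γ = 0.99 the weights decay slowly, so over a planning horizon of m = 6 every step still counts for nearly as much as the first one; the discount mainly matters over much longer time scales than the planner looks ahead.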