The Benefits of Model-Based Generalization in Reinforcement Learning
Authors: Kenny John Young, Aditya Ramesh, Louis Kirsch, Jürgen Schmidhuber
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The paper combines theory with experiments: 'Here, we provide theoretical and empirical insight into when, and how, we can expect data generated by a learned model to be useful. First, we provide a simple theorem motivating how learning a model as an intermediate step can narrow down the set of possible value functions more than learning a value function directly from data using the Bellman equation. Second, we provide an illustrative example showing empirically how a similar effect occurs in a more concrete setting with neural network function approximation. Finally, we provide extensive experiments showing the benefit of model-based learning for online RL in environments with combinatorial complexity, but factored structure that allows a learned model to generalize.' (A notation sketch of the model-free vs. model-based contrast appears after the table.) |
| Researcher Affiliation | Academia | ¹University of Alberta and the Alberta Machine Intelligence Institute; ²The Swiss AI Lab IDSIA/USI/SUPSI; ³AI Initiative, KAUST, Thuwal, Saudi Arabia. |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code to reproduce the main experiments is available at: https://github.com/kenjyoung/Model_Generalization_Code_supplement. |
| Open Datasets | No | The paper describes generating data within custom environments (Proc Maze, Button Grid, Pan Flute, Open Grid) and using 'training datasets' or 'model-generated transitions'. It does not state that these specific datasets are publicly available, nor does it refer to pre-existing, publicly accessible datasets with direct access information. |
| Dataset Splits | No | The paper describes an online learning setting where data is generated through interaction with environments, rather than explicit splits of a fixed dataset. It mentions 'evaluating each hyperparameter setting' but does not specify a separate validation dataset split. |
| Hardware Specification | No | The paper states: 'We were able to run 30 seeds efficiently in parallel on a single GPU using automatic batching in JAX (Bradbury et al., 2018).' This specifies only 'a single GPU', with no model number or other hardware details. (A sketch of this seed-parallel batching pattern follows the table.) |
| Software Dependencies | No | The paper mentions 'JAX (Bradbury et al., 2018)' and the Adam optimizer without version numbers for either; Table 1 lists Adam's hyperparameters but no library versions. |
| Experiment Setup | Yes | Table 1: Table of hyperparameters used in experiments in Section 4, including Number of Hidden Layers (3), Number of Hidden Units (200), Hidden Activation (ELU), Optimizer (Adam), Discount Factor (0.9), Batch Size, Target Network Update Frequency, Buffer Size, Model Learning Step-Size, and Rollout Length. (A hypothetical instantiation of the architecture rows follows the table.) |
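For context on the Research Type row: the paper's central theoretical claim contrasts two ways of constraining a value function from data. A standard-notation sketch of that contrast (our notation, not quoted from the paper): model-free learning constrains the value function only through Bellman consistency on observed transitions, while model-based learning first fits a model and then solves for values under it, which can rule out more candidate value functions.

```latex
% Model-free: V^pi is constrained only by Bellman consistency
% on the transitions actually observed in the data.
V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim P(\cdot \mid s,a)}
             \left[ r(s,a) + \gamma\, V^{\pi}(s') \right]

% Model-based: first fit \hat{P} and \hat{r} to the data, then take the
% unique fixed point of the Bellman operator under the learned model
% (with \gamma = 0.9 per Table 1).
\hat{V}^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s),\; s' \sim \hat{P}(\cdot \mid s,a)}
                   \left[ \hat{r}(s,a) + \gamma\, \hat{V}^{\pi}(s') \right]
```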
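The Hardware Specification row quotes the paper's note on 'automatic batching in JAX'. A minimal sketch of how 30 seeds can run as one batched computation on a single GPU via jax.vmap; init_params, train_step, and all shapes here are illustrative stand-ins, not the authors' code:

```python
import jax
import jax.numpy as jnp

def init_params(key):
    # Illustrative: a single weight matrix per seed; the paper's networks are larger.
    return jax.random.normal(key, (4, 2))

def train_step(params, batch):
    # Illustrative squared-error loss standing in for the actual RL update.
    def loss(p):
        return jnp.mean((batch @ p) ** 2)
    grads = jax.grad(loss)(params)
    return params - 1e-3 * grads

# One parameter set per seed, initialized from 30 independent PRNG keys.
keys = jax.random.split(jax.random.PRNGKey(0), 30)
params = jax.vmap(init_params)(keys)

# vmap maps train_step over the leading (seed) axis of params while sharing
# the batch, so all 30 runs execute as one batched GPU computation.
batched_step = jax.jit(jax.vmap(train_step, in_axes=(0, None)))
batch = jnp.ones((8, 4))
params = batched_step(params, batch)
```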
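Finally, a hypothetical instantiation of the architecture rows of Table 1 (3 hidden layers, 200 ELU units, Adam). Flax and Optax, the input and output widths, and the step-size are assumptions for illustration; the paper confirms only JAX and the listed hyperparameters:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn   # Flax is an assumption; the paper only mentions JAX.
import optax              # Optax is likewise assumed for the Adam optimizer.

class ValueNet(nn.Module):
    """MLP matching Table 1: 3 hidden layers of 200 units with ELU activations."""
    @nn.compact
    def __call__(self, x):
        for _ in range(3):
            x = nn.elu(nn.Dense(200)(x))
        return nn.Dense(1)(x)  # Scalar output width is an assumption.

net = ValueNet()
params = net.init(jax.random.PRNGKey(0), jnp.zeros((1, 16)))  # 16-dim input is illustrative.
optimizer = optax.adam(learning_rate=3e-4)  # Adam per Table 1; this step-size is a placeholder.
opt_state = optimizer.init(params)
```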