Learning Synthetic Environments and Reward Networks for Reinforcement Learning
Authors: Fabio Ferreira, Thomas Nierhoff, Andreas Sälinger, Frank Hutter
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our proposed new concept on a broad range of RL algorithms and classic control environments. In a one-to-one comparison, learning an SE proxy requires more interactions with the real environment than training agents only on the real environment. However, once such an SE has been learned, we do not need any interactions with the real environment to train new agents. Moreover, the learned SE proxies allow us to train agents with fewer interactions while maintaining the original task performance. Our empirical results suggest that SEs achieve this result by learning informed representations that bias the agents towards relevant states. |
| Researcher Affiliation | Collaboration | 1 University of Freiburg, 2 Bosch Center for Artificial Intelligence |
| Pseudocode | Yes | Algorithm 1: Learning Synthetic Env. with NES (see the NES outer-loop sketch after this table) |
| Open Source Code | Yes | Our PyTorch (Paszke et al., 2019) code and models are made available publicly at https://github.com/automl/learning_environments |
| Open Datasets | Yes | Gym tasks (Brockman et al., 2016) CartPole and Acrobot, Cliff Walking (Sutton & Barto, 2018), MountainCarContinuous-v0 (Brockman et al., 2016) and HalfCheetah-v3 (Todorov & Tassa, 2012) |
| Dataset Splits | No | The paper mentions 'early stopping heuristic' and 'Evaluate Agent' functions which involve testing on the real environment, but it does not specify explicit dataset splits like 'training/validation/test' or 'k-fold cross-validation' in the main text. For example, 'After agent training, we evaluated each agent on the real environment across 10 test episodes'. |
| Hardware Specification | Yes | Each worker had one Intel Xeon Gold 6242 CPU core at its disposal, resulting in an overall runtime of 6-7h on Acrobot and 5-6h on CartPole for 200 NES outer loop iterations. |
| Software Dependencies | No | The paper mentions 'PyTorch' but does not provide specific version numbers for any software dependencies, libraries, or programming languages. For example, 'Our PyTorch (Paszke et al., 2019) code and models are made available publicly.' |
| Experiment Setup | Yes | Experimental Setup: So far we have described our proposed method on an abstract level; before we start with individual SE experiments, we describe the experimental setup. In our work, we refer to the process of optimizing for suitable SEs with Algorithm 1 as SE training, and to the process of training agents on SEs as agent training. For both SE and agent training on the discrete-action-space CartPole-v0 and Acrobot-v1 environments, we use DDQN (van Hasselt et al., 2016). (A minimal sketch of the agent evaluation protocol also follows the table.) |
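
The Pseudocode row refers to Algorithm 1, which learns SE parameters with Natural Evolution Strategies (NES): the outer loop perturbs an SE parameter vector, an inner loop trains a fresh agent on each perturbed SE, the resulting agents are scored on the real environment, and a fitness-weighted gradient step updates the SE. The following is only a minimal sketch of that outer loop under these assumptions; `train_agent_on_se` and `evaluate_agent_on_real_env` are illustrative stand-ins, not the interface of the released automl/learning_environments code.

```python
# Minimal NES outer-loop sketch for learning a synthetic environment (SE).
# The SE is represented as a flat parameter vector psi; the agent-training and
# agent-evaluation helpers below are dummy placeholders for illustration only.
import numpy as np

def train_agent_on_se(psi, seed):
    """Stand-in for training a fresh agent (e.g. DDQN) on the SE given by psi."""
    rng = np.random.default_rng(seed)
    return {"psi": psi, "noise": rng.normal()}

def evaluate_agent_on_real_env(agent):
    """Stand-in for scoring the trained agent on the real task (e.g. CartPole-v0)."""
    return -np.linalg.norm(agent["psi"]) + agent["noise"]

def nes_learn_se(dim=64, population=16, sigma=0.1, lr=0.02, iterations=200):
    psi = np.zeros(dim)                                   # SE parameters (outer-loop variable)
    for _ in range(iterations):
        eps = np.random.randn(population, dim)            # sampled perturbations
        scores = np.empty(population)
        for i in range(population):
            agent = train_agent_on_se(psi + sigma * eps[i], seed=i)  # inner loop: agent training
            scores[i] = evaluate_agent_on_real_env(agent)            # fitness on the real env
        ranks = (scores - scores.mean()) / (scores.std() + 1e-8)     # simple fitness shaping
        psi += lr / (population * sigma) * eps.T @ ranks             # NES gradient estimate
    return psi

if __name__ == "__main__":
    learned_se_params = nes_learn_se(iterations=5)        # tiny run, just to show the loop
    print(learned_se_params[:5])
```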
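
The Dataset Splits and Experiment Setup rows describe scoring trained agents on the real environment across 10 test episodes. The sketch below shows what such an evaluation loop could look like, assuming the classic 4-tuple Gym step API that was current when the paper appeared; the policy here is a random placeholder, not the paper's DDQN agent.

```python
# Hedged sketch of an agent evaluation loop on the real Gym environment,
# averaging the return over 10 test episodes (pre-0.26 gym API assumed).
import gym

def evaluate_on_real_env(policy, env_id="CartPole-v0", episodes=10):
    env = gym.make(env_id)
    returns = []
    for _ in range(episodes):
        obs, done, ep_return = env.reset(), False, 0.0
        while not done:
            action = policy(obs)                          # query the (trained) agent
            obs, reward, done, _ = env.step(action)       # classic 4-tuple step API
            ep_return += reward
        returns.append(ep_return)
    env.close()
    return sum(returns) / len(returns)

if __name__ == "__main__":
    env = gym.make("CartPole-v0")
    random_policy = lambda obs: env.action_space.sample() # placeholder policy
    print(evaluate_on_real_env(random_policy))
```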