Open-Ended Reinforcement Learning with Neural Reward Functions
Authors: Robert Meier, Asier Mujika
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically test our framework in a diverse set of environments. First, we apply it to a simple 2d navigation task. Then, we apply it to three robotic environments... Finally, we apply it on the challenging Montezuma's Revenge Atari game... (From Introduction) Table 1: Zero-shot environment reward of our algorithm and the number of steps a supervised PPO agent needs to match it. Both columns averaged over 10 repetitions. |
| Researcher Affiliation | Academia | Robert Meier, Department of Computer Science, ETH Zürich, Zürich, Switzerland, romeier@inf.ethz.ch; Asier Mujika, Department of Computer Science, ETH Zürich, Zürich, Switzerland, asierm@inf.ethz.ch |
| Pseudocode | Yes | Putting everything together we get an algorithm which learns reward functions that encode increasingly complex behaviors and learns RL agents that solve those reward functions. Figure 1 illustrates the main steps of our training loop; see Algorithm 1 in the appendix for more detail. (A toy sketch of this loop appears below the table.) |
| Open Source Code | Yes | The implementation of our approach can be found here. (From Abstract) ... We believe this code is useful for the RL community on its own and provide it in the supplementary material. (Section 4.1) ... Included code, we will release it under the Apache 2 license or similar (From Checklist) |
| Open Datasets | Yes | Then, in Section 4.2, we move to BRAX robotic environments (Freeman et al., 2021). ... Finally, in Section 4.3, we apply our method to Montezuma's Revenge Atari game. |
| Dataset Splits | No | The paper does not explicitly detail training, validation, or test dataset splits. It only refers to "training process" or "training" in a general sense. |
| Hardware Specification | Yes | Using just one NVIDIA RTX 3090 GPU, the training process runs at over one million frames per second which enables training of agents in just a few seconds. |
| Software Dependencies | No | Inspired by the BRAX library (Freeman et al., 2021), we implemented both the environment and an A2C agent inside a single JAX (Bradbury et al., 2018) compiled function. ... To train our agent we use the Advantage Actor Critic (A2C) (Mnih et al., 2016) algorithm. JAX is mentioned without a version number, and the remaining components are algorithms or frameworks cited without specific versions. (A minimal JAX sketch of the compiled environment-plus-agent pattern appears below the table.) |
| Experiment Setup | Yes | For the exact hyper-parameters see Table 3 in Appendix B. (Section 4.1) ... One key parameter when training actor-critic methods is entropy regularization. ... We trained two policies with a fixed entropy regularization of 0.0025 (a) and 0.035 (b). (Section 4.1.2) (An illustrative entropy-regularized loss appears below the table.) |
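
The "Pseudocode" row quotes the paper's high-level loop: learn a reward function that encodes a new behavior, train an RL agent to solve it, and repeat (Algorithm 1 in the paper's appendix). The toy below is only meant to make that alternation concrete; the tabular `reward` array and the one-line `train_agent` are illustrative stand-ins, not the paper's neural reward network or RL agent.

```python
import numpy as np

num_states = 10
reward = np.ones(num_states)        # stand-in for the neural reward function

def train_agent(reward):
    # Stand-in for RL training: "solve" the current reward by picking the
    # state it values most.
    return int(np.argmax(reward))

for phase in range(5):
    reached = train_agent(reward)   # 1. train an agent on the current reward
    reward[reached] = 0.0           # 2. stop rewarding behavior already mastered
    print(f"phase {phase}: agent reaches state {reached}")
```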
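
The "Hardware Specification" and "Software Dependencies" rows quote the design behind the million-frames-per-second figure: the environment and the agent update are compiled together in a single JAX function. The sketch below shows that pattern on a toy 1D environment with a crude REINFORCE-style update rather than the paper's 2D navigation task and A2C agent; all names and the update rule are illustrative, not taken from the released code.

```python
import jax
import jax.numpy as jnp

def env_step(pos, action):
    # Toy 1D world: the agent moves left or right on a line and is rewarded
    # for being far from the origin. A stand-in for the paper's environments.
    new_pos = jnp.clip(pos + jnp.where(action == 1, 0.1, -0.1), -1.0, 1.0)
    return new_pos, jnp.abs(new_pos)

@jax.jit
def rollout_and_update(params, pos, key):
    # Unroll a short trajectory and apply a one-step policy-gradient update,
    # all inside one compiled function (environment + agent jitted together).
    def step(carry, key_t):
        params, pos = carry
        logits = jnp.array([params[0] * pos, params[1] * pos])
        action = jax.random.categorical(key_t, logits)
        new_pos, reward = env_step(pos, action)
        grad = jax.grad(
            lambda p: jax.nn.log_softmax(jnp.array([p[0] * pos, p[1] * pos]))[action]
        )(params)
        params = params + 0.01 * reward * grad  # crude policy-gradient step
        return (params, new_pos), reward
    keys = jax.random.split(key, 32)
    (params, pos), rewards = jax.lax.scan(step, (params, pos), keys)
    return params, pos, rewards.sum()

params, pos, total = rollout_and_update(
    jnp.zeros(2), jnp.array(0.0), jax.random.PRNGKey(0)
)
print(float(total))
```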
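
The "Experiment Setup" row singles out entropy regularization and quotes the two coefficients, 0.0025 and 0.035. Below is the generic form of an entropy-regularized actor-critic loss in which such a coefficient appears; the function, its value-loss weight, and the dummy data are illustrative, not the paper's exact loss or code.

```python
import jax
import jax.numpy as jnp

def a2c_loss(logits, actions, advantages, returns, values, entropy_coef=0.0025):
    # Generic entropy-regularized actor-critic loss; the entropy bonus keeps
    # the policy stochastic, and entropy_coef controls how strongly.
    log_probs = jax.nn.log_softmax(logits)                              # (T, A)
    chosen = jnp.take_along_axis(log_probs, actions[:, None], axis=1)[:, 0]
    policy_loss = -(chosen * advantages).mean()                         # actor term
    value_loss = ((returns - values) ** 2).mean()                       # critic term
    entropy = -(jnp.exp(log_probs) * log_probs).sum(axis=1).mean()
    # The 0.5 value-loss weight is a common default, not taken from the paper.
    return policy_loss + 0.5 * value_loss - entropy_coef * entropy

# Comparing the two quoted coefficients on dummy data: a larger coefficient
# (0.035 vs. 0.0025) keeps the policy more stochastic for longer.
T, A = 5, 4
logits = jax.random.normal(jax.random.PRNGKey(0), (T, A))
actions = jnp.zeros(T, dtype=jnp.int32)
args = (logits, actions, jnp.ones(T), jnp.ones(T), jnp.zeros(T))
print(a2c_loss(*args, entropy_coef=0.0025), a2c_loss(*args, entropy_coef=0.035))
```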