Open-Ended Reinforcement Learning with Neural Reward Functions

Authors: Robert Meier, Asier Mujika

NeurIPS 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We empirically test our framework in a diverse set of environments. First, we apply it to a simple 2d navigation task. Then, we apply it to three robotic environments... Finally, we apply it on the challenging Montezuma's Revenge Atari game..." (Introduction); "Table 1: Zero-shot environment reward of our algorithm and the number of steps a supervised PPO agent needs to match it. Both columns averaged over 10 repetitions."
Researcher Affiliation | Academia | "Robert Meier, Department of Computer Science, ETH Zürich, Zürich, Switzerland, romeier@inf.ethz.ch; Asier Mujika, Department of Computer Science, ETH Zürich, Zürich, Switzerland, asierm@inf.ethz.ch"
Pseudocode | Yes | "Putting everything together, we get an algorithm which learns reward functions that encode increasingly complex behaviors and learns RL agents that solve those reward functions. Figure 1 illustrates the main steps of our training loop; see Algorithm 1 in the appendix for more detail." (A structural sketch of this loop is given after the table.)
Open Source Code | Yes | "The implementation of our approach can be found here." (Abstract); "We believe this code is useful for the RL community on its own and provide it in the supplementary material." (Section 4.1); "Included code, we will release it under the Apache 2 license or similar" (Checklist)
Open Datasets | Yes | "Then, in Section 4.2, we move to BRAX robotic environments (Freeman et al., 2021). ... Finally, in Section 4.3, we apply our method to Montezuma's Revenge Atari game."
Dataset Splits | No | The paper does not explicitly detail training, validation, or test dataset splits; it only refers to the "training process" or "training" in a general sense.
Hardware Specification | Yes | "Using just one NVIDIA RTX 3090 GPU, the training process runs at over one million frames per second which enables training of agents in just a few seconds."
Software Dependencies | No | "Inspired by the BRAX library (Freeman et al., 2021), we implemented both the environment and an A2C agent inside a single JAX (Bradbury et al., 2018) compiled function. ... To train our agent we use the Advantage Actor Critic (A2C) (Mnih et al., 2016) algorithm." The paper names JAX but gives no version number, and the remaining components are algorithms or frameworks cited without specific versions. (A JAX sketch of this fused setup follows the table.)
Experiment Setup | Yes | "For the exact hyper-parameters see Table 3 in Appendix B." (Section 4.1); "One key parameter when training actor critic methods is entropy regularization. ... We trained two policies with a fixed entropy regularization of 0.0025 (a) and 0.035 (b)." (Section 4.1.2) (A sketch of the entropy-regularized actor loss follows the table.)
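
The Pseudocode row above summarizes the training loop: learn a reward function that encodes a new behavior, train an RL agent to solve it, and repeat. The outline below is only a structural sketch of that alternation; every function in it is a placeholder standing in for the paper's Algorithm 1, not the authors' implementation.

```python
# Structural sketch of the open-ended loop described in the Pseudocode row.
# All functions are placeholders: they return their inputs unchanged and exist
# only to show the shape of the alternation, not the paper's Algorithm 1.

def init_reward_network():
    return {"reward_params": None}      # placeholder neural reward function

def init_agent():
    return {"policy_params": None}      # placeholder RL agent (e.g. A2C or PPO)

def train_agent(agent, reward_net):
    return agent                        # placeholder: optimize the agent against reward_net

def collect_rollouts(agent):
    return []                           # placeholder: states visited by the trained agent

def update_reward_network(reward_net, rollouts):
    return reward_net                   # placeholder: encode the next, harder behavior

def open_ended_training(num_iterations=10):
    reward_net = init_reward_network()
    agent = init_agent()
    for _ in range(num_iterations):
        agent = train_agent(agent, reward_net)                     # solve the current reward
        rollouts = collect_rollouts(agent)                         # observe the resulting behavior
        reward_net = update_reward_network(reward_net, rollouts)   # define a new reward function
    return agent, reward_net

agent, reward_net = open_ended_training()
```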
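
The Software Dependencies and Hardware Specification rows quote the paper's claim that the environment and the A2C agent run inside a single JAX-compiled function, which is what enables the reported throughput of over one million frames per second. Below is a minimal, self-contained sketch of that pattern using a toy batched 2D navigation environment and a linear actor-critic; the environment, network shapes, step counts, and learning rate are illustrative assumptions, and only the default entropy coefficient (0.0025) is taken from the paper.

```python
import jax
import jax.numpy as jnp

NUM_ENVS, ROLLOUT_LEN, LR, GAMMA, ENT_COEF = 1024, 16, 3e-4, 0.99, 0.0025

# Four unit moves scaled to a small step size.
MOVES = jnp.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]) * 0.1

def init_params(key):
    k1, k2 = jax.random.split(key)
    return {
        "policy": 0.1 * jax.random.normal(k1, (2, 4)),  # logits head: 4 move directions
        "value": 0.1 * jax.random.normal(k2, (2, 1)),   # linear state-value head
    }

def env_step(pos, action):
    """Toy batched 2D navigation: move, stay inside [-1, 1]^2, reward closeness to origin."""
    new_pos = jnp.clip(pos + MOVES[action], -1.0, 1.0)
    reward = -jnp.linalg.norm(new_pos, axis=-1)
    return new_pos, reward

def loss_fn(params, pos, key):
    """Roll out ROLLOUT_LEN steps with lax.scan and accumulate a one-step A2C loss."""
    def step(pos, key_t):
        logits = pos @ params["policy"]
        action = jax.random.categorical(key_t, logits)
        logp = jax.nn.log_softmax(logits)
        entropy = -(jnp.exp(logp) * logp).sum(axis=-1)
        new_pos, reward = env_step(pos, action)
        value = (pos @ params["value"]).squeeze(-1)
        next_value = (new_pos @ params["value"]).squeeze(-1)
        adv = reward + GAMMA * jax.lax.stop_gradient(next_value) - value
        chosen = jnp.take_along_axis(logp, action[:, None], axis=-1).squeeze(-1)
        loss = (-(jax.lax.stop_gradient(adv) * chosen).mean()   # policy-gradient term
                + 0.5 * (adv ** 2).mean()                       # value-regression term
                - ENT_COEF * entropy.mean())                    # entropy bonus
        return new_pos, loss
    final_pos, losses = jax.lax.scan(step, pos, jax.random.split(key, ROLLOUT_LEN))
    return losses.mean(), final_pos

@jax.jit
def train_step(params, pos, key):
    """Environment rollout and gradient update fused into one compiled function."""
    (loss, new_pos), grads = jax.value_and_grad(loss_fn, has_aux=True)(params, pos, key)
    params = jax.tree_util.tree_map(lambda p, g: p - LR * g, params, grads)
    return params, new_pos, loss

key = jax.random.PRNGKey(0)
params = init_params(key)
pos = jnp.zeros((NUM_ENVS, 2))
for _ in range(100):
    key, sub = jax.random.split(key)
    params, pos, loss = train_step(params, pos, sub)
```

Because both the rollout (lax.scan) and the gradient step sit inside one jit-compiled function, each training step runs as a single GPU program and the per-step Python overhead is limited to the outer loop, which is the general mechanism behind the throughput the paper reports.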
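
The Experiment Setup row points to entropy regularization as the key hyper-parameter studied in Section 4.1.2 (coefficients 0.0025 and 0.035). The snippet below is a minimal illustration of where such a coefficient enters an actor-critic policy loss; it is an assumption-level sketch, not taken from the released code.

```python
import jax
import jax.numpy as jnp

def policy_loss(logits, actions, advantages, entropy_coef=0.0025):
    """Actor loss with an entropy bonus; entropy_coef is the knob varied in
    Section 4.1.2 of the paper (0.0025 vs. 0.035). Larger values keep the
    policy more stochastic, i.e. more exploratory."""
    logp = jax.nn.log_softmax(logits)                              # (batch, num_actions)
    chosen = jnp.take_along_axis(logp, actions[:, None], axis=-1).squeeze(-1)
    entropy = -(jnp.exp(logp) * logp).sum(axis=-1)                 # per-sample policy entropy
    return -(advantages * chosen).mean() - entropy_coef * entropy.mean()

# Example with dummy data, comparing the two coefficients quoted from the paper.
logits = jnp.zeros((8, 4))
actions = jnp.zeros((8,), dtype=jnp.int32)
advantages = jnp.ones((8,))
loss_low = policy_loss(logits, actions, advantages, entropy_coef=0.0025)
loss_high = policy_loss(logits, actions, advantages, entropy_coef=0.035)
```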