Open-Ended Reinforcement Learning with Neural Reward Functions
Authors: Robert Meier, Asier Mujika
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically test our framework in a diverse set of environments. First, we apply it to a simple 2d navigation task. Then, we apply it to three robotic environments... Finally, we apply it on the challenging Montezuma's Revenge Atari game... (From Introduction) Table 1: Zero-shot environment reward of our algorithm and the number of steps a supervised PPO agent needs to match it. Both columns averaged over 10 repetitions. |
| Researcher Affiliation | Academia | Robert Meier, Department of Computer Science, ETH Zürich, Zürich, Switzerland, romeier@inf.ethz.ch; Asier Mujika, Department of Computer Science, ETH Zürich, Zürich, Switzerland, asierm@inf.ethz.ch |
| Pseudocode | Yes | Putting everything together we get an algorithm which learns reward functions that encode increasingly complex behaviors and learns RL agents that solve those reward functions. Figure 1 illustrates the main steps of our training loop; see Algorithm 1 in the appendix for more detail. (A toy sketch of this loop appears below the table.) |
| Open Source Code | Yes | The implementation of our approach can be found here. (From Abstract) ... We believe this code is useful for the RL community on its own and provide it in the supplementary material. (Section 4.1) ... Included code, we will release it under the Apache 2 license or similar (From Checklist) |
| Open Datasets | Yes | Then, in Section 4.2, we move to BRAX robotic environments (Freeman et al., 2021). ... Finally, in Section 4.3, we apply our method to Montezuma's Revenge Atari game. |
| Dataset Splits | No | The paper does not explicitly detail training, validation, or test dataset splits. It only refers to "training process" or "training" in a general sense. |
| Hardware Specification | Yes | Using just one NVIDIA RTX 3090 GPU, the training process runs at over one million frames per second which enables training of agents in just a few seconds. |
| Software Dependencies | No | Inspired by the BRAX library (Freeman et al., 2021), we implemented both the environment and an A2C agent inside a single JAX (Bradbury et al., 2018) compiled function. ... To train our agent we use the Advantage Actor Critic (A2C) (Mnih et al., 2016) algorithm. JAX is mentioned without a version number, and the remaining components are algorithms or frameworks cited without specific versions. (A minimal JAX sketch of the compiled environment-plus-agent pattern appears below the table.) |
| Experiment Setup | Yes | For the exact hyper-parameters see Table 3 in Appendix B. (Section 4.1) ... One key parameter when training actor-critic methods is entropy regularization. ... We trained two policies with a fixed entropy regularization of 0.0025 (a) and 0.035 (b). (Section 4.1.2) (An illustrative entropy-regularized loss appears below the table.) |
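
The "Pseudocode" row quotes the paper's high-level loop: learn a reward function that encodes a new behavior, train an RL agent to solve it, and repeat (Algorithm 1 in the paper's appendix). The toy below is only meant to make that alternation concrete; the tabular `reward` array and the one-line `train_agent` are illustrative stand-ins, not the paper's neural reward network or RL agent.

```python
import numpy as np

num_states = 10
reward = np.ones(num_states)        # stand-in for the neural reward function

def train_agent(reward):
    # Stand-in for RL training: "solve" the current reward by picking the
    # state it values most.
    return int(np.argmax(reward))

for phase in range(5):
    reached = train_agent(reward)   # 1. train an agent on the current reward
    reward[reached] = 0.0           # 2. stop rewarding behavior already mastered
    print(f"phase {phase}: agent reaches state {reached}")
```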
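
The "Hardware Specification" and "Software Dependencies" rows quote the design behind the million-frames-per-second figure: the environment and the agent update are compiled together in a single JAX function. The sketch below shows that pattern on a toy 1D environment with a crude REINFORCE-style update rather than the paper's 2D navigation task and A2C agent; all names and the update rule are illustrative, not taken from the released code.

```python
import jax
import jax.numpy as jnp

def env_step(pos, action):
    # Toy 1D world: the agent moves left or right on a line and is rewarded
    # for being far from the origin. A stand-in for the paper's environments.
    new_pos = jnp.clip(pos + jnp.where(action == 1, 0.1, -0.1), -1.0, 1.0)
    return new_pos, jnp.abs(new_pos)

@jax.jit
def rollout_and_update(params, pos, key):
    # Unroll a short trajectory and apply a one-step policy-gradient update,
    # all inside one compiled function (environment + agent jitted together).
    def step(carry, key_t):
        params, pos = carry
        logits = jnp.array([params[0] * pos, params[1] * pos])
        action = jax.random.categorical(key_t, logits)
        new_pos, reward = env_step(pos, action)
        grad = jax.grad(
            lambda p: jax.nn.log_softmax(jnp.array([p[0] * pos, p[1] * pos]))[action]
        )(params)
        params = params + 0.01 * reward * grad  # crude policy-gradient step
        return (params, new_pos), reward
    keys = jax.random.split(key, 32)
    (params, pos), rewards = jax.lax.scan(step, (params, pos), keys)
    return params, pos, rewards.sum()

params, pos, total = rollout_and_update(
    jnp.zeros(2), jnp.array(0.0), jax.random.PRNGKey(0)
)
print(float(total))
```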
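
The "Experiment Setup" row singles out entropy regularization and quotes the two coefficients, 0.0025 and 0.035. Below is the generic form of an entropy-regularized actor-critic loss in which such a coefficient appears; the function, its value-loss weight, and the dummy data are illustrative, not the paper's exact loss or code.

```python
import jax
import jax.numpy as jnp

def a2c_loss(logits, actions, advantages, returns, values, entropy_coef=0.0025):
    # Generic entropy-regularized actor-critic loss; the entropy bonus keeps
    # the policy stochastic, and entropy_coef controls how strongly.
    log_probs = jax.nn.log_softmax(logits)                              # (T, A)
    chosen = jnp.take_along_axis(log_probs, actions[:, None], axis=1)[:, 0]
    policy_loss = -(chosen * advantages).mean()                         # actor term
    value_loss = ((returns - values) ** 2).mean()                       # critic term
    entropy = -(jnp.exp(log_probs) * log_probs).sum(axis=1).mean()
    # The 0.5 value-loss weight is a common default, not taken from the paper.
    return policy_loss + 0.5 * value_loss - entropy_coef * entropy

# Comparing the two quoted coefficients on dummy data: a larger coefficient
# (0.035 vs. 0.0025) keeps the policy more stochastic for longer.
T, A = 5, 4
logits = jax.random.normal(jax.random.PRNGKey(0), (T, A))
actions = jnp.zeros(T, dtype=jnp.int32)
args = (logits, actions, jnp.ones(T), jnp.ones(T), jnp.zeros(T))
print(a2c_loss(*args, entropy_coef=0.0025), a2c_loss(*args, entropy_coef=0.035))
```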