Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Open-Ended Reinforcement Learning with Neural Reward Functions
Authors: Robert Meier, Asier Mujika
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We empirically test our framework in a diverse set of environments. First, we apply it to a simple 2D navigation task. Then, we apply it to three robotic environments... Finally, we apply it on the challenging Montezuma's Revenge Atari game... (From Introduction) Table 1: Zero-shot environment reward of our algorithm and the number of steps a supervised PPO agent needs to match it. Both columns averaged over 10 repetitions. |
| Researcher Affiliation | Academia | Robert Meier, Department of Computer Science, ETH Zürich, Zürich, Switzerland, EMAIL; Asier Mujika, Department of Computer Science, ETH Zürich, Zürich, Switzerland, EMAIL |
| Pseudocode | Yes | Putting everything together, we get an algorithm which learns reward functions that encode increasingly complex behaviors and learns RL agents that solve those reward functions. Figure 1 illustrates the main steps of our training loop; see Algorithm 1 in the appendix for more detail. |
| Open Source Code | Yes | The implementation of our approach can be found here. (From Abstract) ... We believe this code is useful for the RL community on its own and provide it in the supplementary material. (Section 4.1) ... Included code; we will release it under the Apache 2 license or similar. (From Checklist) |
| Open Datasets | Yes | Then, in Section 4.2, we move to BRAX robotic environments (Freeman et al., 2021). ... Finally, in Section 4.3, we apply our method to the Montezuma's Revenge Atari game. |
| Dataset Splits | No | The paper does not explicitly detail training, validation, or test dataset splits. It only refers to "training process" or "training" in a general sense. |
| Hardware Specification | Yes | Using just one NVIDIA RTX 3090 GPU, the training process runs at over one million frames per second which enables training of agents in just a few seconds. |
| Software Dependencies | No | Inspired by the BRAX library (Freeman et al., 2021), we implemented both the environment and an A2C agent inside a single JAX (Bradbury et al., 2018) compiled function. ... To train our agent we use the Advantage Actor Critic (A2C) (Mnih et al., 2016) algorithm. The paper names JAX and the A2C algorithm, but specifies no version numbers for JAX or any other software component. |
| Experiment Setup | Yes | For the exact hyper-parameters see Table 3 in Appendix B. (Section 4.1) ... One key parameter when training actor critic methods is entropy regularization. ... We trained two policies with a fixed entropy regularization of 0.0025 (a) and 0.035 (b). (Section 4.1.2) |
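The pseudocode row above quotes the paper's high-level loop: alternately learn a neural reward function that encodes new behavior, then train an RL agent to solve it. As a rough illustration of that alternation (not the authors' Algorithm 1), here is a minimal sketch in which a toy reward function rewards states the agent has not yet reached, and a random-search "agent" stands in for A2C/PPO training; all function names and the discrete toy state space are assumptions for illustration only.

```python
import random

def train_reward_fn(visited_states):
    """Toy stand-in for fitting a neural reward function: reward any
    state the current set of agents has NOT yet visited."""
    visited = set(visited_states)
    return lambda s: 0.0 if s in visited else 1.0

def train_agent(reward_fn, n_states=10, episodes=200, rng=random):
    """Toy stand-in for RL training (A2C/PPO in the paper): search a
    small discrete state space for the highest-reward state."""
    best_state, best_reward = 0, reward_fn(0)
    for _ in range(episodes):
        s = rng.randrange(n_states)
        r = reward_fn(s)
        if r > best_reward:
            best_state, best_reward = s, r
    return best_state

def open_ended_loop(iterations=5):
    """Alternate reward learning and agent training, growing the set of
    behaviors (here: reachable states) each iteration."""
    visited = [0]  # behaviors encoded so far
    for _ in range(iterations):
        reward_fn = train_reward_fn(visited)  # step 1: fit reward
        new_state = train_agent(reward_fn)    # step 2: solve it with RL
        visited.append(new_state)             # step 3: expand behavior set
    return visited
```

Each iteration the reward function shifts to cover behavior the previous agents already mastered, which is the mechanism behind the "increasingly complex behaviors" quoted in the table.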