Dynamics-Aware Comparison of Learned Reward Functions

Authors: Blake Wulfe, Logan Michael Ellis, Jean Mercat, Rowan Thomas McAllister, Adrien Gaidon, Ashwin Balakrishna

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in simulated physical domains demonstrate that DARD enables reliable reward comparisons without policy optimization and is significantly more predictive than baseline methods of downstream policy performance when dealing with learned reward functions.
Researcher Affiliation | Collaboration | Blake Wulfe, Logan Ellis, Jean Mercat, Rowan McAllister, Adrien Gaidon, Toyota Research Institute (TRI), {first.last}@tri.global; Ashwin Balakrishna, University of California, Berkeley, ashwinbalakrishna@berkeley.edu
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | We have attempted to make the experimental results easily reproducible by providing the following: first, source code necessary to reproduce results is available in the supplementary material.
Open Datasets | Yes | We evaluate DARD in two environments. We first study the Bouncing Balls environment, which requires an agent to navigate from a random initial position to a random goal position while avoiding a set of balls randomly bouncing around the scene. The second environment is adapted from the Reacher environment from OpenAI Gym (Brockman et al., 2016). [...] We collect sets of datasets for use in reward learning and evaluation. Each set contains three subdatasets for (i) reward model training, (ii) reward model validation, and (iii) reward evaluation. We collect these dataset sets using two policies. [...] Table 3: The sizes of the datasets used for reward learning and evaluation for each environment.
Dataset Splits | Yes | Table 3: The sizes of the datasets used for reward learning and evaluation for each environment. [...] Bouncing Balls: 2,000,000 / 500,000 / 500,000; Reacher: 4,000,000 / 1,000,000 / 1,000,000 (reward model training / reward model validation / reward evaluation). (A dataset-collection sketch using these sizes appears below the table.)
Hardware Specification | No | For the Bouncing Balls environment, we approximate the transition function with a constant velocity model for all agents, implemented in PyTorch and executed on the GPU (Paszke et al., 2019). For the Reacher environment, we use the ground truth simulator, which we execute in parallel across 8 cores. (A constant-velocity transition sketch appears below the table.)
Software Dependencies | No | We used the RLlib (Liang et al., 2018) implementation of Proximal Policy Optimization (PPO) for learning control policies (Schulman et al., 2017). [...] For the Bouncing Balls environment, we approximate the transition function with a constant velocity model for all agents, implemented in PyTorch and executed on the GPU (Paszke et al., 2019). [...] Optimizer Adam (Kingma & Ba, 2014).
Experiment Setup | Yes | Table 2: PPO parameters. Table 4: Parameters for learning transition models. Table 5: Parameters for reward learning algorithms. These tables provide specific values for parameters such as Discount γ, GAE λ, # Timesteps Per Rollout, Train Epochs Per Rollout, # Minibatches Per Epoch, Entropy Coefficient, PPO Clip Range, Learning Rate, Hidden Layer Sizes, # Workers, # Environments Per Worker, Total Timesteps, Batch Size, Activation Function, Normalization Layer, Gradient Clip, Optimizer, Reward Regularization, Trajectory Length, and # Random Pairs OOD. (An RLlib PPO configuration sketch keyed to the Table 2 parameter names appears below.)
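Building on the Open Datasets and Dataset Splits rows, the sketch below organizes the Table 3 split sizes and shows one way transition datasets could be rolled out with a behavior policy. Only the split sizes come from the paper; the dictionary layout, the `collect_transitions` helper, the random policy, and the "Reacher-v2" Gym ID are illustrative assumptions.

```python
import gym
import numpy as np

# Split sizes from Table 3: (reward model training, reward model validation, reward evaluation).
DATASET_SIZES = {
    "bouncing_balls": (2_000_000, 500_000, 500_000),
    "reacher": (4_000_000, 1_000_000, 1_000_000),
}

def collect_transitions(env, policy, n_transitions):
    """Roll out `policy` in `env`, returning (s, a, s') arrays of transitions."""
    states, actions, next_states = [], [], []
    obs = env.reset()
    while len(states) < n_transitions:
        action = policy(obs)
        next_obs, _, done, _ = env.step(action)
        states.append(obs)
        actions.append(action)
        next_states.append(next_obs)
        obs = env.reset() if done else next_obs
    return np.array(states), np.array(actions), np.array(next_states)

# Example: build the Reacher reward-evaluation split with a random behavior policy
# (the paper collects data with two policies; a random policy is used here only for illustration).
env = gym.make("Reacher-v2")
random_policy = lambda obs: env.action_space.sample()
s, a, s_next = collect_transitions(env, random_policy, DATASET_SIZES["reacher"][2])
```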
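The Hardware Specification row notes that the Bouncing Balls transition function is approximated by a constant velocity model implemented in PyTorch and run on the GPU. A minimal sketch of such a model follows, assuming a per-agent state layout of [x, y, vx, vy] and a fixed timestep; both are illustrative assumptions rather than details taken from the paper.

```python
import torch

def constant_velocity_step(states: torch.Tensor, dt: float = 0.1) -> torch.Tensor:
    """Predict next states under a constant-velocity model.

    states: (..., 4) tensor laid out as [x, y, vx, vy] per agent (assumed layout).
    """
    positions, velocities = states[..., :2], states[..., 2:]
    next_positions = positions + dt * velocities   # positions advance linearly
    return torch.cat([next_positions, velocities], dim=-1)  # velocities unchanged

# Batched evaluation on the GPU when available, mirroring the paper's GPU execution.
device = "cuda" if torch.cuda.is_available() else "cpu"
states = torch.randn(1024, 4, device=device)
next_states = constant_velocity_step(states)
```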
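The Experiment Setup row lists the Table 2 PPO hyperparameters by name only. The sketch below shows how those names could map onto RLlib's (pre-2.0) PPO configuration; every numeric value is a placeholder rather than the paper's setting, and the key-to-table mapping, the environment ID, and the stopping condition are assumptions.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ppo_config = {
    "gamma": 0.99,                           # Discount γ (placeholder)
    "lambda": 0.95,                          # GAE λ (placeholder)
    "train_batch_size": 4000,                # # Timesteps Per Rollout (placeholder)
    "num_sgd_iter": 10,                      # Train Epochs Per Rollout (placeholder)
    "sgd_minibatch_size": 500,               # derived from # Minibatches Per Epoch (placeholder)
    "entropy_coeff": 0.01,                   # Entropy Coefficient (placeholder)
    "clip_param": 0.2,                       # PPO Clip Range (placeholder)
    "lr": 3e-4,                              # Learning Rate (placeholder)
    "model": {"fcnet_hiddens": [256, 256]},  # Hidden Layer Sizes (placeholder)
    "num_workers": 8,                        # # Workers (placeholder)
    "num_envs_per_worker": 4,                # # Environments Per Worker (placeholder)
}

ray.init()
trainer = PPOTrainer(config=ppo_config, env="Reacher-v2")  # environment ID assumed
total_timesteps = 1_000_000                  # Total Timesteps (placeholder)
while True:
    result = trainer.train()
    if result["timesteps_total"] >= total_timesteps:
        break
```

Note that RLlib specifies a minibatch size rather than a count of minibatches per epoch, so reproducing Table 2 exactly would require converting between the two.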