Dynamics-Aware Comparison of Learned Reward Functions

Authors: Blake Wulfe, Logan Michael Ellis, Jean Mercat, Rowan Thomas McAllister, Adrien Gaidon, Ashwin Balakrishna

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments in simulated physical domains demonstrate that DARD enables reliable reward comparisons without policy optimization and is significantly more predictive than baseline methods of downstream policy performance when dealing with learned reward functions.
Researcher Affiliation | Collaboration | Blake Wulfe, Logan Ellis, Jean Mercat, Rowan McAllister, Adrien Gaidon, Toyota Research Institute (TRI), {first.last}@tri.global; Ashwin Balakrishna, University of California, Berkeley, ashwinbalakrishna@berkeley.edu
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | Yes | We have attempted to make the experimental results easily reproducible by providing the following: first, source code necessary to reproduce results is available in the supplementary material.
Open Datasets | Yes | We evaluate DARD in two environments. We first study the Bouncing Balls environment, which requires an agent to navigate from a random initial position to a random goal position while avoiding a set of balls randomly bouncing around the scene. The second environment is adapted from the Reacher environment from OpenAI Gym (Brockman et al., 2016). [...] We collect sets of datasets for use in reward learning and evaluation. Each set contains three subdatasets for (i) reward model training, (ii) reward model validation, and (iii) reward evaluation. We collect these dataset sets using two policies. [...] Table 3: The sizes of the datasets used for reward learning and evaluation for each environment.
Dataset Splits | Yes | Table 3: The sizes of the datasets used for reward learning and evaluation for each environment. [...] Bouncing Balls: 2,000,000 / 500,000 / 500,000; Reacher: 4,000,000 / 1,000,000 / 1,000,000 (reward model training / reward model validation / reward evaluation). (A dataset-collection sketch using these sizes appears below the table.)
Hardware Specification | No | For the Bouncing Balls environment, we approximate the transition function with a constant velocity model for all agents, implemented in PyTorch and executed on the GPU (Paszke et al., 2019). For the Reacher environment, we use the ground truth simulator, which we execute in parallel across 8 cores. (A constant-velocity transition sketch appears below the table.)
Software Dependencies | No | We used the RLlib (Liang et al., 2018) implementation of Proximal Policy Optimization (PPO) for learning control policies (Schulman et al., 2017). [...] For the Bouncing Balls environment, we approximate the transition function with a constant velocity model for all agents, implemented in PyTorch and executed on the GPU (Paszke et al., 2019). [...] Optimizer Adam (Kingma & Ba, 2014).
Experiment Setup | Yes | Table 2: PPO parameters. Table 4: Parameters for learning transition models. Table 5: Parameters for reward learning algorithms. These tables provide specific values for parameters such as Discount γ, GAE λ, # Timesteps Per Rollout, Train Epochs Per Rollout, # Minibatches Per Epoch, Entropy Coefficient, PPO Clip Range, Learning Rate, Hidden Layer Sizes, # Workers, # Environments Per Worker, Total Timesteps, Batch Size, Activation Function, Normalization Layer, Gradient Clip, Optimizer, Reward Regularization, Trajectory Length, and # Random Pairs OOD. (An RLlib PPO configuration sketch keyed to the Table 2 parameter names appears below.)
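Building on the Open Datasets and Dataset Splits rows, the sketch below organizes the Table 3 split sizes and shows one way transition datasets could be rolled out with a behavior policy. Only the split sizes come from the paper; the dictionary layout, the `collect_transitions` helper, the random policy, and the "Reacher-v2" Gym ID are illustrative assumptions.

```python
import gym
import numpy as np

# Split sizes from Table 3: (reward model training, reward model validation, reward evaluation).
DATASET_SIZES = {
    "bouncing_balls": (2_000_000, 500_000, 500_000),
    "reacher": (4_000_000, 1_000_000, 1_000_000),
}

def collect_transitions(env, policy, n_transitions):
    """Roll out `policy` in `env`, returning (s, a, s') arrays of transitions."""
    states, actions, next_states = [], [], []
    obs = env.reset()
    while len(states) < n_transitions:
        action = policy(obs)
        next_obs, _, done, _ = env.step(action)
        states.append(obs)
        actions.append(action)
        next_states.append(next_obs)
        obs = env.reset() if done else next_obs
    return np.array(states), np.array(actions), np.array(next_states)

# Example: build the Reacher reward-evaluation split with a random behavior policy
# (the paper collects data with two policies; a random policy is used here only for illustration).
env = gym.make("Reacher-v2")
random_policy = lambda obs: env.action_space.sample()
s, a, s_next = collect_transitions(env, random_policy, DATASET_SIZES["reacher"][2])
```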
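The Hardware Specification row notes that the Bouncing Balls transition function is approximated by a constant velocity model implemented in PyTorch and run on the GPU. A minimal sketch of such a model follows, assuming a per-agent state layout of [x, y, vx, vy] and a fixed timestep; both are illustrative assumptions rather than details taken from the paper.

```python
import torch

def constant_velocity_step(states: torch.Tensor, dt: float = 0.1) -> torch.Tensor:
    """Predict next states under a constant-velocity model.

    states: (..., 4) tensor laid out as [x, y, vx, vy] per agent (assumed layout).
    """
    positions, velocities = states[..., :2], states[..., 2:]
    next_positions = positions + dt * velocities   # positions advance linearly
    return torch.cat([next_positions, velocities], dim=-1)  # velocities unchanged

# Batched evaluation on the GPU when available, mirroring the paper's GPU execution.
device = "cuda" if torch.cuda.is_available() else "cpu"
states = torch.randn(1024, 4, device=device)
next_states = constant_velocity_step(states)
```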
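The Experiment Setup row lists the Table 2 PPO hyperparameters by name only. The sketch below shows how those names could map onto RLlib's (pre-2.0) PPO configuration; every numeric value is a placeholder rather than the paper's setting, and the key-to-table mapping, the environment ID, and the stopping condition are assumptions.

```python
import ray
from ray.rllib.agents.ppo import PPOTrainer

ppo_config = {
    "gamma": 0.99,                           # Discount γ (placeholder)
    "lambda": 0.95,                          # GAE λ (placeholder)
    "train_batch_size": 4000,                # # Timesteps Per Rollout (placeholder)
    "num_sgd_iter": 10,                      # Train Epochs Per Rollout (placeholder)
    "sgd_minibatch_size": 500,               # derived from # Minibatches Per Epoch (placeholder)
    "entropy_coeff": 0.01,                   # Entropy Coefficient (placeholder)
    "clip_param": 0.2,                       # PPO Clip Range (placeholder)
    "lr": 3e-4,                              # Learning Rate (placeholder)
    "model": {"fcnet_hiddens": [256, 256]},  # Hidden Layer Sizes (placeholder)
    "num_workers": 8,                        # # Workers (placeholder)
    "num_envs_per_worker": 4,                # # Environments Per Worker (placeholder)
}

ray.init()
trainer = PPOTrainer(config=ppo_config, env="Reacher-v2")  # environment ID assumed
total_timesteps = 1_000_000                  # Total Timesteps (placeholder)
while True:
    result = trainer.train()
    if result["timesteps_total"] >= total_timesteps:
        break
```

Note that RLlib specifies a minibatch size rather than a count of minibatches per epoch, so reproducing Table 2 exactly would require converting between the two.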