Quantifying Differences in Reward Functions

Authors: Adam Gleave, Michael D. Dennis, Shane Legg, Stuart Russell, Jan Leike

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate EPIC and the baselines ERC and NPEC in a variety of continuous control tasks. In section 6.1, we compute the distance between hand-designed reward functions, finding EPIC to be the most reliable. NPEC has substantial approximation error, and ERC sometimes erroneously assigns high distance to equivalent rewards. Next, in section 6.2 we show EPIC is robust to the exact choice of coverage distribution D, whereas ERC and especially NPEC are highly sensitive to the choice of D. Finally, in section 6.3 we find that the distance of learned reward functions to a ground-truth reward predicts the return obtained by policy training, even in an unseen test environment.
Researcher Affiliation | Collaboration | Adam Gleave (1,2), Michael Dennis (1), Shane Legg (2), Stuart Russell (1), Jan Leike (3); 1: UC Berkeley, 2: DeepMind, 3: OpenAI
Pseudocode | No | The paper includes mathematical formulations and definitions but does not contain a distinct pseudocode or algorithm block.
Open Source Code | Yes | Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.
Open Datasets | Yes | We evaluate in the PointMaze MuJoCo task from Fu et al. [8], where a point mass agent must navigate around a wall to reach a goal. All algorithms are trained on synthetic data generated from the ground-truth reward function.
Dataset Splits | No | The paper does not provide specific details on training, validation, and test splits (e.g., percentages or sample counts) for its synthetic data or environments.
Hardware Specification | Yes | Experiments were conducted on a workstation (Intel i9-7920X CPU with 64 GB of RAM), and a small number of r5.24xlarge AWS VM instances, with 48 CPU cores on an Intel Skylake processor and 768 GB of RAM.
Software Dependencies | No | The paper mentions 'Stable Baselines [9]' and 'TensorFlow' but does not specify their version numbers for reproducibility.
Experiment Setup | Yes | Table A.1 summarizes the hyperparameters and distributions used to compute the distances between reward functions. Table A.2: Hyperparameters for proximal policy optimisation (PPO) [19]... Table A.3: Hyperparameters for adversarial inverse reinforcement learning (AIRL)... Table A.4: Hyperparameters for preference comparison... Table A.5: Hyperparameters for regression...
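To make concrete what "computing the distance between reward functions" in the Research Type row involves, below is a minimal NumPy sketch of the EPIC pseudometric as described in the paper: each reward is first canonicalized to strip away potential shaping, and the canonicalized rewards are then compared with the Pearson distance sqrt((1 - rho)/2) over transitions drawn from a coverage distribution. The reward-function signature, the batch layout, and the Monte Carlo estimators below are simplifying assumptions for illustration, not the authors' implementation (which lives in the evaluating-rewards repository linked above).

```python
"""Simplified sketch of the EPIC pseudometric (Gleave et al., ICLR 2021).

Assumptions: reward_fn(states, actions, next_states) takes 2D batches of
shape (batch, dim) and returns one reward per transition; dist_s and dist_a
are i.i.d. samples from the state and action distributions D_S and D_A.
"""
import numpy as np


def canonicalize(reward_fn, states, actions, next_states, dist_s, dist_a, gamma):
    """Canonically shaped reward: R(s,a,s') + E[g R(s',A,S') - R(s,A,S') - g R(S,A,S')]."""
    n, m = len(states), len(dist_s)

    def mean_over_samples(src):
        # For each source state, average R(src_i, A, S') over A ~ D_A, S' ~ D_S.
        out = np.empty(n)
        for i in range(n):
            rep = np.repeat(src[i][None, :], m, axis=0)
            out[i] = reward_fn(rep, dist_a, dist_s).mean()
        return out

    base = reward_fn(states, actions, next_states)
    mean_next = mean_over_samples(next_states)   # E[R(s', A, S')]
    mean_curr = mean_over_samples(states)        # E[R(s, A, S')]
    # E[R(S, A, S')]: pair each sampled state with the next sample as an
    # (approximately) independent successor.
    const = reward_fn(dist_s, dist_a, np.roll(dist_s, 1, axis=0)).mean()
    return base + gamma * mean_next - mean_curr - gamma * const


def pearson_distance(x, y):
    """D_rho(X, Y) = sqrt((1 - rho(X, Y)) / 2), bounded in [0, 1]."""
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)


def epic_distance(reward_a, reward_b, coverage, dist_s, dist_a, gamma=0.99):
    """EPIC distance between two reward functions on a coverage sample.

    `coverage` is a tuple (states, actions, next_states) drawn from the
    coverage distribution D.
    """
    s, a, s_next = coverage
    ca = canonicalize(reward_a, s, a, s_next, dist_s, dist_a, gamma)
    cb = canonicalize(reward_b, s, a, s_next, dist_s, dist_a, gamma)
    return pearson_distance(ca, cb)
```

Because canonicalization removes potential shaping and the Pearson distance is invariant to positive affine transformations, equivalent rewards receive distance zero under this construction, which is the property the Research Type row's comparison with ERC and NPEC relies on.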
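As a further illustration of the Experiment Setup row, the sketch below shows how PPO hyperparameters of the kind listed in Table A.2 would typically be passed to Stable Baselines, the RL library the paper reports using. The environment id and every numeric value here are placeholders chosen for the example, not the settings from Table A.2.

```python
"""Illustrative only: wiring PPO hyperparameters into Stable Baselines.

The environment is a stand-in (the paper's experiments use the PointMaze
MuJoCo task); all hyperparameter values below are placeholders.
"""
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

env = gym.make("Pendulum-v0")  # stand-in continuous-control task

model = PPO2(
    MlpPolicy,
    env,
    gamma=0.99,          # discount factor
    n_steps=2048,        # rollout length per update
    nminibatches=32,     # minibatches per optimisation epoch
    noptepochs=10,       # optimisation epochs per update
    learning_rate=3e-4,
    ent_coef=0.0,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```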