Quantifying Differences in Reward Functions

Authors: Adam Gleave, Michael D. Dennis, Shane Legg, Stuart Russell, Jan Leike

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate EPIC and the baselines ERC and NPEC in a variety of continuous control tasks. In section 6.1, we compute the distance between hand-designed reward functions, finding EPIC to be the most reliable. NPEC has substantial approximation error, and ERC sometimes erroneously assigns high distance to equivalent rewards. Next, in section 6.2 we show EPIC is robust to the exact choice of coverage distribution D, whereas ERC and especially NPEC are highly sensitive to the choice of D. Finally, in section 6.3 we find that the distance of learned reward functions to a ground-truth reward predicts the return obtained by policy training, even in an unseen test environment.
Researcher Affiliation | Collaboration | Adam Gleave (1,2), Michael Dennis (1), Shane Legg (2), Stuart Russell (1), Jan Leike (3); 1: UC Berkeley, 2: DeepMind, 3: OpenAI
Pseudocode | No | The paper includes mathematical formulations and definitions but does not contain a distinct pseudocode or algorithm block.
Open Source Code | Yes | Our source code is available at https://github.com/HumanCompatibleAI/evaluating-rewards.
Open Datasets | Yes | We evaluate in the PointMaze MuJoCo task from Fu et al. [8], where a point mass agent must navigate around a wall to reach a goal. All algorithms are trained on synthetic data generated from the ground-truth reward function.
Dataset Splits | No | The paper does not provide specific details on training, validation, and test splits (e.g., percentages or sample counts) for its synthetic data or environments.
Hardware Specification | Yes | Experiments were conducted on a workstation (Intel i9-7920X CPU with 64 GB of RAM), and a small number of r5.24xlarge AWS VM instances, with 48 CPU cores on an Intel Skylake processor and 768 GB of RAM.
Software Dependencies | No | The paper mentions 'Stable Baselines [9]' and 'TensorFlow' but does not specify their version numbers for reproducibility.
Experiment Setup | Yes | Table A.1 summarizes the hyperparameters and distributions used to compute the distances between reward functions. Table A.2: Hyperparameters for proximal policy optimisation (PPO) [19]... Table A.3: Hyperparameters for adversarial inverse reinforcement learning (AIRL)... Table A.4: Hyperparameters for preference comparison... Table A.5: Hyperparameters for regression...
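To make concrete what "computing the distance between reward functions" in the Research Type row involves, below is a minimal NumPy sketch of the EPIC pseudometric as described in the paper: each reward is first canonicalized to strip away potential shaping, and the canonicalized rewards are then compared with the Pearson distance sqrt((1 - rho)/2) over transitions drawn from a coverage distribution. The reward-function signature, the batch layout, and the Monte Carlo estimators below are simplifying assumptions for illustration, not the authors' implementation (which lives in the evaluating-rewards repository linked above).

```python
"""Simplified sketch of the EPIC pseudometric (Gleave et al., ICLR 2021).

Assumptions: reward_fn(states, actions, next_states) takes 2D batches of
shape (batch, dim) and returns one reward per transition; dist_s and dist_a
are i.i.d. samples from the state and action distributions D_S and D_A.
"""
import numpy as np


def canonicalize(reward_fn, states, actions, next_states, dist_s, dist_a, gamma):
    """Canonically shaped reward: R(s,a,s') + E[g R(s',A,S') - R(s,A,S') - g R(S,A,S')]."""
    n, m = len(states), len(dist_s)

    def mean_over_samples(src):
        # For each source state, average R(src_i, A, S') over A ~ D_A, S' ~ D_S.
        out = np.empty(n)
        for i in range(n):
            rep = np.repeat(src[i][None, :], m, axis=0)
            out[i] = reward_fn(rep, dist_a, dist_s).mean()
        return out

    base = reward_fn(states, actions, next_states)
    mean_next = mean_over_samples(next_states)   # E[R(s', A, S')]
    mean_curr = mean_over_samples(states)        # E[R(s, A, S')]
    # E[R(S, A, S')]: pair each sampled state with the next sample as an
    # (approximately) independent successor.
    const = reward_fn(dist_s, dist_a, np.roll(dist_s, 1, axis=0)).mean()
    return base + gamma * mean_next - mean_curr - gamma * const


def pearson_distance(x, y):
    """D_rho(X, Y) = sqrt((1 - rho(X, Y)) / 2), bounded in [0, 1]."""
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(0.0, 1.0 - rho) / 2.0)


def epic_distance(reward_a, reward_b, coverage, dist_s, dist_a, gamma=0.99):
    """EPIC distance between two reward functions on a coverage sample.

    `coverage` is a tuple (states, actions, next_states) drawn from the
    coverage distribution D.
    """
    s, a, s_next = coverage
    ca = canonicalize(reward_a, s, a, s_next, dist_s, dist_a, gamma)
    cb = canonicalize(reward_b, s, a, s_next, dist_s, dist_a, gamma)
    return pearson_distance(ca, cb)
```

Because canonicalization removes potential shaping and the Pearson distance is invariant to positive affine transformations, equivalent rewards receive distance zero under this construction, which is the property the Research Type row's comparison with ERC and NPEC relies on.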
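As a further illustration of the Experiment Setup row, the sketch below shows how PPO hyperparameters of the kind listed in Table A.2 would typically be passed to Stable Baselines, the RL library the paper reports using. The environment id and every numeric value here are placeholders chosen for the example, not the settings from Table A.2.

```python
"""Illustrative only: wiring PPO hyperparameters into Stable Baselines.

The environment is a stand-in (the paper's experiments use the PointMaze
MuJoCo task); all hyperparameter values below are placeholders.
"""
import gym
from stable_baselines import PPO2
from stable_baselines.common.policies import MlpPolicy

env = gym.make("Pendulum-v0")  # stand-in continuous-control task

model = PPO2(
    MlpPolicy,
    env,
    gamma=0.99,          # discount factor
    n_steps=2048,        # rollout length per update
    nminibatches=32,     # minibatches per optimisation epoch
    noptepochs=10,       # optimisation epochs per update
    learning_rate=3e-4,
    ent_coef=0.0,
    verbose=1,
)
model.learn(total_timesteps=1_000_000)
```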