On the Sensitivity of Reward Inference to Misspecified Human Models

Authors: Joey Hong, Kush Bhatia, Anca Dragan

Venue: ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We study this question both theoretically and empirically. We do show that it is unfortunately possible to construct small adversarial biases in behavior that lead to arbitrarily large errors in the inferred reward. However, and arguably more importantly, we are also able to identify reasonable assumptions under which the reward inference error can be bounded linearly in the error in the human model. Finally, we verify our theoretical insights in discrete and continuous control tasks with simulated and human data. (A reward-inference sketch follows the table.)
Researcher Affiliation | Academia | Joey Hong (UC Berkeley, joey_hong@berkeley.edu); Kush Bhatia (Stanford); Anca Dragan (UC Berkeley)
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper mentions using 'OpenAI Gym' but does not provide a link or statement about releasing its own source code for the methodology.
Open Datasets | Yes | We consider both tabular navigation tasks on gridworld (Fu et al., 2019b), as well as more challenging continuous control tasks on the Lunar Lander game (Brockman et al., 2016b). (An environment-instantiation sketch follows the table.)
Dataset Splits | No | The paper mentions sampling datasets of a certain size but does not specify explicit training, validation, or test splits, nor does it refer to predefined splits from cited works for its experimental data.
Hardware Specification | No | The paper does not specify any hardware details (e.g., GPU models, CPU types) used for running the experiments.
Software Dependencies | No | The paper mentions 'soft actor-critic (SAC) (Haarnoja et al., 2018)' and 'OpenAI Gym (Brockman et al., 2016a)' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | The policy is parameterized as a 3-layer fully-connected neural network with hidden dimension of 128, and outputs a squashed Gaussian distribution over actions. Because the state and action space are continuous, we use soft actor-critic (SAC) (Haarnoja et al., 2018) with fixed entropy regularization α = 1. We train the policy for 600 episodes of length at most 1,000, with a batch size of 264, until the policy was able to land on the landing pad with a high success rate. (A policy/configuration sketch follows the table.)
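
The Research Type row above describes reward inference under a human model that may be misspecified. The following minimal, hypothetical sketch (not the authors' code) illustrates the basic setup on a single discrete choice set: a simulated human picks options Boltzmann-rationally with rationality coefficient `beta_true`, and the learner fits rewards by maximum likelihood while assuming a different coefficient `beta_assumed`, emulating a misspecified human model. All names and values here are illustrative assumptions; the paper's exact formulation may differ.

```python
# Hypothetical sketch: reward inference under a (possibly misspecified)
# Boltzmann-rational human model. Not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
K = 5                        # number of discrete options
r_true = rng.normal(size=K)  # ground-truth reward of each option
beta_true, beta_assumed = 2.0, 1.0  # true vs. assumed rationality (misspecified)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Simulate human choices from the true Boltzmann model.
n_demos = 5000
choices = rng.choice(K, size=n_demos, p=softmax(beta_true * r_true))
counts = np.bincount(choices, minlength=K)

# Maximum-likelihood reward estimate under the assumed model,
# via gradient ascent on the categorical log-likelihood.
r_hat = np.zeros(K)
lr = 0.1
for _ in range(2000):
    p = softmax(beta_assumed * r_hat)
    grad = beta_assumed * (counts - n_demos * p)  # d(log-lik)/d(r_hat)
    r_hat += lr * grad / n_demos

# Rewards are identifiable only up to an additive constant, so center both.
print("true    :", np.round(r_true - r_true.mean(), 2))
print("inferred:", np.round(r_hat - r_hat.mean(), 2))
```

With `beta_assumed != beta_true`, the inferred rewards come out rescaled relative to the truth; the paper's question is how much worse such errors can become under more general misspecification of the human model, and under what assumptions they stay bounded.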
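
The Open Datasets row points to standard, publicly available environments. A minimal instantiation sketch for the continuous-control Lunar Lander task through OpenAI Gym is below; the environment id, the classic `gym` (pre-0.26) API, and the `box2d` extra are assumptions on my part, since the paper does not pin package versions.

```python
# Hypothetical sketch: instantiating the Lunar Lander task with classic OpenAI Gym.
# Requires `pip install gym[box2d]`; newer gymnasium releases use a different
# reset()/step() signature.
import gym

env = gym.make("LunarLanderContinuous-v2")  # continuous-control variant
obs = env.reset()
done, episode_return = False, 0.0
while not done:
    action = env.action_space.sample()       # random placeholder policy
    obs, reward, done, info = env.step(action)
    episode_return += reward
print("episode return:", episode_return)
```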
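
The Experiment Setup row quotes concrete hyperparameters: a 3-layer fully-connected policy with hidden dimension 128, a squashed Gaussian action distribution, SAC with fixed entropy coefficient α = 1, 600 training episodes of at most 1,000 steps, and batch size 264. A hedged PyTorch-style sketch of such a policy and configuration is below; the class name, layer-count interpretation, and dictionary keys are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of a squashed-Gaussian SAC policy
# matching the quoted setup: fully connected, hidden dimension 128.
import torch
import torch.nn as nn

class SquashedGaussianPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        # Interpreting "3-layer" as three hidden layers; the paper's exact
        # convention (hidden vs. total layers) is not stated.
        self.backbone = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, act_dim)
        self.log_std = nn.Linear(hidden, act_dim)

    def forward(self, obs):
        h = self.backbone(obs)
        mean, log_std = self.mean(h), self.log_std(h).clamp(-5.0, 2.0)
        dist = torch.distributions.Normal(mean, log_std.exp())
        pre_tanh = dist.rsample()
        action = torch.tanh(pre_tanh)  # squashed Gaussian sample in (-1, 1)
        # Log-probability with the tanh change-of-variables correction.
        log_prob = dist.log_prob(pre_tanh) - torch.log(1.0 - action.pow(2) + 1e-6)
        return action, log_prob.sum(dim=-1)

# Training settings quoted in the paper, mapped onto assumed config keys.
sac_config = dict(
    entropy_coefficient=1.0,   # fixed alpha = 1
    episodes=600,
    max_episode_length=1000,
    batch_size=264,
)
```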