Consequences of Misaligned AI

Authors: Simon Zhuang, Dylan Hadfield-Menell

NeurIPS 2020

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Our model (Fig. 1, left) considers a resource-constrained world where the L attributes of the state correspond to different sources of utility for the (human) principal. We model incomplete specification by limiting the (artificial) agent's reward function to have support on J < L attributes of the world. Our main result identifies conditions such that any misalignment is costly: starting from any initial state, optimizing any fixed incomplete proxy eventually leaves the principal arbitrarily worse off. We show that relaxing the assumptions of this theorem allows the principal to gain utility from the autonomous agent. Our results provide theoretical justification for impact avoidance (23) and interactive reward learning (19) as solutions to alignment problems. (An illustrative numerical sketch of this proxy-optimization dynamic follows the table.)
Researcher Affiliation | Academia | Simon Zhuang, Center for Human-Compatible AI, University of California, Berkeley, Berkeley, CA 94709, simonzhuang@berkeley.edu; Dylan Hadfield-Menell, Center for Human-Compatible AI, University of California, Berkeley, Berkeley, CA 94709, dhm@berkeley.edu
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide concrete access to source code for the methodology described.
Open Datasets | No | The paper presents a theoretical model and does not use or reference any publicly available dataset for training.
Dataset Splits | No | The paper is theoretical and does not involve dataset splits for validation.
Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments.
Software Dependencies | No | The paper is theoretical and does not mention specific software dependencies with version numbers.
Experiment Setup | No | The paper is theoretical and does not detail any experimental setup or hyperparameters.
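
As a companion to the Research Type entry above, here is a minimal numerical sketch (in Python) of the dynamic the abstract describes: an agent optimizes a proxy reward supported on only J < L attributes while all attributes draw on a shared resource budget, and the principal's true utility depends on every attribute. The attribute counts, budget, log-shaped utility, and greedy reallocation rule are illustrative assumptions, not the paper's formal construction.

import numpy as np

# Illustrative toy only (assumed setup, not the paper's formal model):
# L attributes share a fixed resource budget; the principal's utility
# depends on all of them, but the agent's proxy reward is supported on
# only the first J < L attributes.

L, J = 4, 2                 # total attributes vs. attributes visible to the proxy
budget = 10.0               # resource constraint: attribute values sum to the budget
s = np.full(L, budget / L)  # start from an even allocation

def true_utility(state):
    # Log utility makes every attribute essential to the principal.
    return float(np.sum(np.log(state)))

def proxy_reward(state):
    # The agent only "sees" the first J attributes.
    return float(np.sum(state[:J]))

step = 0.05
for t in range(200):
    # Greedy proxy optimization under the budget: move resources from the
    # largest unreferenced attribute into the smallest proxy attribute.
    donor = J + int(np.argmax(s[J:]))
    transfer = min(step, s[donor] - 1e-6)  # keep attributes strictly positive
    s[donor] -= transfer
    s[int(np.argmin(s[:J]))] += transfer

print("proxy reward:", round(proxy_reward(s), 3))  # climbs toward the full budget
print("true utility:", round(true_utility(s), 3))  # collapses as unreferenced attributes vanish

Under this toy dynamic the proxy reward approaches the full budget while the true utility falls sharply as the unreferenced attributes are driven toward zero, which is the qualitative pattern the paper's main theorem formalizes.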