Consequences of Misaligned AI
Authors: Simon Zhuang, Dylan Hadfield-Menell
NeurIPS 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our model (Fig. 1, left) considers a resource-constrained world where the L attributes of the state correspond to different sources of utility for the (human) principal. We model incomplete specification by limiting the (artificial) agent’s reward function to have support on J < L attributes of the world. Our main result identifies conditions such that any misalignment is costly: starting from any initial state, optimizing any fixed incomplete proxy eventually leads the principal to be arbitrarily worse off. We show relaxing the assumptions of this theorem allows the principal to gain utility from the autonomous agent. Our results provide theoretical justification for impact avoidance (23) and interactive reward learning (19) as solutions to alignment problems. |
| Researcher Affiliation | Academia | Simon Zhuang, Center for Human-Compatible AI, University of California, Berkeley, Berkeley, CA 94709, simonzhuang@berkeley.edu; Dylan Hadfield-Menell, Center for Human-Compatible AI, University of California, Berkeley, Berkeley, CA 94709, dhm@berkeley.edu |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. |
| Open Datasets | No | The paper presents a theoretical model and does not use or reference any publicly available dataset for training. |
| Dataset Splits | No | The paper is theoretical and does not involve dataset splits for validation. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not mention specific software dependencies with version numbers. |
| Experiment Setup | No | The paper is theoretical and does not detail any experimental setup or hyperparameters. |
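
The incomplete-proxy setting quoted in the Research Type row can be illustrated with a small numerical toy. The sketch below is not from the paper, which provides no code or pseudocode. The exponential utility form, the fixed-sum resource constraint, the transfer rule, and all names (`utility`, `proxy`, `shift_toward_proxy`) are assumptions chosen purely for illustration. The rule is also not an exact optimizer: it only moves resources in a direction that raises the proxy at every step.

```python
import math


def utility(state):
    """Assumed principal utility: increasing and concave in each of the L attributes."""
    return sum(-math.exp(-x) for x in state)


def proxy(state, num_proxy_attrs):
    """Assumed incomplete proxy: the same form, but supported on the first J < L attributes."""
    return sum(-math.exp(-x) for x in state[:num_proxy_attrs])


def shift_toward_proxy(state, num_proxy_attrs, steps, delta=0.5):
    """Repeatedly move `delta` units out of every attribute the proxy ignores
    and split them evenly among the attributes it rewards.

    The total sum(state) stays constant, standing in for a resource constraint.
    Returns the final state and a list of (proxy value, true utility) pairs.
    """
    s = list(state)
    num_attrs = len(s)
    num_ignored = num_attrs - num_proxy_attrs
    history = [(proxy(s, num_proxy_attrs), utility(s))]
    for _ in range(steps):
        gain = delta * num_ignored / num_proxy_attrs
        for i in range(num_proxy_attrs):
            s[i] += gain
        for i in range(num_proxy_attrs, num_attrs):
            s[i] -= delta
        history.append((proxy(s, num_proxy_attrs), utility(s)))
    return s, history


# Hypothetical setup: L = 4 attributes, proxy supported on J = 2 of them.
final_state, history = shift_toward_proxy([0.0, 0.0, 0.0, 0.0], 2, steps=20)
for step, (proxy_value, true_utility) in enumerate(history):
    if step % 5 == 0:
        print(f"step {step:2d}  proxy={proxy_value:.4f}  utility={true_utility:.2f}")
```

Under these assumptions the proxy value rises at every step while the principal's utility keeps falling, because the attributes the proxy ignores are drained to feed the ones it rewards. This mirrors, in a toy form only, the qualitative claim that optimizing a fixed incomplete proxy can leave the principal arbitrarily worse off.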