AI Alignment with Changing and Influenceable Reward Functions

Authors: Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Theoretical | Our main contributions can be summarized as follows: (1) we provide the formal language of Dynamic Reward MDPs (DR-MDPs) for analyzing AI decisions and influence in settings with changing reward functions; (2) we show how existing AI alignment techniques may systematically incentivize questionable influence when used in dynamic-reward settings; (3) by comparing 8 natural notions of alignment and showing that they all may either fail to avoid undesirable influence or be impractically risk-averse, we elucidate trade-offs that seem inherent to choosing any objective. (An illustrative DR-MDP sketch appears after this table.) |
| Researcher Affiliation | Academia | UC Berkeley. Correspondence to: mdc@berkeley.edu. |
| Pseudocode | Yes | The paper includes Algorithm 1, "Learning reward functions and their dynamics" (a generic illustration of this idea also follows the table). |
| Open Source Code | No | The paper does not provide any links to open-source code or explicitly state that code for the methodology is being released. |
| Open Datasets | No | The paper uses illustrative toy examples (e.g., the Conspiracy Influence DR-MDP, Writer’s curse, and Clickbait DR-MDP) for theoretical analysis, not publicly available datasets for empirical training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical validation on datasets, so no training, validation, or test splits are provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not mention any specific software dependencies with version numbers needed to replicate its work. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations. |
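For readers unfamiliar with the DR-MDP formalism referenced in the table, the following is a minimal sketch of how such a structure might be represented in code. All names, fields, and types are assumptions made for illustration; the paper defines DR-MDPs formally rather than via an implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical type aliases for a small tabular setting (not from the paper).
State = str
Action = str
RewardParam = str  # indexes a family of reward functions theta_t


@dataclass
class DRMDP:
    """Sketch of a Dynamic Reward MDP: a standard MDP whose reward function
    is parameterized by theta, and where theta itself evolves over time,
    possibly influenced by the agent's own actions."""
    states: Tuple[State, ...]
    actions: Tuple[Action, ...]
    # Environment dynamics: P(s' | s, a)
    transition: Callable[[State, Action], Dict[State, float]]
    # Reward dynamics: P(theta' | theta, s, a). Because the agent's behavior
    # can shift the reward parameter, objectives defined over theta can
    # create incentives to influence it.
    reward_dynamics: Callable[[RewardParam, State, Action], Dict[RewardParam, float]]
    # Parameterized reward: r_theta(s, a)
    reward: Callable[[RewardParam, State, Action], float]
    discount: float = 0.99
```

The 8 notions of alignment compared in the paper differ, roughly, in which theta is used to evaluate each step of a trajectory (for example, the initial reward parameter versus the one current at each time step); the sketch above only fixes the data over which such objectives would be defined.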
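The paper's Algorithm 1 ("Learning reward functions and their dynamics") is not reproduced here. As a generic, hedged illustration of the idea its title names, the snippet below estimates tabular reward-parameter dynamics by maximum likelihood (frequency counts) from logged transitions; the data format, function name, and example values are assumptions, not the paper's.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical logged format: (state, action, theta, next_theta) tuples.
Transition = Tuple[str, str, str, str]


def estimate_reward_dynamics(
    transitions: List[Transition],
) -> Dict[Tuple[str, str, str], Dict[str, float]]:
    """Count-based maximum-likelihood estimate of P(theta' | theta, s, a).

    Generic tabular estimator for illustration only; it is not the paper's
    Algorithm 1, just a simple instance of 'learning reward dynamics'.
    """
    counts: Dict[Tuple[str, str, str], Counter] = defaultdict(Counter)
    for state, action, theta, next_theta in transitions:
        counts[(theta, state, action)][next_theta] += 1

    dynamics: Dict[Tuple[str, str, str], Dict[str, float]] = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        dynamics[key] = {t_next: n / total for t_next, n in counter.items()}
    return dynamics


# Example usage with made-up data: a "clickbait" action tends to shift the
# user's reward parameter from 'prefers_news' toward 'prefers_clickbait'.
logged = [
    ("home", "show_clickbait", "prefers_news", "prefers_clickbait"),
    ("home", "show_clickbait", "prefers_news", "prefers_news"),
    ("home", "show_news", "prefers_news", "prefers_news"),
]
print(estimate_reward_dynamics(logged)[("prefers_news", "home", "show_clickbait")])
# {'prefers_clickbait': 0.5, 'prefers_news': 0.5}
```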