AI Alignment with Changing and Influenceable Reward Functions
Authors: Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Theoretical | Our main contributions can be summarized as follows: 1. We provide the formal language of Dynamic Reward MDPs (DR-MDPs) for analyzing AI decisions and influence in settings with changing reward functions. 2. We show how existing AI alignment techniques may systematically incentivize questionable influence when used in dynamic-reward settings. 3. By comparing 8 natural notions of alignment, and showing that they all may either fail to avoid undesirable influence or are impractically risk-averse, we elucidate trade-offs that seem inherent to choosing any objective. |
| Researcher Affiliation | Academia | UC Berkeley. Correspondence to: mdc@berkeley.edu. |
| Pseudocode | Yes | Algorithm 1: Learning reward functions and their dynamics |
| Open Source Code | No | The paper does not provide any links to open-source code or explicitly state that code for the methodology is being released. |
| Open Datasets | No | The paper uses illustrative toy examples (e.g., Conspiracy Influence DR-MDP, Writer’s curse, Clickbait DR-MDP) for theoretical analysis, not publicly available datasets for empirical training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical validation on datasets, thus no dataset splits for training, validation, or testing are provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not mention any specific software dependencies with version numbers needed to replicate its work. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations. |