AI Alignment with Changing and Influenceable Reward Functions

Authors: Micah Carroll, Davis Foote, Anand Siththaranjan, Stuart Russell, Anca Dragan

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Theoretical | Our main contributions can be summarized as follows: (1) we provide the formal language of Dynamic Reward MDPs (DR-MDPs) for analyzing AI decisions and influence in settings with changing reward functions; (2) we show how existing AI alignment techniques may systematically incentivize questionable influence when used in dynamic-reward settings; (3) by comparing 8 natural notions of alignment and showing that they all may either fail to avoid undesirable influence or be impractically risk-averse, we elucidate trade-offs that seem inherent to choosing any objective. (An illustrative DR-MDP sketch appears after this table.) |
| Researcher Affiliation | Academia | UC Berkeley. Correspondence to: mdc@berkeley.edu. |
| Pseudocode | Yes | The paper includes Algorithm 1, "Learning reward functions and their dynamics" (a generic illustration of this idea also follows the table). |
| Open Source Code | No | The paper does not provide any links to open-source code or explicitly state that code for the methodology is being released. |
| Open Datasets | No | The paper uses illustrative toy examples (e.g., the Conspiracy Influence DR-MDP, Writer’s curse, and Clickbait DR-MDP) for theoretical analysis, not publicly available datasets for empirical training or evaluation. |
| Dataset Splits | No | The paper is theoretical and does not involve empirical validation on datasets, so no training, validation, or test splits are provided. |
| Hardware Specification | No | The paper is theoretical and does not describe any specific hardware used for experiments. |
| Software Dependencies | No | The paper is theoretical and does not mention any specific software dependencies with version numbers needed to replicate its work. |
| Experiment Setup | No | The paper is theoretical and does not describe an experimental setup with specific hyperparameters or training configurations. |
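For readers unfamiliar with the DR-MDP formalism referenced in the table, the following is a minimal sketch of how such a structure might be represented in code. All names, fields, and types are assumptions made for illustration; the paper defines DR-MDPs formally rather than via an implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

# Hypothetical type aliases for a small tabular setting (not from the paper).
State = str
Action = str
RewardParam = str  # indexes a family of reward functions theta_t


@dataclass
class DRMDP:
    """Sketch of a Dynamic Reward MDP: a standard MDP whose reward function
    is parameterized by theta, and where theta itself evolves over time,
    possibly influenced by the agent's own actions."""
    states: Tuple[State, ...]
    actions: Tuple[Action, ...]
    # Environment dynamics: P(s' | s, a)
    transition: Callable[[State, Action], Dict[State, float]]
    # Reward dynamics: P(theta' | theta, s, a). Because the agent's behavior
    # can shift the reward parameter, objectives defined over theta can
    # create incentives to influence it.
    reward_dynamics: Callable[[RewardParam, State, Action], Dict[RewardParam, float]]
    # Parameterized reward: r_theta(s, a)
    reward: Callable[[RewardParam, State, Action], float]
    discount: float = 0.99
```

The 8 notions of alignment compared in the paper differ, roughly, in which theta is used to evaluate each step of a trajectory (for example, the initial reward parameter versus the one current at each time step); the sketch above only fixes the data over which such objectives would be defined.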
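The paper's Algorithm 1 ("Learning reward functions and their dynamics") is not reproduced here. As a generic, hedged illustration of the idea its title names, the snippet below estimates tabular reward-parameter dynamics by maximum likelihood (frequency counts) from logged transitions; the data format, function name, and example values are assumptions, not the paper's.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# Hypothetical logged format: (state, action, theta, next_theta) tuples.
Transition = Tuple[str, str, str, str]


def estimate_reward_dynamics(
    transitions: List[Transition],
) -> Dict[Tuple[str, str, str], Dict[str, float]]:
    """Count-based maximum-likelihood estimate of P(theta' | theta, s, a).

    Generic tabular estimator for illustration only; it is not the paper's
    Algorithm 1, just a simple instance of 'learning reward dynamics'.
    """
    counts: Dict[Tuple[str, str, str], Counter] = defaultdict(Counter)
    for state, action, theta, next_theta in transitions:
        counts[(theta, state, action)][next_theta] += 1

    dynamics: Dict[Tuple[str, str, str], Dict[str, float]] = {}
    for key, counter in counts.items():
        total = sum(counter.values())
        dynamics[key] = {t_next: n / total for t_next, n in counter.items()}
    return dynamics


# Example usage with made-up data: a "clickbait" action tends to shift the
# user's reward parameter from 'prefers_news' toward 'prefers_clickbait'.
logged = [
    ("home", "show_clickbait", "prefers_news", "prefers_clickbait"),
    ("home", "show_clickbait", "prefers_news", "prefers_news"),
    ("home", "show_news", "prefers_news", "prefers_news"),
]
print(estimate_reward_dynamics(logged)[("prefers_news", "home", "show_clickbait")])
# {'prefers_clickbait': 0.5, 'prefers_news': 0.5}
```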