Optimal Policies Tend To Seek Power

Authors: Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our contributions are threefold. First, we develop a formal theory of power-seeking... Second, we provide empirical evidence of power-seeking policies in tabular and deep reinforcement learning (RL) agents via a suite of Gridworld experiments... Our theoretical results explain why optimal policies tend to seek power, and our empirical demonstrations indicate that this phenomenon is already present in simple environments with current RL methods.
Researcher Affiliation | Collaboration | Alex Turner (1), Zachary Kent (1), Andrew Critch (2), Richard Ngo (3), David Lindner (1), Lawrence Chan (1), David Krueger (4), Jan Leike (3). Affiliations: (1) DeepMind, (2) UC Berkeley, (3) OpenAI, (4) University of Cambridge, Vector Institute, CIFAR.
Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper.
Open Source Code | No | Code for the experiments and plots is available upon request.
Open Datasets | Yes | We consider a simple Gridworld environment... For our MiniGrid experiments, we use the MiniGrid library... (see the environment sketch after this table)
Dataset Splits | No | No specific dataset split information (percentages, sample counts, or detailed splitting methodology) is provided for train, validation, or test sets.
Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running experiments are provided.
Software Dependencies | No | The paper mentions the 'MiniGrid library' and 'JAX' but does not provide specific version numbers for these or other software dependencies.
Experiment Setup | Yes | Appendix B: Experimental Details... Specifically, we use an Adam optimizer with a learning rate of 10^-4 and a batch size of 32. The discount factor γ is 0.99. For the MiniGrid experiments, we train the agent for 200 million environment steps. For the tabular Gridworld experiments, we use a learning rate of 0.1 for Q-learning. (see the configuration sketch after this table)
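The Open Datasets row notes only that the gridworld experiments build on the MiniGrid library. Below is a minimal sketch of instantiating such an environment; the environment ID, the ImgObsWrapper, and the use of the current Farama minigrid/Gymnasium API are assumptions rather than details taken from the paper (the original experiments may have used the older gym-minigrid interface).

```python
# Illustrative only: the paper does not specify an environment ID or wrappers.
import gymnasium as gym
import minigrid  # noqa: F401  (importing registers the MiniGrid-* env IDs)
from minigrid.wrappers import ImgObsWrapper

# Hypothetical environment choice; ImgObsWrapper keeps only the image observation.
env = ImgObsWrapper(gym.make("MiniGrid-Empty-8x8-v0"))

obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    action = env.action_space.sample()  # random placeholder policy
    obs, reward, terminated, truncated, info = env.step(action)
env.close()
```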
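The Experiment Setup row quotes concrete hyperparameters: Adam with a learning rate of 10^-4, a batch size of 32, a discount factor of 0.99, 200 million environment steps for MiniGrid, and a learning rate of 0.1 for tabular Q-learning. The sketch below simply collects those values into a configuration object and shows a single tabular Q-learning update under that configuration; the names TrainConfig and q_update are hypothetical, and the surrounding training loop is omitted.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class TrainConfig:
    # Values quoted in the Experiment Setup row; everything else is assumed.
    adam_learning_rate: float = 1e-4   # deep RL (MiniGrid) optimizer step size
    batch_size: int = 32
    discount: float = 0.99             # gamma
    env_steps: int = 200_000_000       # MiniGrid training budget
    tabular_q_lr: float = 0.1          # tabular Gridworld Q-learning step size


def q_update(q: np.ndarray, s: int, a: int, r: float, s_next: int,
             cfg: TrainConfig) -> None:
    """One tabular Q-learning update: Q(s,a) += alpha * (TD target - Q(s,a))."""
    td_target = r + cfg.discount * q[s_next].max()
    q[s, a] += cfg.tabular_q_lr * (td_target - q[s, a])


# Usage: a 10-state, 4-action table updated with one fictitious transition.
cfg = TrainConfig()
q_table = np.zeros((10, 4))
q_update(q_table, s=0, a=1, r=1.0, s_next=3, cfg=cfg)
```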