Optimal Policies Tend To Seek Power
Authors: Alex Turner, Logan Smith, Rohin Shah, Andrew Critch, Prasad Tadepalli
NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our contributions are threefold. First, we develop a formal theory of power-seeking... Second, we provide empirical evidence of power-seeking policies in tabular and deep reinforcement learning (RL) agents via a suite of Gridworld experiments... Our theoretical results explain why optimal policies tend to seek power, and our empirical demonstrations indicate that this phenomenon is already present in simple environments with current RL methods. |
| Researcher Affiliation | Collaboration | Alex Turner1, Zachary Kent1, Andrew Critch2, Richard Ngo3, David Lindner1, Lawrence Chan1, David Krueger4, Jan Leike3 — 1 DeepMind, 2 UC Berkeley, 3 OpenAI, 4 University of Cambridge, Vector Institute, CIFAR |
| Pseudocode | No | No structured pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | No | Code for the experiments and plots is available upon request. |
| Open Datasets | Yes | We consider a simple Gridworld environment... For our MiniGrid experiments, we use the MiniGrid library... |
| Dataset Splits | No | No specific dataset split information (percentages, sample counts, or detailed splitting methodology) is provided for train, validation, or test sets. |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, processor types, memory amounts) used for running experiments are provided. |
| Software Dependencies | No | The paper mentions 'MiniGrid library' and 'JAX' but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | Appendix B: Experimental Details... Specifically, we use an Adam optimizer with a learning rate of 10^-4 and a batch size of 32. The discount factor γ is 0.99. For the MiniGrid experiments, we train the agent for 200 million environment steps. For the tabular Gridworld experiments, we use a learning rate of 0.1 for Q-learning. |
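
For context, below is a minimal sketch of the training configuration reported in the Experiment Setup row. The config structure, variable names, and the `q_learning_update` helper are illustrative assumptions for reproduction purposes, not code taken from the paper.

```python
# Hypothetical reconstruction of the hyperparameters quoted from Appendix B.
# Names and structure are assumptions; only the numeric values come from the paper.
import numpy as np

CONFIG = {
    "deep_rl": {                      # MiniGrid experiments
        "optimizer": "adam",
        "learning_rate": 1e-4,
        "batch_size": 32,
        "discount": 0.99,
        "env_steps": 200_000_000,     # 200 million environment steps
    },
    "tabular": {                      # Gridworld Q-learning experiments
        "q_learning_rate": 0.1,
        "discount": 0.99,
    },
}

def q_learning_update(q_table, s, a, r, s_next, cfg=CONFIG["tabular"]):
    """One tabular Q-learning step using the reported learning rate and discount."""
    td_target = r + cfg["discount"] * np.max(q_table[s_next])
    q_table[s, a] += cfg["q_learning_rate"] * (td_target - q_table[s, a])
    return q_table

# Example usage on a toy 5-state, 4-action table.
q = np.zeros((5, 4))
q = q_learning_update(q, s=0, a=1, r=1.0, s_next=2)
```

Because the paper does not pin library versions for MiniGrid or JAX (see the Software Dependencies row), any such reconstruction would need the versions fixed independently before results could be compared.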