Goodhart's Law in Reinforcement Learning

Authors: Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Max Viktor Skalse

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. Finally, we evaluate our early stopping method experimentally."
Researcher Affiliation | Academia | Jacek Karwowski (1), Oliver Hayman (1), Xingjian Bai (1), Klaus Kiendlhofer (2), Charlie Griffin (1), Joar Skalse (1,3); 1: University of Oxford, 2: Independent, 3: Future of Humanity Institute
Pseudocode | Yes | Algorithm 1: Iterative improvement algorithm
Open Source Code | No | The paper does not provide a direct statement about open-sourcing its code, nor a link to a repository for the described methodology.
Open Datasets | No | The paper describes custom-generated environments and reward sampling schemes (e.g., 'Random MDP is an environment in which... the transition matrix τ is sampled uniformly', 'Gridworld', 'Cliff'), but it does not provide concrete access information (links, DOIs, formal citations) for a publicly available or open dataset used for training. The datasets are generated for the experiments rather than being pre-existing. (A sketch of generating such an environment appears after the table.)
Dataset Splits | No | The paper details the generation of various environments and the parameters varied for experiments (e.g., 'vary all hyperparameters of MDPs in a grid search manner', 'temporal discount factor γ ∈ {0.5, 0.7, 0.9, 0.99}'), but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. The experiments involve simulating policies in generated environments rather than using a fixed dataset partitioned for training and validation.
Hardware Specification | Yes | "Overall, the process took about 100 hours of a c5a.16xlarge instance with 64 cores and 128 GB RAM, as well as about 100 hours of a t2.2xlarge instance with 8 cores and 32 GB RAM."
Software Dependencies | No | The paper mentions that the 'optimisation algorithm is Value Iteration' and that 'Maximal Causal Entropy (MCE)' and 'Boltzmann Rationality (BR)' are used, but it does not specify any software libraries, frameworks, or their version numbers (e.g., Python 3.x, PyTorch x.x, TensorFlow x.x). (A textbook value-iteration sketch appears after the table.)
Experiment Setup | Yes | "Specifically, we sample: Gridworld for grid lengths n ∈ {2, 3, . . . , 14}... For each of those, we also vary temporal discount factor γ ∈ {0.5, 0.7, 0.9, 0.99}, sparsity factor σ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, optimisation pressure λ = log(x) for 7 values of x evenly spaced on [0.01, 0.75] and 20 values evenly spaced on [0.8, 0.99]. Each run consists of 10 proxy rewards; we use threshold θ = 0.001 for value iteration." (The hyperparameter grid from this quote is sketched after the table.)
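
For readers who want to re-create environments of the kind quoted in the Open Datasets row, below is a minimal sketch of sampling a "Random MDP". The tensor shapes, the normalised-uniform transition sampling, and the use of the sparsity factor σ to zero out reward entries are assumptions made for illustration; the paper's exact sampling scheme may differ.

```python
import numpy as np

def sample_random_mdp(n_states, n_actions, sparsity, rng=None):
    """Sketch of a 'Random MDP': uniformly sampled transition matrix and a
    sparse reward vector. Shapes and the sparsity mechanism are assumptions,
    not the authors' exact construction."""
    rng = np.random.default_rng(rng)
    # Sample raw transition probabilities uniformly, then normalise each
    # (state, action) row into a valid distribution over next states.
    tau = rng.uniform(size=(n_states, n_actions, n_states))
    tau /= tau.sum(axis=-1, keepdims=True)
    # Sparse reward: each state gets a uniform reward, zeroed out with
    # probability equal to the sparsity factor sigma.
    reward = rng.uniform(size=n_states)
    reward[rng.uniform(size=n_states) < sparsity] = 0.0
    return tau, reward
```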
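
The paper states that the optimisation algorithm is Value Iteration with stopping threshold θ = 0.001. The following is a standard textbook value-iteration loop consistent with that description, not the authors' code; the convention of rewarding the successor state is an assumption.

```python
import numpy as np

def value_iteration(tau, reward, gamma, theta=1e-3):
    """Standard tabular value iteration.

    tau:    transition tensor of shape (S, A, S)
    reward: per-state reward vector of shape (S,)
    gamma:  temporal discount factor
    theta:  convergence threshold (the paper reports theta = 0.001)
    """
    n_states, n_actions, _ = tau.shape
    values = np.zeros(n_states)
    while True:
        # Q(s, a) = sum_s' tau(s, a, s') * (reward(s') + gamma * V(s'))
        q = tau @ (reward + gamma * values)
        new_values = q.max(axis=1)
        if np.max(np.abs(new_values - values)) < theta:
            # Return converged state values and the greedy policy.
            return new_values, q.argmax(axis=1)
        values = new_values
```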
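
The Experiment Setup quote enumerates a grid of hyperparameters. A sketch of enumerating that grid is given below, assuming "evenly spaced" means linear spacing and that the optimisation pressure is λ = log(x); both readings are assumptions about the quoted text.

```python
import itertools
import numpy as np

# Hyperparameter grid as quoted in the experiment setup.
grid_lengths = range(2, 15)              # grid lengths n in {2, ..., 14}
gammas = [0.5, 0.7, 0.9, 0.99]           # temporal discount factor
sparsities = [0.1, 0.3, 0.5, 0.7, 0.9]   # sparsity factor sigma
xs = np.concatenate([np.linspace(0.01, 0.75, 7), np.linspace(0.8, 0.99, 20)])
pressures = np.log(xs)                   # optimisation pressure lambda (assumed form)

for n, gamma, sigma, lam in itertools.product(grid_lengths, gammas, sparsities, pressures):
    # Per the quote: sample 10 proxy rewards per configuration and optimise
    # each with value iteration (threshold 0.001).
    pass
```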