Goodhart's Law in Reinforcement Learning
Authors: Jacek Karwowski, Oliver Hayman, Xingjian Bai, Klaus Kiendlhofer, Charlie Griffin, Joar Max Viktor Skalse
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | First, we propose a way to quantify the magnitude of this effect and show empirically that optimising an imperfect proxy reward often leads to the behaviour predicted by Goodhart's law for a wide range of environments and reward functions. Finally, we evaluate our early stopping method experimentally. |
| Researcher Affiliation | Academia | Jacek Karwowski (1), Oliver Hayman (1), Xingjian Bai (1), Klaus Kiendlhofer (2), Charlie Griffin (1), Joar Skalse (1,3); (1) University of Oxford, (2) Independent, (3) Future of Humanity Institute |
| Pseudocode | Yes | Algorithm 1 Iterative improvement algorithm |
| Open Source Code | No | The paper does not provide a direct statement about open-sourcing their code or a link to a repository for the described methodology. |
| Open Datasets | No | The paper describes custom-generated environments and reward sampling schemes (e.g., 'Random MDP is an environment in which... the transition matrix τ is sampled uniformly', 'Gridworld', 'Cliff'), but it does not provide concrete access information (links, DOIs, formal citations) to a publicly available or open dataset used for training. The datasets are generated for the experiments rather than being pre-existing. |
| Dataset Splits | No | The paper details the generation of various environments and the parameters varied for experiments (e.g., 'vary all hyperparameters of MDPs in a grid search manner', 'temporal discount factor γ ∈ {0.5, 0.7, 0.9, 0.99}'), but it does not specify explicit train/validation/test dataset splits with percentages or sample counts. The experiments involve simulating policies in generated environments rather than using a fixed dataset partitioned for training and validation. |
| Hardware Specification | Yes | Overall, the process took about 100 hours of a c5a.16xlarge instance with 64 cores and 128 GB RAM, as well as about 100 hours of t2.2xlarge instance with 8 cores and 32 GB RAM. |
| Software Dependencies | No | The paper mentions that the 'optimisation algorithm is Value Iteration' and 'Maximal Causal Entropy (MCE)' and 'Boltzmann Rationality (BR)' are used, but it does not specify any software libraries, frameworks, or their version numbers (e.g., Python 3.x, PyTorch x.x, TensorFlow x.x). |
| Experiment Setup | Yes | Specifically, we sample: Gridworld for grid lengths n ∈ {2, 3, ..., 14}... For each of those, we also vary temporal discount factor γ ∈ {0.5, 0.7, 0.9, 0.99}, sparsity factor σ ∈ {0.1, 0.3, 0.5, 0.7, 0.9}, optimisation pressure λ = log(x) for 7 values of x evenly spaced on [0.01, 0.75] and 20 values evenly spaced on [0.8, 0.99]. Each run consists of 10 proxy rewards; we use threshold θ = 0.001 for value iteration. |
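
The Experiment Setup row above amounts to a grid search over environment and optimisation hyperparameters. The following is a minimal Python sketch of that sweep, assuming NumPy; the helper functions `make_gridworld`, `sample_proxy_reward`, and `run_value_iteration` are hypothetical placeholders standing in for the paper's environment construction, proxy-reward sampling, and value-iteration routine, not the authors' released code.

```python
# Hedged sketch of the hyperparameter sweep quoted in the Experiment Setup row.
import itertools
import numpy as np

GRID_LENGTHS = range(2, 15)                      # n in {2, 3, ..., 14}
GAMMAS = [0.5, 0.7, 0.9, 0.99]                   # temporal discount factor
SPARSITIES = [0.1, 0.3, 0.5, 0.7, 0.9]           # sparsity factor
# Optimisation pressure lambda = log(x): 7 x-values on [0.01, 0.75], 20 on [0.8, 0.99].
PRESSURES = np.log(np.concatenate([np.linspace(0.01, 0.75, 7),
                                   np.linspace(0.80, 0.99, 20)]))
N_PROXIES = 10                                   # proxy rewards per run
VI_THRESHOLD = 1e-3                              # value-iteration stopping threshold


def run_sweep(make_gridworld, sample_proxy_reward, run_value_iteration):
    """Enumerate the full grid; callers supply the environment and VI routines."""
    for n, gamma, sigma, lam in itertools.product(
            GRID_LENGTHS, GAMMAS, SPARSITIES, PRESSURES):
        env, true_reward = make_gridworld(n, sparsity=sigma)
        for _ in range(N_PROXIES):
            proxy_reward = sample_proxy_reward(true_reward)
            run_value_iteration(env, proxy_reward, gamma=gamma,
                                pressure=lam, threshold=VI_THRESHOLD)
```

Enumerating the grid this way makes the scale of the reported sweep concrete: 13 grid lengths × 4 discount factors × 5 sparsity values × 27 optimisation pressures, with 10 proxy rewards per configuration.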