The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models
Authors: Alexander Pan, Kush Bhatia, Jacob Steinhardt
ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time. |
| Researcher Affiliation | Academia | Alexander Pan (Caltech), Kush Bhatia (UC Berkeley), Jacob Steinhardt (UC Berkeley) |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'. |
| Open Source Code | No | We provide several baseline anomaly detectors for this task and release our data at https://github.com/aypan17/reward-misspecification. The statement mentions releasing "data" but does not explicitly claim to release code for the methodology. |
| Open Datasets | Yes | Atari Riverraid. The Atari Riverraid environment is run on OpenAI Gym (Brockman et al., 2016). We use the Flow traffic simulator, implemented by Wu et al. (2021) and Vinitsky et al. (2018), which extends the popular SUMO traffic simulator (Lopez et al., 2018). The COVID environment, developed by Kompella et al. (2020)... The glucose environment, implemented in Fox et al. (2020)... |
| Dataset Splits | No | The paper mentions training and testing but does not provide specific details on training/validation/test dataset splits, such as percentages or sample counts. |
| Hardware Specification | No | The paper does not specify any particular hardware components such as GPU models, CPU types, or cloud instance specifications used for running experiments. |
| Software Dependencies | No | The paper mentions software like PPO, SAC, torchbeast (a PyTorch implementation of IMPALA), Flow, and SUMO, but it does not provide specific version numbers for any of these software dependencies, nor for PyTorch itself. |
| Experiment Setup | No | The paper states, 'When available, we adopt the hyperparameters (except the learning rate and network size) given by the original codebase,' but it does not explicitly provide the specific hyperparameter values or detailed training configurations (e.g., learning rates, batch sizes, number of epochs) within the text. |
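
For context on the Open Datasets entry above, here is a minimal sketch of how the Atari Riverraid environment could be loaded through OpenAI Gym. The environment ID `RiverraidNoFrameskip-v4`, the use of the pre-0.26 Gym reset/step API, and the random-action rollout are assumptions for illustration only and are not taken from the paper.

```python
# Minimal sketch: loading Atari Riverraid via OpenAI Gym (Brockman et al., 2016).
# Assumes the pre-0.26 Gym API and the Atari extras (gym[atari]) are installed;
# the environment ID "RiverraidNoFrameskip-v4" is our assumption, not stated in the paper.
import gym

env = gym.make("RiverraidNoFrameskip-v4")
obs = env.reset()
done = False
total_reward = 0.0
while not done:
    # Random policy as a placeholder for the trained agent the paper evaluates.
    action = env.action_space.sample()
    obs, reward, done, info = env.step(action)
    total_reward += reward
env.close()
print(f"Episode return under a random policy: {total_reward}")
```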