The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Authors: Alexander Pan, Kush Bhatia, Jacob Steinhardt

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time." (An illustrative sketch of this setup follows the table.)
Researcher Affiliation | Academia | Alexander Pan (Caltech), Kush Bhatia (UC Berkeley), Jacob Steinhardt (UC Berkeley)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | "We provide several baseline anomaly detectors for this task and release our data at https://github.com/aypan17/reward-misspecification." The statement specifically mentions "data" rather than code for the methodology. (A generic detector sketch follows the table.)
Open Datasets | Yes | Atari Riverraid: "The Atari Riverraid environment is run on OpenAI Gym (Brockman et al., 2016)." Traffic: "We use the Flow traffic simulator, implemented by Wu et al. (2021) and Vinitsky et al. (2018), which extends the popular SUMO traffic simulator (Lopez et al., 2018)." Also: "The COVID environment, developed by Kompella et al. (2020)..." and "The glucose environment, implemented in Fox et al. (2020)..." (A loading example follows the table.)
Dataset Splits | No | The paper mentions training and testing but does not provide specific details on training/validation/test dataset splits, such as percentages or sample counts.
Hardware Specification | No | The paper does not specify any particular hardware components, such as GPU models, CPU types, or cloud instance specifications, used for running experiments.
Software Dependencies | No | The paper mentions software such as PPO, SAC, torchbeast (a PyTorch implementation of IMPALA), Flow, and SUMO, but it does not provide version numbers for any of these dependencies, nor for PyTorch itself.
Experiment Setup | No | The paper states, 'When available, we adopt the hyperparameters (except the learning rate and network size) given by the original codebase,' but it does not explicitly provide the specific hyperparameter values or detailed training configurations (e.g., learning rates, batch sizes, number of epochs) in the text.
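
To make the Research Type row concrete: the paper trains agents on environments whose reward signal is a misspecified proxy for the true objective. The snippet below is a minimal sketch of that setup, not the paper's code; ProxyRewardWrapper and proxy_reward_fn are hypothetical names, and the classic four-tuple Gym step API is assumed.

```python
import gym


class ProxyRewardWrapper(gym.Wrapper):
    """Replace an environment's true reward with a (possibly misspecified) proxy.

    Illustrative only: the paper's actual proxy rewards (e.g., for the traffic,
    COVID, and glucose environments) are defined in its own codebase.
    """

    def __init__(self, env, proxy_reward_fn):
        super().__init__(env)
        self.proxy_reward_fn = proxy_reward_fn

    def step(self, action):
        # Classic Gym API assumed: (obs, reward, done, info).
        obs, true_reward, done, info = self.env.step(action)
        # Train against the proxy, but keep the true reward in `info` so that
        # reward hacking (proxy goes up while true reward goes down) can be measured.
        proxy_reward = self.proxy_reward_fn(obs, action, info)
        info["true_reward"] = true_reward
        return obs, proxy_reward, done, info
```

An agent trained on such a wrapped environment optimizes only the proxy; comparing the logged true reward against the proxy over training is one way to surface the misalignment the paper studies.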
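The Open Source Code row quotes the paper's mention of baseline anomaly detectors for flagging reward hacking. The released detectors are not reproduced here; the function below is only a generic illustration of comparing a test policy's action distributions to a trusted policy's on the same states, with the function name, distance measure, and threshold all being assumptions.

```python
import numpy as np


def detect_anomalous_policy(trusted_action_dists, test_action_dists, threshold=0.2):
    """Flag a policy as anomalous if its per-state action distributions drift
    from a trusted policy's by more than `threshold` on average.

    Hypothetical baseline, not the paper's released detectors; uses a simple
    total-variation distance between matched action distributions.
    """
    distances = [
        0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
        for p, q in zip(trusted_action_dists, test_action_dists)
    ]
    return float(np.mean(distances)) > threshold
```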
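The Open Datasets row notes that Riverraid is run on OpenAI Gym. A rollout loop such as the following would exercise that environment; the environment ID "RiverraidNoFrameskip-v4" and the classic Gym API are assumptions, since the paper does not pin a Gym version.

```python
import gym

# Assumes an Atari-enabled Gym install (e.g. `pip install gym[atari]` plus ROMs);
# the exact environment ID and Gym version used by the authors are not stated.
env = gym.make("RiverraidNoFrameskip-v4")

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, just to exercise the loop
    obs, reward, done, info = env.step(action)
env.close()
```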