The Effects of Reward Misspecification: Mapping and Mitigating Misaligned Models

Authors: Alexander Pan, Kush Bhatia, Jacob Steinhardt

ICLR 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "To understand how reward hacking arises, we construct four RL environments with misspecified rewards. We investigate reward hacking as a function of agent capabilities: model capacity, action space resolution, observation space noise, and training time." (An illustrative sketch of this setup follows the table.)
Researcher Affiliation | Academia | Alexander Pan (Caltech), Kush Bhatia (UC Berkeley), Jacob Steinhardt (UC Berkeley)
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled as 'Pseudocode' or 'Algorithm'.
Open Source Code | No | "We provide several baseline anomaly detectors for this task and release our data at https://github.com/aypan17/reward-misspecification." The statement specifically mentions "data" rather than code for the methodology. (A generic detector sketch follows the table.)
Open Datasets | Yes | Atari Riverraid: "The Atari Riverraid environment is run on OpenAI Gym (Brockman et al., 2016)." Traffic: "We use the Flow traffic simulator, implemented by Wu et al. (2021) and Vinitsky et al. (2018), which extends the popular SUMO traffic simulator (Lopez et al., 2018)." Also: "The COVID environment, developed by Kompella et al. (2020)..." and "The glucose environment, implemented in Fox et al. (2020)..." (A loading example follows the table.)
Dataset Splits | No | The paper mentions training and testing but does not provide specific details on training/validation/test dataset splits, such as percentages or sample counts.
Hardware Specification | No | The paper does not specify any particular hardware components, such as GPU models, CPU types, or cloud instance specifications, used for running experiments.
Software Dependencies | No | The paper mentions software such as PPO, SAC, torchbeast (a PyTorch implementation of IMPALA), Flow, and SUMO, but it does not provide version numbers for any of these dependencies, nor for PyTorch itself.
Experiment Setup | No | The paper states, 'When available, we adopt the hyperparameters (except the learning rate and network size) given by the original codebase,' but it does not explicitly provide the specific hyperparameter values or detailed training configurations (e.g., learning rates, batch sizes, number of epochs) in the text.
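
To make the Research Type row concrete: the paper trains agents on environments whose reward signal is a misspecified proxy for the true objective. The snippet below is a minimal sketch of that setup, not the paper's code; ProxyRewardWrapper and proxy_reward_fn are hypothetical names, and the classic four-tuple Gym step API is assumed.

```python
import gym


class ProxyRewardWrapper(gym.Wrapper):
    """Replace an environment's true reward with a (possibly misspecified) proxy.

    Illustrative only: the paper's actual proxy rewards (e.g., for the traffic,
    COVID, and glucose environments) are defined in its own codebase.
    """

    def __init__(self, env, proxy_reward_fn):
        super().__init__(env)
        self.proxy_reward_fn = proxy_reward_fn

    def step(self, action):
        # Classic Gym API assumed: (obs, reward, done, info).
        obs, true_reward, done, info = self.env.step(action)
        # Train against the proxy, but keep the true reward in `info` so that
        # reward hacking (proxy goes up while true reward goes down) can be measured.
        proxy_reward = self.proxy_reward_fn(obs, action, info)
        info["true_reward"] = true_reward
        return obs, proxy_reward, done, info
```

An agent trained on such a wrapped environment optimizes only the proxy; comparing the logged true reward against the proxy over training is one way to surface the misalignment the paper studies.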
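The Open Source Code row quotes the paper's mention of baseline anomaly detectors for flagging reward hacking. The released detectors are not reproduced here; the function below is only a generic illustration of comparing a test policy's action distributions to a trusted policy's on the same states, with the function name, distance measure, and threshold all being assumptions.

```python
import numpy as np


def detect_anomalous_policy(trusted_action_dists, test_action_dists, threshold=0.2):
    """Flag a policy as anomalous if its per-state action distributions drift
    from a trusted policy's by more than `threshold` on average.

    Hypothetical baseline, not the paper's released detectors; uses a simple
    total-variation distance between matched action distributions.
    """
    distances = [
        0.5 * np.abs(np.asarray(p) - np.asarray(q)).sum()
        for p, q in zip(trusted_action_dists, test_action_dists)
    ]
    return float(np.mean(distances)) > threshold
```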
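The Open Datasets row notes that Riverraid is run on OpenAI Gym. A rollout loop such as the following would exercise that environment; the environment ID "RiverraidNoFrameskip-v4" and the classic Gym API are assumptions, since the paper does not pin a Gym version.

```python
import gym

# Assumes an Atari-enabled Gym install (e.g. `pip install gym[atari]` plus ROMs);
# the exact environment ID and Gym version used by the authors are not stated.
env = gym.make("RiverraidNoFrameskip-v4")

obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # random policy, just to exercise the loop
    obs, reward, done, info = env.step(action)
env.close()
```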