Causal Confusion and Reward Misidentification in Preference-Based Reward Learning

Authors: Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca Dragan, Daniel S. Brown

ICLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In particular, we perform a series of sensitivity and ablation analyses on several benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states, resulting in poor policy performance when optimized."
Researcher Affiliation | Academia | Jeremy Tien, University of California, Berkeley (jtien@berkeley.edu); Jerry Zhi-Yang He, University of California, Berkeley; Zackory Erickson, Carnegie Mellon University; Anca D. Dragan, University of California, Berkeley; Daniel S. Brown, University of Utah
Pseudocode | Yes | "Algorithm 1 Preference-Based Reward Learning"
Open Source Code | Yes | "To facilitate reproducibility and encourage future research on causal reward learning, we open-source our code and training datasets: https://sites.google.com/view/causal-reward-confusion"
Open Datasets | No | The paper mentions generating synthetic preference data and using standard environments (Reacher, Feeding, Itch Scratching), but it does not cite or link publicly available versions of these environments or of the generated datasets; the only concrete access is through the authors' own code link, which contains both.
Dataset Splits | Yes | "We use held-out sets of trajectories for validation and testing."
Hardware Specification | No | The paper does not specify the hardware (e.g., GPU model, CPU type) used to run the experiments.
Software Dependencies | No | "In practice, we use the Adam optimizer in PyTorch to learn the reward function, r_θ, and then use PPO (Schulman et al., 2017) or SAC (Haarnoja et al., 2018) for policy optimization given r_θ." No version numbers are provided for PyTorch or the RL libraries. (A hedged sketch of this reward-learning step follows the table.)
Experiment Setup | Yes | "The learning rate and weight decay hyperparameters are tuned coarsely using the MEDIUM dataset size due to runtime limits and the cost of computation. The tuned hyperparameters (best performance on a held-out validation set) for each environment are: Reacher: weight decay = 0.0001, lr = 0.01; Feeding: weight decay = 0.00001, lr = 0.001; Itch Scratching: weight decay = 0.001, lr = 0.001. The hyperparameters for the PPO and SAC agents are as follows (if not specified, hyperparameters are set to RLlib's default values): training batch size = 19200; number of SGD iterations = 50; SGD minibatch size = 128; lambda = 0.95; fcnet hidden layer dimensions = [100, 100]; learning starts = 1000; Q model fcnet hiddens = [100, 100]; policy model fcnet hiddens = [100, 100]; train batch size = 4096; actor learning rate = 3e-3; critic learning rate = 3e-3; entropy learning rate = 3e-3." (A hedged RLlib config sketch, grouping these values by algorithm, follows the table.)
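
As a reading aid for the Software Dependencies row, the following is a minimal sketch of preference-based reward learning with the Adam optimizer in PyTorch, in the Bradley-Terry style described by the paper's quoted text. The network architecture, the variable names, and the Reacher observation size are illustrative assumptions, not the authors' released implementation; only the learning rate and weight decay values come from the table above.

```python
import torch
import torch.nn as nn

# Sketch only: a small MLP reward model over states. The architecture is an
# assumption for illustration, not the authors' released code.
class RewardNet(nn.Module):
    def __init__(self, obs_dim, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, traj):
        # traj: (T, obs_dim) tensor of states -> predicted return of the trajectory.
        return self.net(traj).sum()


def preference_loss(reward_net, traj_a, traj_b, label):
    # Bradley-Terry cross-entropy over predicted returns;
    # label = 0 if traj_a is preferred, 1 if traj_b is preferred.
    returns = torch.stack([reward_net(traj_a), reward_net(traj_b)])
    return nn.functional.cross_entropy(returns.unsqueeze(0), torch.tensor([label]))


obs_dim = 11  # assumed Reacher observation size, for illustration only
reward_net = RewardNet(obs_dim)
# lr and weight decay follow the Reacher values reported in the table above.
optimizer = torch.optim.Adam(reward_net.parameters(), lr=0.01, weight_decay=1e-4)

preference_dataset = []  # fill with (traj_a, traj_b, label) preference triples
for traj_a, traj_b, label in preference_dataset:
    optimizer.zero_grad()
    loss = preference_loss(reward_net, traj_a, traj_b, label)
    loss.backward()
    optimizer.step()
```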
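
The PPO and SAC hyperparameters in the Experiment Setup row map naturally onto RLlib's legacy dict-based configuration. The sketch below groups the listed values by algorithm; that grouping, and the Ray 1.x-era key names, are assumptions on my part, since the paper does not state library versions.

```python
# Sketch: the reported hyperparameters expressed as legacy RLlib config dicts.
# Key names assume a Ray 1.x-era RLlib; unlisted settings fall back to RLlib's
# defaults, as stated in the paper.
ppo_config = {
    "train_batch_size": 19200,
    "num_sgd_iter": 50,          # number of SGD iterations
    "sgd_minibatch_size": 128,
    "lambda": 0.95,              # GAE lambda
    "model": {"fcnet_hiddens": [100, 100]},
}

sac_config = {
    "learning_starts": 1000,
    "Q_model": {"fcnet_hiddens": [100, 100]},
    "policy_model": {"fcnet_hiddens": [100, 100]},
    "train_batch_size": 4096,
    "optimization": {
        "actor_learning_rate": 3e-3,
        "critic_learning_rate": 3e-3,
        "entropy_learning_rate": 3e-3,
    },
}
```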