Causal Confusion and Reward Misidentification in Preference-Based Reward Learning
Authors: Jeremy Tien, Jerry Zhi-Yang He, Zackory Erickson, Anca Dragan, Daniel S. Brown
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In particular, we perform a series of sensitivity and ablation analyses on several benchmark domains where rewards learned from preferences achieve minimal test error but fail to generalize to out-of-distribution states, resulting in poor policy performance when optimized. |
| Researcher Affiliation | Academia | Jeremy Tien (University of California, Berkeley, jtien@berkeley.edu); Jerry Zhi-Yang He (University of California, Berkeley); Zackory Erickson (Carnegie Mellon University); Anca D. Dragan (University of California, Berkeley); Daniel S. Brown (University of Utah) |
| Pseudocode | Yes | Algorithm 1 Preference-Based Reward Learning |
| Open Source Code | Yes | To facilitate reproducibility and encourage future research on causal reward learning, we open-source our code and training datasets: https://sites.google.com/view/causal-reward-confusion. |
| Open Datasets | No | The paper mentions generating synthetic preference data and using standard environments (Reacher, Feeding, and Itch Scratching), but it does not provide specific links or formal citations (with author names and years) for publicly available versions of these environments or for the generated datasets themselves. Concrete access is available only through the authors' own code link, which contains both. |
| Dataset Splits | Yes | We use held-out sets of trajectories for validation and testing. |
| Hardware Specification | No | No specific hardware (e.g., GPU model, CPU type) used for running experiments is mentioned. |
| Software Dependencies | No | In practice, we use the Adam optimizer in PyTorch to learn the reward function, r_θ, and then use PPO (Schulman et al., 2017) or SAC (Haarnoja et al., 2018) for policy optimization given r_θ. No version numbers are provided for PyTorch or for the PPO/SAC implementations. |
| Experiment Setup | Yes | Hyperparameters (learning rate and weight decay) are tuned coarsely using the MEDIUM dataset size due to runtime limits and cost of computation. The tuned hyperparameters (best performance on a held-out validation set) for each environment are: Reacher: weight decay = 0.0001, lr = 0.01; Feeding: weight decay = 0.00001, lr = 0.001; Itch Scratching: weight decay = 0.001, lr = 0.001. The hyperparameters for the PPO and SAC agents are as follows (if not specified, hyperparameters are set to RLlib's default values): training batch size = 19200; number of SGD iterations = 50; SGD minibatch size = 128; lambda = 0.95; fcnet hidden layer dimensions = [100, 100]; learning starts = 1000; Q-model fcnet hiddens = [100, 100]; policy-model fcnet hiddens = [100, 100]; train batch size = 4096; actor learning rate = 3e-3; critic learning rate = 3e-3; entropy learning rate = 3e-3. |
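
To make the training setup quoted in the table more concrete, the sketch below shows a minimal preference-based reward-learning loop in PyTorch with the Adam optimizer, as the paper describes. The reward-network architecture, the trajectory/preference tensor shapes, and the Bradley-Terry cross-entropy loss are illustrative assumptions rather than the authors' exact implementation; the default learning rate and weight decay correspond to the Reacher values reported above.

```python
# Minimal sketch of a preference-based reward-learning loop (cf. Algorithm 1).
# Assumptions: an MLP reward model over states, trajectories of fixed horizon,
# and a Bradley-Terry preference likelihood. lr/weight_decay defaults match the
# Reacher settings quoted in the table; everything else is illustrative.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Per-state reward model r_theta; hidden layer sizes are an assumption."""

    def __init__(self, obs_dim: int, hidden: int = 100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # obs: (batch, horizon, obs_dim) -> predicted return per trajectory
        return self.net(obs).squeeze(-1).sum(dim=-1)


def preference_loss(reward_net, traj_a, traj_b, prefs):
    """Bradley-Terry cross-entropy; prefs[i] = 1 means traj_b is preferred."""
    returns = torch.stack([reward_net(traj_a), reward_net(traj_b)], dim=1)
    return nn.functional.cross_entropy(returns, prefs)


def train_reward(reward_net, loader, epochs=100, lr=1e-2, weight_decay=1e-4):
    # Adam with weight decay, as in the paper's stated setup (values: Reacher).
    opt = torch.optim.Adam(reward_net.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        for traj_a, traj_b, prefs in loader:  # prefs: (batch,) long tensor in {0, 1}
            loss = preference_loss(reward_net, traj_a, traj_b, prefs)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return reward_net
```

The learned reward would then be handed to an off-the-shelf PPO or SAC learner (the paper uses RLlib) for policy optimization; that stage is not reproduced here.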