Preference Transformer: Modeling Human Preferences using Transformers for RL
Authors: Changyeon Kim, Jongjin Park, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work. We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making. (Section 5, Experiments) |
| Researcher Affiliation | Collaboration | Changyeon Kim (KAIST), Jongjin Park (KAIST), Jinwoo Shin (KAIST), Honglak Lee (University of Michigan; LG AI Research), Pieter Abbeel (UC Berkeley), Kimin Lee (Google Research) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available on the project website: https://sites.google.com/view/preference-transformer. |
| Open Datasets | Yes | Similar to Shin & Brown (2021), we evaluate Preference Transformer (PT) on several complex control tasks in the offline setting using D4RL (Fu et al., 2020) benchmarks and Robomimic (Mandlekar et al., 2021) benchmarks. We will also publicly release the collected offline dataset with real human preferences for benchmarks. |
| Dataset Splits | No | The paper refers to using offline datasets and collecting human preferences, but it does not specify explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | For training and evaluating our model, we use a single NVIDIA GeForce RTX 2080 Ti GPU and 8 CPU cores (Intel Xeon CPU E5-2630 v4 @ 2.20GHz). |
| Software Dependencies | No | Our model is implemented based on a publicly available re-implementation of GPT in JAX (Frostig et al., 2018). |
| Experiment Setup | Yes | We use the hyperparameters in Table 2 for all experiments. Table 2 lists: number of layers 1; number of attention heads 4; embedding dimension 256 (causal transformer, preference attention layer); batch size 256; dropout rate (embedding, attention, residual connection) 0.1; learning rate 0.0001; optimizer AdamW (Loshchilov & Hutter, 2019); optimizer momentum β1 = 0.9, β2 = 0.99; weight decay 0.0001; warmup steps 500; total gradient steps 10K. (A hedged configuration sketch based on these values follows the table.) |
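
The hyperparameters quoted in the Experiment Setup row map directly onto a small JAX/optax training configuration. The sketch below is illustrative only: the numeric values come from Table 2 as quoted above, but the `PTConfig` container and the warmup shape (linear ramp to the peak rate, then held constant) are assumptions, not the authors' released code.

```python
# Hedged sketch of the Table 2 training configuration using optax.
# Layer counts, dimensions, learning rate, betas, weight decay, warmup
# steps, and total steps are taken from the table above; the warmup
# shape and the PTConfig dataclass are assumptions for illustration.
from dataclasses import dataclass

import optax


@dataclass
class PTConfig:                      # hypothetical container, not the authors' API
    num_layers: int = 1              # causal transformer layers
    num_heads: int = 4               # attention heads
    embed_dim: int = 256             # causal transformer / preference attention layer
    batch_size: int = 256
    dropout_rate: float = 0.1        # embedding, attention, residual connection
    learning_rate: float = 1e-4
    weight_decay: float = 1e-4
    warmup_steps: int = 500
    total_steps: int = 10_000


def make_optimizer(cfg: PTConfig) -> optax.GradientTransformation:
    # Linear warmup to the peak rate over 500 steps, then constant (assumed shape).
    schedule = optax.join_schedules(
        schedules=[
            optax.linear_schedule(0.0, cfg.learning_rate, cfg.warmup_steps),
            optax.constant_schedule(cfg.learning_rate),
        ],
        boundaries=[cfg.warmup_steps],
    )
    return optax.adamw(
        learning_rate=schedule,
        b1=0.9,
        b2=0.99,
        weight_decay=cfg.weight_decay,
    )


optimizer = make_optimizer(PTConfig())
```

The joined linear-warmup-then-constant schedule is only one plausible reading of "warmup steps 500"; a cosine decay over the remaining 9.5K gradient steps would be an equally reasonable alternative, since the paper's quoted setup does not specify the schedule shape.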