Preference Transformer: Modeling Human Preferences using Transformers for RL
Authors: Changyeon Kim, Jongjin Park, Jinwoo Shin, Honglak Lee, Pieter Abbeel, Kimin Lee
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate that Preference Transformer can solve a variety of control tasks using real human preferences, while prior approaches fail to work. We also show that Preference Transformer can induce a well-specified reward and attend to critical events in the trajectory by automatically capturing the temporal dependencies in human decision-making. (Section 5, Experiments) |
| Researcher Affiliation | Collaboration | Changyeon Kim (KAIST), Jongjin Park (KAIST), Jinwoo Shin (KAIST), Honglak Lee (University of Michigan; LG AI Research), Pieter Abbeel (UC Berkeley), Kimin Lee (Google Research) |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available on the project website: https://sites.google.com/view/preference-transformer. |
| Open Datasets | Yes | Similar to Shin & Brown (2021), we evaluate Preference Transformer (PT) on several complex control tasks in the offline setting using D4RL (Fu et al., 2020) benchmarks and Robomimic (Mandlekar et al., 2021) benchmarks. We will also publicly release the collected offline dataset with real human preferences for benchmarks. |
| Dataset Splits | No | The paper refers to using offline datasets and collecting human preferences, but it does not specify explicit training, validation, and test dataset splits with percentages or sample counts. |
| Hardware Specification | Yes | For training and evaluating our model, we use a single NVIDIA GeForce RTX 2080 Ti GPU and 8 CPU cores (Intel Xeon CPU E5-2630 v4 @ 2.20GHz). |
| Software Dependencies | No | Our model is implemented based on a publicly available re-implementation of GPT in JAX (Frostig et al., 2018). |
| Experiment Setup | Yes | We use the hyperparameters in Table 2 for all experiments. Table 2 lists: number of layers 1; number of attention heads 4; embedding dimension 256 (causal transformer, preference attention layer); batch size 256; dropout rate (embedding, attention, residual connection) 0.1; learning rate 0.0001; optimizer AdamW (Loshchilov & Hutter, 2019); optimizer momentum β1 = 0.9, β2 = 0.99; weight decay 0.0001; warmup steps 500; total gradient steps 10K. (A hedged configuration sketch based on these values follows the table.) |
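
The hyperparameters quoted in the Experiment Setup row map directly onto a small JAX/optax training configuration. The sketch below is illustrative only: the numeric values come from Table 2 as quoted above, but the `PTConfig` container and the warmup shape (linear ramp to the peak rate, then held constant) are assumptions, not the authors' released code.

```python
# Hedged sketch of the Table 2 training configuration using optax.
# Layer counts, dimensions, learning rate, betas, weight decay, warmup
# steps, and total steps are taken from the table above; the warmup
# shape and the PTConfig dataclass are assumptions for illustration.
from dataclasses import dataclass

import optax


@dataclass
class PTConfig:                      # hypothetical container, not the authors' API
    num_layers: int = 1              # causal transformer layers
    num_heads: int = 4               # attention heads
    embed_dim: int = 256             # causal transformer / preference attention layer
    batch_size: int = 256
    dropout_rate: float = 0.1        # embedding, attention, residual connection
    learning_rate: float = 1e-4
    weight_decay: float = 1e-4
    warmup_steps: int = 500
    total_steps: int = 10_000


def make_optimizer(cfg: PTConfig) -> optax.GradientTransformation:
    # Linear warmup to the peak rate over 500 steps, then constant (assumed shape).
    schedule = optax.join_schedules(
        schedules=[
            optax.linear_schedule(0.0, cfg.learning_rate, cfg.warmup_steps),
            optax.constant_schedule(cfg.learning_rate),
        ],
        boundaries=[cfg.warmup_steps],
    )
    return optax.adamw(
        learning_rate=schedule,
        b1=0.9,
        b2=0.99,
        weight_decay=cfg.weight_decay,
    )


optimizer = make_optimizer(PTConfig())
```

The joined linear-warmup-then-constant schedule is only one plausible reading of "warmup steps 500"; a cosine decay over the remaining 9.5K gradient steps would be an equally reasonable alternative, since the paper's quoted setup does not specify the schedule shape.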