Inverse Preference Learning: Preference-based RL without a Reward Function

Authors: Joey Hejna, Dorsa Sadigh

NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Across a suite of continuous control and robotics benchmarks, IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions while having fewer algorithmic hyperparameters and learned network parameters. Our code is publicly released [1]. Experimentally, we find that even though IPL does not explicitly learn a reward function, it achieves competitive performance with complicated Transformer-based reward learning techniques on offline preference-based RL benchmarks with real-human feedback."
Researcher Affiliation | Academia | Joey Hejna, Stanford University (jhejna@cs.stanford.edu); Dorsa Sadigh, Stanford University (dorsa@cs.stanford.edu)
Pseudocode | Yes | Algorithm 1: IPL Algorithm (XQL Variant); Algorithm 2: IPL Algorithm (IQL Variant); Algorithm 3: IPL Algorithm (AWAC Variant) (see the preference-loss sketch after the table)
Open Source Code | Yes | "Our code is publicly released [1]." [1] https://github.com/jhejna/inverse-preference-learning
Open Datasets | Yes | "We compare IPL to other offline preference-based RL approaches on D4RL Gym Locomotion [17] and Robosuite robotics [36] datasets with real-human preference data from Kim et al. [28]." (see the data-loading sketch after the table)
Dataset Splits | No | "In offline preference-based RL, we assume access to a fixed offline dataset D_o = {(s, a, s')} of interactions without reward labels, generated by a reference policy μ(a|s), in addition to the preference dataset D_p." (Explanation: The paper describes the two datasets it uses, D_p and D_o, but does not provide train/validation/test splits, e.g., percentages, sample counts, or explicit partitioning instructions, needed for reproducibility.)
Hardware Specification | No | "Given that we run experiments using MLPs, all of our experiments were run on CPU compute resources. Each seed for each method requires one CPU core and 8 GB of memory." (Explanation: The paper reports CPU-only compute and per-seed memory, but does not specify the CPU model, number of cores, or clock speed of the hardware used.)
Software Dependencies | No | The paper mentions "Optimizer Adam" in its hyperparameter tables but does not provide version numbers for any software dependencies or libraries.
Experiment Setup | Yes | "Common Hyperparameters: Learning Rate 0.0003, Optimizer Adam, Beta 3.0, Tau 0.7, D_o Batch Size 256, D_p Batch Size 8, Training Steps 1 Mil." (Table 5) and "Common Hyperparameters: Learning Rate 0.0003, Optimizer Adam, Beta 4.0, Tau 0.7, D_p Batch Size 16, Training Steps 200k." (Table 6) (see the configuration sketch after the table)
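
To make the Pseudocode row concrete, the sketch below shows one way a Bradley-Terry preference loss can be written directly in terms of a Q-function, by substituting an implied per-step reward r(s, a) = Q(s, a) - gamma * V(s') so that no explicit reward network is trained. This is an assumed, simplified illustration of the reward-free idea, not a transcription of Algorithms 1-3 from the paper; all function and variable names (implied_reward, preference_loss, q_net, v_net, seg_a, seg_b) are hypothetical.

```python
import torch.nn.functional as F

def implied_reward(q_net, v_net, obs, act, next_obs, gamma=0.99):
    """Per-step reward implied by a Q-function: r(s, a) = Q(s, a) - gamma * V(s').

    Assumption for illustration: q_net and v_net return per-timestep values of
    shape (batch, segment_len), so no separate reward network is needed.
    """
    return q_net(obs, act) - gamma * v_net(next_obs)

def preference_loss(q_net, v_net, seg_a, seg_b, labels, gamma=0.99):
    """Bradley-Terry loss over paired behavior segments (hypothetical layout).

    seg_a / seg_b: dicts with tensors 'obs', 'act', 'next_obs' of shape
                   (batch, segment_len, ...); labels[i] = 1 if segment A of
                   pair i was preferred by the annotator, else 0.
    """
    # Sum implied rewards over each segment to get segment "returns".
    ret_a = implied_reward(q_net, v_net, seg_a["obs"], seg_a["act"],
                           seg_a["next_obs"], gamma).sum(dim=1)
    ret_b = implied_reward(q_net, v_net, seg_b["obs"], seg_b["act"],
                           seg_b["next_obs"], gamma).sum(dim=1)
    # Bradley-Terry model: P(A preferred over B) = sigmoid(ret_a - ret_b).
    logits = ret_a - ret_b
    return F.binary_cross_entropy_with_logits(logits, labels.float())
```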
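
For the Open Datasets and Dataset Splits rows, the snippet below sketches how the two data sources could be assembled: reward-free transitions D_o from a D4RL Gym Locomotion task via the public d4rl package, and a preference dataset D_p of human-labeled segment pairs. The D_p container layout, the segment length, and the placeholder array shapes are assumptions for illustration; the paper does not specify train/validation/test splits, so none are constructed here.

```python
import gym
import d4rl  # registers the D4RL environments on import
import numpy as np

# --- D_o: reward-free transitions from a D4RL Gym Locomotion dataset ---
env = gym.make("halfcheetah-medium-replay-v2")  # example task id
raw = d4rl.qlearning_dataset(env)               # dict of aligned (s, a, r, s', done) arrays
d_o = {
    "obs": raw["observations"],
    "act": raw["actions"],
    "next_obs": raw["next_observations"],
    # Reward labels are deliberately dropped: D_o is assumed to be reward-free.
}

# --- D_p: preference dataset of segment pairs (hypothetical container) ---
# Each entry holds two behavior segments and a binary label indicating which
# segment the human annotator preferred (1 = segment_a preferred).
# Shapes are placeholders: segment length 25 is an assumption; HalfCheetah has
# 17-dimensional observations and 6-dimensional actions.
d_p = [
    {
        "segment_a": {"obs": np.zeros((25, 17)), "act": np.zeros((25, 6))},
        "segment_b": {"obs": np.zeros((25, 17)), "act": np.zeros((25, 6))},
        "label": 1,
    }
]
```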
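
Finally, the common hyperparameters quoted in the Experiment Setup row can be collected into plain configuration dictionaries. The values below restate Tables 5 and 6 exactly as quoted above; the dictionary and key names are chosen here for readability and are not taken from the released code.

```python
# Common hyperparameters reported in the paper.
TABLE_5_COMMON_CONFIG = {
    "learning_rate": 3e-4,
    "optimizer": "Adam",
    "beta": 3.0,
    "tau": 0.7,
    "d_o_batch_size": 256,     # batch size for the reward-free transition dataset D_o
    "d_p_batch_size": 8,       # batch size for the preference dataset D_p
    "training_steps": 1_000_000,
}

TABLE_6_COMMON_CONFIG = {
    "learning_rate": 3e-4,
    "optimizer": "Adam",
    "beta": 4.0,
    "tau": 0.7,
    "d_p_batch_size": 16,
    "training_steps": 200_000,
}
```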