Inverse Preference Learning: Preference-based RL without a Reward Function
Authors: Joey Hejna, Dorsa Sadigh
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Across a suite of continuous control and robotics benchmarks, IPL attains competitive performance compared to more complex approaches that leverage transformer-based and non-Markovian reward functions while having fewer algorithmic hyperparameters and learned network parameters. Our code is publicly released. Experimentally, we find that even though IPL does not explicitly learn a reward function, it achieves competitive performance with complicated Transformer-based reward learning techniques on offline preference-based RL benchmarks with real-human feedback. |
| Researcher Affiliation | Academia | Joey Hejna, Stanford University, jhejna@cs.stanford.edu; Dorsa Sadigh, Stanford University, dorsa@cs.stanford.edu |
| Pseudocode | Yes | Algorithm 1: IPL Algorithm (XQL Variant); Algorithm 2: IPL Algorithm (IQL Variant); Algorithm 3: IPL Algorithm (AWAC Variant). (A minimal sketch of the preference objective these variants share appears after the table.) |
| Open Source Code | Yes | Our code is publicly released at https://github.com/jhejna/inverse-preference-learning |
| Open Datasets | Yes | We compare IPL to other offline preference-based RL approaches on D4RL Gym Locomotion [17] and Robosuite robotics [36] datasets with real-human preference data from Kim et al. [28]. (A loading sketch for the D4RL data appears after the table.) |
| Dataset Splits | No | In offline preference-based RL, we assume access to a fixed offline dataset D_o = {(s, a, s′)} of interactions without reward labels, generated by a reference policy μ(a\|s), in addition to the preference dataset D_p. (Explanation: The paper describes the types of datasets used, D_p and D_o, but does not provide specific train/validation/test splits (e.g., percentages, sample counts, or explicit instructions for partitioning the data) for reproducibility.) |
| Hardware Specification | No | Given that we run experiments using MLPs, all of our experiments were run on CPU compute resources. Each seed for each method requires one CPU core and 8 GB of memory. (Explanation: The paper mentions "CPU compute resources" and per-seed memory, but does not provide specific details such as the CPU model or clock speed of the hardware used.) |
| Software Dependencies | No | The paper mentions "Optimizer Adam" in its hyperparameter tables but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | Common Hyperparameters (Table 5): Learning Rate 0.0003, Optimizer Adam, Beta 3.0, Tau 0.7, D_o Batch Size 256, D_p Batch Size 8, Training Steps 1M. Common Hyperparameters (Table 6): Learning Rate 0.0003, Optimizer Adam, Beta 4.0, Tau 0.7, D_p Batch Size 16, Training Steps 200k. (These are transcribed as a config sketch after the table.) |
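
The following is a minimal sketch of the objective the three pseudocode variants share: rewards are never learned explicitly but are implied by the Q-function through the inverse Bellman operator r(s, a) = Q(s, a) − γV(s′), and a Bradley-Terry preference loss is applied to those implied rewards. The names `q_net` and `v_net` and the segment batch layout are assumptions for illustration; the variants' value-function fitting (e.g., expectile or Gumbel regression) and Q regularization are omitted.

```python
import torch.nn.functional as F

def implied_rewards(q_net, v_net, obs, act, next_obs, gamma=0.99):
    # Inverse Bellman operator: r(s, a) = Q(s, a) - gamma * V(s').
    # obs/act/next_obs are (B, T, dim) tensors; q_net and v_net are assumed
    # to return per-step values of shape (B, T).
    return q_net(obs, act) - gamma * v_net(next_obs)

def preference_loss(q_net, v_net, seg1, seg2, labels, gamma=0.99):
    # seg1/seg2: dicts with "obs", "act", "next_obs" entries of shape
    # (B, T, dim) holding the two behavior segments of each comparison;
    # labels: (B,) floats, 1.0 where seg1 was preferred by the annotator.
    r1 = implied_rewards(q_net, v_net, seg1["obs"], seg1["act"], seg1["next_obs"], gamma)
    r2 = implied_rewards(q_net, v_net, seg2["obs"], seg2["act"], seg2["next_obs"], gamma)
    # Bradley-Terry model: P(seg1 > seg2) = sigmoid(sum_t r1 - sum_t r2).
    logits = r1.sum(dim=1) - r2.sum(dim=1)
    return F.binary_cross_entropy_with_logits(logits, labels)
```

Because the preference loss backpropagates directly into Q, no separate reward network or reward-model hyperparameters are needed, which is the source of the parameter savings quoted in the Research Type row.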
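
The D4RL portion of the data is publicly downloadable through the `d4rl` package; below is a hedged loading sketch. The task name `hopper-medium-replay-v2` is illustrative rather than a dataset the paper is confirmed to use, and the real-human preference labels from Kim et al. [28] are distributed separately and are not fetched here.

```python
import gym
import d4rl  # importing d4rl registers the D4RL environments with gym

# Task name is illustrative; the paper evaluates on D4RL Gym Locomotion tasks.
env = gym.make("hopper-medium-replay-v2")
data = d4rl.qlearning_dataset(env)  # downloads the dataset on first use

# IPL discards the logged reward labels: transitions (s, a, s') like these
# form D_o, while the human preference comparisons from Kim et al. [28] form D_p.
obs, act, next_obs = data["observations"], data["actions"], data["next_observations"]
```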
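
Finally, a transcription of the common hyperparameters from Tables 5 and 6 as a plain config. Beta and Tau are read here as the IQL-style advantage temperature and value expectile, consistent with the variants in the pseudocode row; the key names are illustrative, not the paper's.

```python
# Common hyperparameters from Table 5 (D4RL Gym Locomotion, human preferences).
GYM_LOCOMOTION = {
    "learning_rate": 3e-4,
    "optimizer": "Adam",
    "beta": 3.0,                 # advantage temperature (IQL-style, assumed)
    "tau": 0.7,                  # value expectile (IQL-style, assumed)
    "offline_batch_size": 256,   # samples per step from D_o
    "preference_batch_size": 8,  # samples per step from D_p
    "training_steps": 1_000_000,
}

# Common hyperparameters from Table 6 (Robosuite, human preferences).
ROBOSUITE = {
    "learning_rate": 3e-4,
    "optimizer": "Adam",
    "beta": 4.0,
    "tau": 0.7,
    "preference_batch_size": 16,
    "training_steps": 200_000,
}
```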