Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning
Authors: Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we address the following questions about CPL: First, is CPL effective at fine-tuning policies from regret-based preferences? Second, does CPL scale to high-dimensional control problems and larger networks? Third, what ingredients of CPL are important for attaining high performance? Additional experiments with human data and details are included in the appendix. |
| Researcher Affiliation | Academia | Joey Hejna (Stanford University, jhejna@cs.stanford.edu); Rafael Rafailov (Stanford University, rafailov@cs.stanford.edu); Harshit Sikchi (UT Austin, hsikchi@utexas.edu); Chelsea Finn (Stanford University); Scott Niekum (UMass Amherst); W. Bradley Knox (UT Austin); Dorsa Sadigh (Stanford University) |
| Pseudocode | No | The paper describes the algorithm using mathematical formulations and descriptive text but does not include structured pseudocode or algorithm blocks (see the illustrative loss sketch after this table). |
| Open Source Code | Yes | Our code is released at https://github.com/jhejna/cpl |
| Open Datasets | Yes | We use six tasks from the Meta World robotics benchmark (Yu et al., 2020). ... We perform additional experiments with real-human preferences. Specifically, we adopt the benchmarks from Kim et al. (2023) which use either 100 (expert) or 500 (replay) real human preferences on datasets from the D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper describes the number of training steps and evaluation procedures (e.g., 'Every 5000 steps...we run 25 evaluation episodes'), but it does not describe a separate validation split or how the data are partitioned into distinct training, validation, and test sets. It discusses 'rollout data' and 'synthetic preference datasets' without specifying how they are divided across phases of model development. |
| Hardware Specification | Yes | Table 2: Computational efficiency of each method when learning from pixels for 200k training steps on a single Titan RTX GPU. |
| Software Dependencies | No | The paper mentions using Python and various frameworks or architectures (e.g., DrQ-v2), but it does not specify exact version numbers for any software dependencies such as programming languages, libraries, or specific solvers. |
| Experiment Setup | Yes | Table 5: Common Meta World Hyper-parameters. Table 6: Hyper-parameters for CPL and variants. Table 7: Hyperparameters for P-IQL and SFT for Meta World. Table 8: PPO Hyperparameters. |
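
As noted in the Pseudocode row, CPL is presented in the paper only through equations and prose. For orientation, below is a minimal, hedged sketch in PyTorch of the contrastive objective at the heart of CPL: each segment is scored by the temperature-scaled sum of the policy's action log-probabilities, and the preferred segment is contrasted against the dispreferred one with a logistic loss. The function name, signature, and default values are illustrative assumptions, not the authors' released implementation (the official code is at https://github.com/jhejna/cpl).

```python
import torch
import torch.nn.functional as F


def cpl_loss(logp_pos: torch.Tensor,
             logp_neg: torch.Tensor,
             alpha: float = 0.1,
             lam: float = 0.5) -> torch.Tensor:
    """Sketch of a CPL-style loss over a batch of preference-labeled segment pairs.

    logp_pos / logp_neg: per-timestep log pi(a_t | s_t) for the preferred and
        dispreferred segments, shape (batch, segment_len).
    alpha: temperature applied to the summed log-probabilities.
    lam: conservative bias on the dispreferred segment (lam = 1 corresponds to
        plain CPL; lam < 1 to the biased CPL variant). Defaults are illustrative.
    """
    score_pos = alpha * logp_pos.sum(dim=-1)        # (batch,)
    score_neg = alpha * lam * logp_neg.sum(dim=-1)  # (batch,)
    # -log sigmoid(score_pos - score_neg), averaged over the batch.
    return F.softplus(score_neg - score_pos).mean()
```

Because the loss depends only on policy log-probabilities of the labeled segments, it can be minimized with plain supervised-learning machinery, which is the point of the paper's "without reinforcement learning" framing.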