Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning

Authors: Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section, we address the following questions about CPL: First, is CPL effective at fine-tuning policies from regret-based preferences? Second, does CPL scale to high-dimensional control problems and larger networks? Third, what ingredients of CPL are important for attaining high performance? Additional experiments with human data and details are included in the appendix.
Researcher Affiliation | Academia | Joey Hejna (Stanford University, jhejna@cs.stanford.edu); Rafael Rafailov (Stanford University, rafailov@cs.stanford.edu); Harshit Sikchi (UT Austin, hsikchi@utexas.edu); Chelsea Finn (Stanford University); Scott Niekum (UMass Amherst); W. Bradley Knox (UT Austin); Dorsa Sadigh (Stanford University)
Pseudocode | No | The paper describes the algorithm using mathematical formulations and descriptive text but does not include structured pseudocode or algorithm blocks (a hedged sketch of the objective is given below this table).
Open Source Code | Yes | Our code is released at https://github.com/jhejna/cpl
Open Datasets | Yes | We use six tasks from the Meta World robotics benchmark (Yu et al., 2020). ... We perform additional experiments with real-human preferences. Specifically, we adopt the benchmarks from Kim et al. (2023) which use either 100 (expert) or 500 (replay) real human preferences on datasets from the D4RL benchmark (Fu et al., 2020).
Dataset Splits | No | The paper describes the number of training steps and the evaluation procedure (e.g., 'Every 5000 steps...we run 25 evaluation episodes'), but it does not explicitly describe a separate validation split or how the data are partitioned into distinct training, validation, and test sets. It discusses 'rollout data' and 'synthetic preference datasets' without specifying how these are split across phases of model development.
Hardware Specification | Yes | Table 2: Computational efficiency of each method when learning from pixels for 200k training steps on a single Titan RTX GPU.
Software Dependencies | No | The paper mentions using Python and various frameworks or architectures (e.g., 'DrQ-v2'), but it does not specify exact version numbers for any software dependencies such as programming languages, libraries, or specific solvers.
Experiment Setup | Yes | Table 5: Common Meta World Hyper-parameters. Table 6: Hyper-parameters for CPL and variants. Table 7: Hyperparameters for P-IQL and SFT for Meta World. Table 8: PPO Hyperparameters.
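
As noted in the Pseudocode row above, CPL is presented through equations and prose rather than an algorithm block. The PyTorch sketch below illustrates what the core contrastive objective over preference pairs of segments could look like; the function name, tensor layout, and the default values of alpha, lambda, and gamma are illustrative assumptions made here, not the authors' released implementation (see https://github.com/jhejna/cpl for that).

```python
import torch
import torch.nn.functional as F


def cpl_loss(logp_pref, logp_rej, alpha=0.1, lam=0.5, gamma=1.0):
    """Sketch of a CPL-style contrastive loss over a batch of segment pairs.

    logp_pref, logp_rej: tensors of shape (batch, segment_len) holding
        log pi(a_t | s_t) for the preferred and rejected segments.
    alpha: temperature scaling the per-step log-probabilities.
    lam:   conservative bias weight (< 1) on the preferred segment, as in
        the paper's CPL(lambda) variant.
    gamma: discount factor applied within each segment.
    """
    steps = torch.arange(
        logp_pref.shape[1], dtype=logp_pref.dtype, device=logp_pref.device
    )
    discount = gamma ** steps
    # Discounted sums of scaled log-probabilities stand in for the (negated)
    # regret of each segment under the policy being trained.
    adv_pref = alpha * (discount * logp_pref).sum(dim=-1)
    adv_rej = alpha * (discount * logp_rej).sum(dim=-1)
    # Bradley-Terry style contrastive objective: classify the preferred segment.
    return -F.logsigmoid(lam * adv_pref - adv_rej).mean()


# Toy usage: 8 preference pairs over 64-step segments (random stand-ins for
# log-probabilities that would normally come from the policy network).
logp_pref = torch.randn(8, 64)
logp_rej = torch.randn(8, 64)
loss = cpl_loss(logp_pref, logp_rej)
```

In the biased CPL(lambda) variant described in the paper, lambda < 1 down-weights the preferred segment's summed log-probability, which pushes the loss to be reduced by raising the likelihood of preferred, in-distribution actions rather than only lowering that of rejected ones.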