Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning
Authors: Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we address the following questions about CPL: First, is CPL effective at fine-tuning policies from regret-based preferences? Second, does CPL scale to high-dimensional control problems and larger networks? Third, what ingredients of CPL are important for attaining high performance? Additional experiments with human data and details are included in the appendix. |
| Researcher Affiliation | Academia | Joey Hejna (Stanford University), Rafael Rafailov (Stanford University), Harshit Sikchi (UT Austin), Chelsea Finn (Stanford University), Scott Niekum (UMass Amherst), W. Bradley Knox (UT Austin), Dorsa Sadigh (Stanford University) |
| Pseudocode | No | The paper describes the algorithm using mathematical formulations and descriptive text but does not include structured pseudocode or algorithm blocks (see the illustrative sketch after this table). |
| Open Source Code | Yes | Our code is released at https://github.com/jhejna/cpl |
| Open Datasets | Yes | We use six tasks from the Meta World robotics benchmark (Yu et al., 2020). ... We perform additional experiments with real-human preferences. Specifically, we adopt the benchmarks from Kim et al. (2023) which use either 100 (expert) or 500 (replay) real human preferences on datasets from the D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper describes the number of training steps and evaluation procedures (e.g., 'Every 5000 steps...we run 25 evaluation episodes'), but does not explicitly detail a separate validation split or how data is partitioned into distinct training, validation, and test sets. It discusses 'rollout data' and 'synthetic preference datasets' without specifying their internal splitting for different phases of model development. |
| Hardware Specification | Yes | Table 2: Computational efficiency of each method when learning from pixels for 200k training steps on a single Titan RTX GPU. |
| Software Dependencies | No | The paper mentions using Python and various frameworks or architectures (e.g., DrQ-v2), but it does not specify exact version numbers for any software dependencies such as programming languages, libraries, or solvers. |
| Experiment Setup | Yes | Table 5: Common Meta World Hyper-parameters. Table 6: Hyper-parameters for CPL and variants. Table 7: Hyperparameters for P-IQL and SFT for Meta World. Table 8: PPO Hyperparameters. |
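Since the paper provides the algorithm only as mathematical formulations (see the Pseudocode row above), the following is a minimal, illustrative PyTorch sketch of a contrastive preference loss of the kind the paper describes. The function name `cpl_loss`, the tensor layout, and the default values for the discount `gamma`, temperature `alpha`, and conservative bias `lam` are assumptions for illustration, not the authors' implementation; the official code is at the repository linked in the Open Source Code row.

```python
import torch
import torch.nn.functional as F

def cpl_loss(logp_pos: torch.Tensor,
             logp_neg: torch.Tensor,
             gamma: float = 1.0,
             alpha: float = 0.1,
             lam: float = 0.5) -> torch.Tensor:
    """Illustrative contrastive preference loss over a batch of segment pairs.

    logp_pos, logp_neg: (batch, seg_len) per-step log-probs log pi(a_t | s_t)
        for the preferred and dispreferred segments, respectively.
    gamma: within-segment discount factor (assumed default).
    alpha: temperature scaling the policy log-probabilities (assumed default).
    lam: conservative bias (lambda < 1) down-weighting the dispreferred
        segment's score (assumed default).
    """
    # Discounted, temperature-scaled sum of log-probs over each segment.
    discounts = gamma ** torch.arange(
        logp_pos.shape[1], dtype=logp_pos.dtype, device=logp_pos.device)
    score_pos = (alpha * discounts * logp_pos).sum(dim=1)
    score_neg = (alpha * discounts * logp_neg).sum(dim=1)
    # Bradley-Terry-style contrastive objective with conservative bias lam:
    # -log( exp(s+) / (exp(s+) + exp(lam * s-)) ) = -logsigmoid(s+ - lam * s-)
    return -F.logsigmoid(score_pos - lam * score_neg).mean()
```

In use, `logp_pos` and `logp_neg` would be obtained by evaluating the current policy's log-probabilities on pairs of behavior segments drawn from a preference dataset, and the loss would be minimized directly with a standard optimizer, with no reward model or reinforcement learning loop.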