Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Contrastive Preference Learning: Learning from Human Feedback without Reinforcement Learning
Authors: Joey Hejna, Rafael Rafailov, Harshit Sikchi, Chelsea Finn, Scott Niekum, W. Bradley Knox, Dorsa Sadigh
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we address the following questions about CPL: First, is CPL effective at fine-tuning policies from regret-based preferences? Second, does CPL scale to high-dimensional control problems and larger networks? Third, what ingredients of CPL are important for attaining high performance? Additional experiments with human data and details are included in the appendix. |
| Researcher Affiliation | Academia | Joey Hejna (Stanford University), Rafael Rafailov (Stanford University), Harshit Sikchi (UT Austin), Chelsea Finn (Stanford University), Scott Niekum (UMass Amherst), W. Bradley Knox (UT Austin), Dorsa Sadigh (Stanford University) |
| Pseudocode | No | The paper describes the algorithm using mathematical formulations and descriptive text but does not include structured pseudocode or algorithm blocks (see the illustrative sketch after this table). |
| Open Source Code | Yes | Our code is released at https://github.com/jhejna/cpl |
| Open Datasets | Yes | We use six tasks from the Meta World robotics benchmark (Yu et al., 2020). ... We perform additional experiments with real-human preferences. Specifically, we adopt the benchmarks from Kim et al. (2023) which use either 100 (expert) or 500 (replay) real human preferences on datasets from the D4RL benchmark (Fu et al., 2020). |
| Dataset Splits | No | The paper describes the number of training steps and evaluation procedures (e.g., 'Every 5000 steps...we run 25 evaluation episodes'), but does not explicitly detail a separate validation split or how data is partitioned into distinct training, validation, and test sets. It discusses 'rollout data' and 'synthetic preference datasets' without specifying their internal splitting for different phases of model development. |
| Hardware Specification | Yes | Table 2: Computational efficiency of each method when learning from pixels for 200k training steps on a single Titan RTX GPU. |
| Software Dependencies | No | The paper mentions using Python and various frameworks or architectures (e.g., DrQ-v2), but it does not specify exact version numbers for any software dependencies such as programming languages, libraries, or solvers. |
| Experiment Setup | Yes | Table 5: Common Meta World Hyper-parameters. Table 6: Hyper-parameters for CPL and variants. Table 7: Hyperparameters for P-IQL and SFT for Meta World. Table 8: PPO Hyperparameters. |
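Since the paper provides the algorithm only as mathematical formulations (see the Pseudocode row above), the following is a minimal, illustrative PyTorch sketch of a contrastive preference loss of the kind the paper describes. The function name `cpl_loss`, the tensor layout, and the default values for the discount `gamma`, temperature `alpha`, and conservative bias `lam` are assumptions for illustration, not the authors' implementation; the official code is at the repository linked in the Open Source Code row.

```python
import torch
import torch.nn.functional as F

def cpl_loss(logp_pos: torch.Tensor,
             logp_neg: torch.Tensor,
             gamma: float = 1.0,
             alpha: float = 0.1,
             lam: float = 0.5) -> torch.Tensor:
    """Illustrative contrastive preference loss over a batch of segment pairs.

    logp_pos, logp_neg: (batch, seg_len) per-step log-probs log pi(a_t | s_t)
        for the preferred and dispreferred segments, respectively.
    gamma: within-segment discount factor (assumed default).
    alpha: temperature scaling the policy log-probabilities (assumed default).
    lam: conservative bias (lambda < 1) down-weighting the dispreferred
        segment's score (assumed default).
    """
    # Discounted, temperature-scaled sum of log-probs over each segment.
    discounts = gamma ** torch.arange(
        logp_pos.shape[1], dtype=logp_pos.dtype, device=logp_pos.device)
    score_pos = (alpha * discounts * logp_pos).sum(dim=1)
    score_neg = (alpha * discounts * logp_neg).sum(dim=1)
    # Bradley-Terry-style contrastive objective with conservative bias lam:
    # -log( exp(s+) / (exp(s+) + exp(lam * s-)) ) = -logsigmoid(s+ - lam * s-)
    return -F.logsigmoid(score_pos - lam * score_neg).mean()
```

In use, `logp_pos` and `logp_neg` would be obtained by evaluating the current policy's log-probabilities on pairs of behavior segments drawn from a preference dataset, and the loss would be minimized directly with a standard optimizer, with no reward model or reinforcement learning loop.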