Listwise Reward Estimation for Offline Preference-based Reinforcement Learning
Authors: Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on the dataset demonstrate the superiority of LiRE, i.e., outperforming state-of-the-art baselines even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise. |
| Researcher Affiliation | Academia | Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon — Department of Electrical and Computer Engineering, Seoul National University; ASRI/INMC/IPAI/AIIS, Seoul National University. Correspondence to: Taesup Moon <tsmoon@snu.ac.kr>. |
| Pseudocode | Yes | The RLT construction algorithm is based on a binary insertion sort and the pseudocode is summarized in Algorithm 1 (Appendix). |
| Open Source Code | Yes | Our code is available at https://github.com/chwoong/LiRE |
| Open Datasets | Yes | To that end, we newly collect the offline PbRL dataset with Meta-World (Yu et al., 2020) and DeepMind Control Suite (DMControl) (Tassa et al., 2018) following the protocols of previous work... |
| Dataset Splits | No | The paper describes the collection of datasets (medium-replay, medium-expert) and their use in offline RL. However, it does not provide explicit training, validation, and test splits (e.g., as percentages or specific sample counts) of these newly collected datasets needed for reproduction. |
| Hardware Specification | Yes | We use a single NVIDIA RTX A5000 GPU and 32 CPU cores (AMD EPYC 7513 @ 2.60GHz) in our experiments. |
| Software Dependencies | No | The paper mentions software like 'CORL (Tarasov et al., 2023)', 'PEBBLE (Lee et al., 2021b)', 'DPPO', and 'IPL', and specifies optimizers like 'Adam'. However, it does not provide specific version numbers for these software components or libraries required for reproducibility. |
| Experiment Setup | Yes | Table 18: Hyperparameters of the reward model and the baselines. |