Listwise Reward Estimation for Offline Preference-based Reinforcement Learning

Authors: Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Our extensive experiments on the dataset demonstrate the superiority of LiRE, i.e., outperforming state-of-the-art baselines even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise."
Researcher Affiliation | Academia | "Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon (Department of Electrical and Computer Engineering, Seoul National University; ASRI/INMC/IPAI/AIIS, Seoul National University). Correspondence to: Taesup Moon <tsmoon@snu.ac.kr>."
Pseudocode | Yes | "The RLT construction algorithm is based on a binary insertion sort and the pseudocode is summarized in Algorithm 1 (Appendix)."
Open Source Code | Yes | "Our code is available at https://github.com/chwoong/LiRE"
Open Datasets | Yes | "To that end, we newly collect the offline PbRL dataset with Meta-World (Yu et al., 2020) and DeepMind Control Suite (DMControl) (Tassa et al., 2018) following the protocols of previous work..."
Dataset Splits | No | The paper describes the collection of datasets (medium-replay, medium-expert) and their use in offline RL, but it does not provide explicit training/validation/test splits (e.g., percentages or sample counts) needed for reproduction.
Hardware Specification | Yes | "We use a single NVIDIA RTX A5000 GPU and 32 CPU cores (AMD EPYC 7513 @ 2.60GHz) in our experiments."
Software Dependencies | No | The paper mentions software such as CORL (Tarasov et al., 2023), PEBBLE (Lee et al., 2021b), DPPO, and IPL, and specifies optimizers such as Adam, but it does not provide version numbers for these components or libraries.
Experiment Setup | Yes | "Table 18: Hyperparameters of the reward model and the baselines."
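The Pseudocode row quotes the paper's note that the Ranked List of Trajectories (RLT) is built with a binary insertion sort (Algorithm 1 in the paper's appendix). Below is a minimal sketch of that idea, not the paper's implementation: the function names and the hypothetical `prefer(a, b)` oracle (returning 1 if `b` is preferred, 0 if `a` is preferred, 0.5 for a tie) are assumptions for illustration. The paper's algorithm additionally groups tied trajectories; this sketch simply inserts a tied trajectory adjacently.

```python
def insert_into_rlt(rlt, new_traj, prefer):
    """Insert new_traj into the ranked list rlt (ascending preference)
    using binary search over preference queries."""
    lo, hi = 0, len(rlt)
    while lo < hi:
        mid = (lo + hi) // 2
        p = prefer(rlt[mid], new_traj)
        if p == 0.5:
            # Tie: place next to the equally-preferred trajectory.
            rlt.insert(mid, new_traj)
            return
        if p == 1:
            # new_traj is preferred over rlt[mid]: search the upper half.
            lo = mid + 1
        else:
            # rlt[mid] is preferred: search the lower half.
            hi = mid
    rlt.insert(lo, new_traj)

def build_rlt(trajectories, prefer):
    """Build a ranked list by inserting trajectories one at a time."""
    rlt = []
    for traj in trajectories:
        insert_into_rlt(rlt, traj, prefer)
    return rlt
```

Each insertion costs O(log n) preference queries, which is what lets LiRE extract many pairwise comparisons from a modest feedback budget: every pair of distinct positions in the final list implies a preference label.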