Listwise Reward Estimation for Offline Preference-based Reinforcement Learning
Authors: Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our extensive experiments on the dataset demonstrate the superiority of LiRE, i.e., outperforming state-of-the-art baselines even with modest feedback budgets and enjoying robustness with respect to the number of feedbacks and feedback noise. |
| Researcher Affiliation | Academia | Heewoong Choi, Sangwon Jung, Hongjoon Ahn, Taesup Moon — Department of Electrical and Computer Engineering, Seoul National University; ASRI/INMC/IPAI/AIIS, Seoul National University. Correspondence to: Taesup Moon <tsmoon@snu.ac.kr>. |
| Pseudocode | Yes | The RLT construction algorithm is based on a binary insertion sort and the pseudocode is summarized in Algorithm 1 (Appendix). |
| Open Source Code | Yes | Our code is available at https://github.com/chwoong/LiRE |
| Open Datasets | Yes | To that end, we newly collect the offline PbRL dataset with Meta-World (Yu et al., 2020) and DeepMind Control Suite (DMControl) (Tassa et al., 2018) following the protocols of previous work... |
| Dataset Splits | No | The paper describes the collection of datasets (medium-replay, medium-expert) and their use in offline RL. However, it does not provide explicit training, validation, and test splits (e.g., as percentages or specific sample counts) of these newly collected datasets needed for reproduction. |
| Hardware Specification | Yes | We use a single NVIDIA RTX A5000 GPU and 32 CPU cores (AMD EPYC 7513 @ 2.60GHz) in our experiments. |
| Software Dependencies | No | The paper mentions software like 'CORL (Tarasov et al., 2023)', 'PEBBLE (Lee et al., 2021b)', 'DPPO', and 'IPL', and specifies optimizers like 'Adam'. However, it does not provide specific version numbers for these software components or libraries required for reproducibility. |
| Experiment Setup | Yes | Table 18: Hyperparameters of the reward model and the baselines. |