Reward Uncertainty for Exploration in Preference-based Reinforcement Learning

Authors: Xinran Liang, Katherine Shu, Kimin Lee, Pieter Abbeel

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that an exploration bonus from uncertainty in the learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms on complex robot manipulation tasks from the Meta-World benchmark, compared with other existing exploration methods that measure the novelty of state visitation.
Researcher Affiliation | Academia | Xinran Liang1, Katherine Shu1, Kimin Lee1, Pieter Abbeel1 (1 University of California, Berkeley)
Pseudocode | Yes | The full procedure of RUNE is summarized in Algorithm 1. [...] Algorithm 1 RUNE: Reward Uncertainty for Exploration. A hedged sketch of the exploration bonus this algorithm computes appears after the table.
Open Source Code | No | The paper references publicly released implementations of PEBBLE (https://github.com/pokaxpoka/B_Pref), SAC (https://github.com/denisyarats/pytorch_sac), RE3 (https://github.com/younggyoseo/RE3), and URLB (https://anonymous.4open.science/r/urlb). However, it does not state that the authors' own RUNE implementation is open-source or otherwise available.
Open Datasets | Yes | In order to verify the efficacy of exploration in preference-based RL, we focus on having an agent solve a range of complex robotic manipulation skills from Meta-World (Yu et al., 2020). A minimal environment-loading example follows the table.
Dataset Splits | No | The paper uses the Meta-World benchmark but does not specify explicit training, validation, and test dataset splits (e.g., percentages or sample counts) needed for reproduction. It mentions using an "oracle scripted teacher" for feedback.
Hardware Specification | No | The paper does not provide any specific details about the hardware (e.g., GPU models, CPU types, memory) used to run the experiments.
Software Dependencies | No | The paper mentions the Adam optimizer (Kingma & Ba, 2015) and refers to public repositories for the PEBBLE, SAC, RE3, and URLB algorithms. However, it does not specify version numbers for any software dependency (e.g., Python, PyTorch, or other libraries), which are crucial for a reproducible setup.
Experiment Setup | Yes | Table 2: Hyperparameters of the PEBBLE algorithm. Table 3: Hyperparameters of the SAC algorithm. For all methods we consider, we carefully tune a range of hyperparameters and report the best results. In particular, we consider β0 = 0.05 and ρ ∈ {0.001, 0.0001, 0.00001} for all exploration methods, and k ∈ {5, 10} for state entropy based exploration. The sweep is sketched below the table.
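As a reading aid for the Pseudocode row, here is a minimal sketch of the RUNE-style exploration bonus, assuming (as the paper describes) an ensemble of learned reward models whose mean serves as the exploitation reward and whose standard deviation serves as the uncertainty bonus, with the trade-off coefficient decaying as β_t = β_0(1 − ρ)^t. The function name `rune_reward` and the `reward_ensemble` argument are placeholders, not identifiers from the authors' code.

```python
import torch

def rune_reward(obs, act, reward_ensemble, step, beta_0=0.05, rho=0.0001):
    """Sketch of a RUNE-style reward: mean ensemble prediction plus an
    uncertainty (ensemble standard deviation) exploration bonus."""
    # Each ensemble member maps (obs, act) to a scalar reward prediction.
    preds = torch.stack([r_hat(obs, act) for r_hat in reward_ensemble], dim=0)
    r_exploit = preds.mean(dim=0)          # agreement across the ensemble
    r_explore = preds.std(dim=0)           # disagreement = reward uncertainty
    beta_t = beta_0 * (1.0 - rho) ** step  # decaying trade-off schedule
    return r_exploit + beta_t * r_explore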
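The Meta-World tasks referenced in the Open Datasets row are publicly available through the `metaworld` package. Below is a rough loading sketch assuming the MT1 single-task interface and an illustrative task name (`door-open-v2`); exact `reset`/`step` return signatures vary across metaworld and gym versions.

```python
import random
import metaworld

# Instantiate a single Meta-World manipulation task (task name is illustrative).
mt1 = metaworld.MT1('door-open-v2')
env = mt1.train_classes['door-open-v2']()
env.set_task(random.choice(mt1.train_tasks))  # sample a goal configuration

obs = env.reset()
obs, reward, done, info = env.step(env.action_space.sample())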
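The hyperparameter sweep quoted in the Experiment Setup row amounts to a small grid. The values below are taken directly from the section; the loop and the print stand-in for job launching are illustrative.

```python
from itertools import product

beta_0 = 0.05                    # fixed for all exploration methods
rhos = [0.001, 0.0001, 0.00001]  # decay rates considered
ks = [5, 10]                     # only used by state-entropy-based exploration

for rho, k in product(rhos, ks):
    config = {"beta_0": beta_0, "rho": rho, "k": k}
    print(config)  # stand-in for launching one training run per configuration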