Accelerating Exploration with Unlabeled Prior Data

Authors: Qiyang Li, Jason Zhang, Dibya Ghosh, Amy Zhang, Sergey Levine

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
--- | --- | ---
Research Type | Experimental | Our empirical evaluations, conducted over domains with different observation modalities (e.g., states and images), such as simulated robot navigation and arm and hand manipulation, show that our simple optimistic reward labeling strategy can utilize the unlabeled prior data effectively, often as well as the prior best approach that has access to the same prior data with labeled rewards.
Researcher Affiliation | Collaboration | UC Berkeley, UT Austin, Meta; {qcli, jason.z, dibya.ghosh}@berkeley.edu, amy.zhang@austin.utexas.edu, svlevine@eecs.berkeley.edu
Pseudocode | Yes | Algorithm 1 EXPLORE (a hedged sketch of the optimistic reward labeling step appears after this table)
Open Source Code | No | The paper states: 'Our codebase is based on the official RLPD codebase https://github.com/ikostrikov/rlpd with minor modifications.' and 'We use the open-source implementation from the authors at https://github.com/dibyaghosh/icvf_release'. While it references other open-source code, it does not explicitly state that the code for *their specific methodology* (EXPLORE) is open-sourced or provided.
Open Datasets | Yes | The environments that we evaluate our methods on are all challenging sparse reward tasks, including six D4RL Ant Maze tasks [Fu et al., 2020], three sparse-reward Adroit hand manipulation tasks [Nair et al., 2021] following the setup in RLPD [Ball et al., 2023], and three image-based robotic manipulation tasks used by COG [Singh et al., 2020b]. (A hedged dataset-loading sketch appears after this table.)
Dataset Splits | No | The paper describes how online and offline data are used, but it does not specify explicit train/validation/test splits for model training and evaluation, so these splits cannot be reproduced.
Hardware Specification | Yes | We use Tesla V100 GPU for running the experiments.
Software Dependencies | No | The paper mentions software components such as the RLPD codebase, PyTorch, and the Adam optimizer, but it does not provide specific version numbers for these dependencies, which are necessary for reproducibility.
Experiment Setup | Yes | Parameter settings: online batch size 128; offline batch size 128; discount factor γ = 0.99; optimizer Adam; learning rate 3 × 10⁻⁴; critic ensemble size 10; random critic target subset size 2 for Adroit and COG, 1 for Ant Maze; gradient-steps-to-online-data ratio (UTD) 20; network width 256; initial entropy temperature 1.0; target entropy −dim(A)/2; entropy backups False; start training after 5000 environment steps. (These values are collected into a hedged configuration sketch after this table.)
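
The pseudocode row above refers to Algorithm 1 (EXPLORE). The authoritative description is the paper itself; the snippet below is only a minimal sketch of the optimistic reward labeling idea mentioned in the abstract — relabeling unlabeled prior transitions with an upper-confidence-style reward (a learned reward estimate plus a novelty/optimism bonus) before mixing them into off-policy training. All function and variable names, and the exact form of the bonus, are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def optimistic_reward_labels(prior_obs, reward_model, novelty_bonus, alpha=1.0):
    """Hedged sketch: assign UCB-style reward labels to unlabeled prior transitions.

    prior_obs:     observations from the unlabeled prior dataset
    reward_model:  callable returning an estimated reward per observation
                   (assumed to be fit on the agent's own labeled online data)
    novelty_bonus: callable returning an optimism/novelty score per observation
                   (e.g. an RND-style prediction error); larger = less visited
    alpha:         optimism coefficient (illustrative, not taken from the paper)
    """
    r_hat = reward_model(prior_obs)      # best guess of the missing reward
    bonus = novelty_bonus(prior_obs)     # optimism term encouraging exploration
    return r_hat + alpha * bonus         # optimistic labels used in offline batches


# Illustrative usage with stand-in models (shapes only, no learning):
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.normal(size=(128, 29))                     # e.g. an Ant Maze state dim
    dummy_reward_model = lambda o: np.zeros(len(o))      # untrained placeholder
    dummy_bonus = lambda o: np.linalg.norm(o, axis=-1)   # placeholder novelty score
    labels = optimistic_reward_labels(obs, dummy_reward_model, dummy_bonus)
    print(labels.shape)                                  # (128,)
```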
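As a usage note for the open-datasets row, the D4RL Ant Maze datasets are typically fetched through the `d4rl` package; a minimal sketch is below. The specific environment ID and the step of dropping the stored rewards (to simulate reward-free prior data) are assumptions for illustration, not details quoted from the paper.

```python
import gym
import d4rl  # registers the D4RL environments on import

# One of the Ant Maze tasks; the exact IDs/versions used in the paper may differ.
env = gym.make("antmaze-large-diverse-v2")
dataset = d4rl.qlearning_dataset(env)  # dict with observations, actions, rewards, ...

# To mimic "unlabeled" prior data, the stored rewards would simply be discarded
# (an assumption about the setup, for illustration only):
unlabeled_prior = {
    "observations": dataset["observations"],
    "actions": dataset["actions"],
    "next_observations": dataset["next_observations"],
    "terminals": dataset["terminals"],
}
print({k: v.shape for k, v in unlabeled_prior.items()})
```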
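Finally, the experiment-setup values reported above can be gathered into a single configuration object for reference. The dictionary below simply restates those reported values; the key names are illustrative and do not correspond to any particular codebase.

```python
# Hyperparameters as reported in the table above; key names are illustrative.
EXPERIMENT_CONFIG = {
    "online_batch_size": 128,
    "offline_batch_size": 128,
    "discount": 0.99,
    "optimizer": "Adam",
    "learning_rate": 3e-4,
    "critic_ensemble_size": 10,
    "critic_target_subset_size": {"adroit": 2, "cog": 2, "antmaze": 1},
    "utd_ratio": 20,                 # gradient steps per online environment step
    "network_width": 256,
    "initial_entropy_temperature": 1.0,
    "target_entropy": "-dim(A)/2",   # relative to the action dimensionality
    "entropy_backups": False,
    "start_training_after_steps": 5000,
}
```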