Parrot: Data-Driven Behavioral Priors for Reinforcement Learning
Authors: Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, Sergey Levine
ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments seek to answer: (1) Can the behavioral prior accelerate learning of new tasks? (2) How does PARROT compare to prior works that accelerate RL with demonstrations? (3) How does PARROT compare to prior methods that combine hierarchical imitation with RL? ... Our results are summarised in Figure 5. We see that PARROT is able to solve all of the tasks substantially faster and achieve substantially higher final returns than other methods. |
| Researcher Affiliation | Academia | Avi Singh, Huihan Liu, Gaoyue Zhou, Albert Yu, Nicholas Rhinehart, Sergey Levine; University of California, Berkeley. Equal contribution. Correspondence to Avi Singh (avisingh@berkeley.edu). |
| Pseudocode | Yes | Algorithm 1 RL with Behavioral Priors ... Algorithm 2 Scripted Grasping ... Algorithm 3 Scripted Pick and Place (A hedged sketch of the Algorithm 1 latent-space RL loop appears below the table.) |
| Open Source Code | No | Additional materials can be found on our project website: https://sites.google.com/view/parrot-rl (This link leads to a project website, not an explicit code repository or code release statement.) |
| Open Datasets | Yes | To collect data in diverse environments, we used 3D object models from the ShapeNet dataset (Chang et al., 2015) and the PyBullet (Coumans & Bai, 2016) object libraries. |
| Dataset Splits | No | The paper discusses training and testing, but does not explicitly provide details on validation splits (e.g., specific percentages or sample counts for validation data). |
| Hardware Specification | No | No specific hardware details (e.g., GPU/CPU models, memory specifications) are provided for running the experiments. |
| Software Dependencies | No | The paper mentions tools and algorithms like Adam optimizer, Soft Actor-Critic, and PyBullet, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | We use a learning rate of 1e-4 and the Adam (Kingma & Ba, 2015) optimizer to train the behavioral prior for 500K steps. ... Table 1: Hyperparameters for soft actor-critic (SAC): target network update period = 1000 steps; discount factor γ = 0.99; policy learning rate = 3e-4; Q-function learning rate = 3e-4; reward scale = 1.0; automatic entropy tuning = enabled; number of update steps per env step = 1. (A hedged configuration sketch collecting these values appears below the table.) |
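The Pseudocode row above refers to Algorithm 1, "RL with Behavioral Priors", in which the RL agent acts in a latent space and a pre-trained behavioral prior maps latent actions to environment actions. Below is a minimal Python sketch of that loop, assuming hypothetical `prior`, `policy`, `replay_buffer`, and `sac_update` interfaces; it is an illustration of the idea, not the authors' implementation.

```python
def rl_with_behavioral_prior(env, prior, policy, replay_buffer, sac_update,
                             num_env_steps=100_000):
    """Sketch of latent-space RL with a frozen behavioral prior.

    The RL algorithm (e.g. SAC) treats the latent variable z as its action;
    the pre-trained behavioral prior maps (z, observation) to an executable
    action. All object interfaces here are hypothetical placeholders.
    """
    obs = env.reset()
    for _ in range(num_env_steps):
        z = policy.sample(obs)                  # latent action from the RL policy
        action = prior.map_to_action(z, obs)    # behavioral prior: z -> action
        next_obs, reward, done, _ = env.step(action)

        # Store the latent z (not the mapped action) so the actor and critic
        # are trained entirely in the latent action space.
        replay_buffer.add(obs, z, reward, next_obs, done)
        sac_update(policy, replay_buffer)       # one SAC update per env step

        obs = env.reset() if done else next_obs
```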
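Similarly, the Experiment Setup row quotes the behavioral-prior training settings and the SAC hyperparameters from the paper's Table 1. The sketch below collects those reported values into configuration dictionaries; the dictionary keys are illustrative, while the values come from the quoted text.

```python
# Behavioral prior pre-training settings quoted in the Experiment Setup row.
prior_training_config = {
    "optimizer": "Adam",                     # Kingma & Ba, 2015
    "learning_rate": 1e-4,
    "training_steps": 500_000,
}

# SAC hyperparameters reported in the paper's Table 1.
sac_config = {
    "target_network_update_period": 1_000,  # steps
    "discount_factor": 0.99,                 # gamma
    "policy_learning_rate": 3e-4,
    "q_function_learning_rate": 3e-4,
    "reward_scale": 1.0,
    "automatic_entropy_tuning": True,
    "updates_per_env_step": 1,
}
```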