Reinforcement Learning with Action-Free Pre-Training from Videos
Authors: Younggyo Seo, Kimin Lee, Stephen L James, Pieter Abbeel
ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We designed our experiments to investigate the following: Can APV improve the sample-efficiency of vision-based RL in robotic manipulation tasks by performing action-free pre-training on videos from different domains? Can representations pre-trained on videos from manipulation tasks transfer to locomotion tasks? How does APV compare to a naïve fine-tuning scheme? What is the contribution of each of the proposed techniques in APV? How do pre-trained representations qualitatively differ from randomly initialized representations? How does APV perform when additional in-domain videos or real-world natural videos are available? Figure 5 shows the learning curves of APV pre-trained using the RLBench videos on six robotic manipulation tasks from Meta-world. We find that APV consistently outperforms DreamerV2 in terms of sample-efficiency in all considered tasks. |
| Researcher Affiliation | Collaboration | ¹KAIST ²Work done while visiting UC Berkeley ³UC Berkeley ⁴Now at Google Research. Correspondence to: Younggyo Seo <younggyo.seo@kaist.ac.kr>. |
| Pseudocode | No | The paper includes mathematical formulations and architectural diagrams (e.g., Figure 2 and 3) but does not provide any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/younggyoseo/apv. |
| Open Datasets | Yes | We first evaluate APV on various vision-based robotic manipulation tasks from Meta-world (Yu et al., 2020). We use videos collected in robotic manipulation tasks from RLBench (James et al., 2020) as pre-training data. We also consider widely used robotic locomotion tasks from DeepMind Control Suite (Tassa et al., 2020). Specifically, for real-world videos, we utilize the Something-Something-V2 dataset (Goyal et al., 2017). |
| Dataset Splits | No | The paper discusses training and testing, and mentions 'validation' only in the context of the DreamerV2 behavior-learning scheme and architectural components ('validation' RSSM), but it does not explicitly describe a distinct validation dataset split (e.g., percentages, sample counts, or a partitioning methodology) needed to reproduce the experiments. |
| Hardware Specification | Yes | We use a single Nvidia RTX3090 GPU and 10 CPU cores for each training run. |
| Software Dependencies | No | The paper states: 'We build our framework on top of the official implementation of DreamerV2, which is based on TensorFlow (Abadi et al., 2016).' However, it does not provide a specific version number for TensorFlow or any other key software dependencies required for replication. |
| Experiment Setup | Yes | For newly introduced hyperparameters, we use βz = 1.0 for pre-training, and βz = 0, β = 1.0 for fine-tuning. We use τ = 5 for computing the intrinsic bonus. To make the scale of the intrinsic bonus approximately 10% of the extrinsic reward, we normalize the intrinsic reward and use λ = 0.1, 1.0 for manipulation and locomotion tasks, respectively. We find that increasing the hidden size of dense layers and the model state dimension from 200 to 1024 improves the performance of both APV and DreamerV2. We use T = 25, 50 for manipulation and locomotion tasks, respectively, during pre-training. Unless otherwise specified, we use the default hyperparameters of DreamerV2. |
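
For easier scanning, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. This is an illustrative summary only, written as a plain Python dictionary: the key names (e.g., `beta_z_pretrain`, `intrinsic_tau`) are our own shorthand and do not come from the official APV/DreamerV2 codebase, and the values are simply transcribed from the quote above.

```python
# Illustrative summary of the APV hyperparameters quoted above.
# Key names are shorthand, not identifiers from the official codebase.
apv_hparams = {
    # Loss coefficients reported in the paper
    "beta_z_pretrain": 1.0,   # beta_z used during action-free pre-training
    "beta_z_finetune": 0.0,   # beta_z used during fine-tuning
    "beta_finetune": 1.0,     # beta used during fine-tuning

    # Intrinsic bonus settings
    "intrinsic_tau": 5,       # tau used when computing the intrinsic bonus
    "intrinsic_lambda": {     # lambda, chosen so the bonus is ~10% of the extrinsic reward
        "manipulation": 0.1,
        "locomotion": 1.0,
    },

    # Model capacity (increased from the DreamerV2 default of 200)
    "dense_hidden_size": 1024,
    "model_state_dim": 1024,

    # Pre-training sequence length T, per task suite
    "pretrain_seq_len": {
        "manipulation": 25,
        "locomotion": 50,
    },
}

if __name__ == "__main__":
    # Print the settings; all other hyperparameters follow DreamerV2 defaults.
    for name, value in apv_hparams.items():
        print(f"{name}: {value}")
```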