Learning to Act from Actionless Videos through Dense Correspondences
Authors: Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, Joshua B. Tenenbaum
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day. (...) We describe the baselines and the variants of our proposed method AVDC in Section 4.1. Then, we compare AVDC to its variants and the baselines on simulated robot arm manipulation tasks in Meta-World (Figure 4a) in Section 4.2 and simulated navigation tasks in iTHOR (Figure 4b) in Section 4.3. |
| Researcher Affiliation | Academia | Po-Chen Ko (National Taiwan University); Jiayuan Mao (MIT CSAIL); Yilun Du (MIT CSAIL); Shao-Hua Sun (National Taiwan University); Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL) |
| Pseudocode | No | The paper describes its methods and architectures using prose and diagrams (e.g., Figure 2 for the overall framework, Figure 3 for network architecture), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day. (...) The code for reproducing our results is included in ./codebase AVDC directory of the attached supplementary .zip file. |
| Open Datasets | Yes | We compare AVDC to a multi-task behavioral cloning (BC) baseline given access to a set of expert actions from all videos (15,216 labeled frame-action pairs in Meta-World and 5,757 in iTHOR)... (Meta-World (Yu et al., 2019)... iTHOR (Kolve et al., 2017)... Visual Pusher tasks (Schmeckpeper et al., 2021; Zakka et al., 2022)... train our video generation model on the Bridge dataset (Ebert et al., 2022)... |
| Dataset Splits | Yes | We collect 5 demonstrations per task per camera position, resulting in total 165 videos. (...) We found that this adaptable frame sampling technique significantly improves the learning efficiency of the video diffusion model, enabling the training to finish within a single day using just 4 GPUs. (...) We thus fine-tuned the diffusion model with 20 human demonstrations collected with our setup. |
| Hardware Specification | Yes | Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day. (...) We train all models on 4 V100 GPUs with 32GB memory each. (...) We conducted our experiments (inference) on a machine with an RTX 3080Ti as GPU. |
| Software Dependencies | No | The paper mentions software components such as 'GMFlow', 'CLIP-Text encoder', 'U-Net', and techniques like 'Denoising Diffusion Implicit Models (DDIM; Song et al., 2021)', but it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For all models, we use dropout=0, num head channels=32, train/inference timesteps=100, training objective=predict v, beta schedule=cosine, loss function=l2, min snr gamma=5, learning rate=1e-4, ema update steps=10, ema decay=0.999. (...) Table 7: Comparison of configuration parameters for Meta-World, iTHOR, and Bridge. (...) Specifically, we set (T_o, T_p, T_a) = (2, 16, 8). We used a batch size of 4096 and evaluated the checkpoints at 15k, 25k and 35k training steps. |
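
As noted in the Pseudocode row, the paper describes its pipeline only in prose and figures. For orientation, the sketch below is a hedged, high-level reading of how the described components (a text- and image-conditioned video diffusion model, GMFlow dense correspondences, and a depth map) might be composed into a loop that produces actions; every object and method name here is a placeholder, not the authors' API.

```python
def avdc_rollout(first_frame, depth, task_text, video_diffusion, gmflow, controller):
    """Hedged sketch of the pipeline the paper describes in prose and figures.

    `video_diffusion`, `gmflow`, and `controller` are hypothetical objects
    standing in for the paper's video model, the GMFlow optical-flow estimator,
    and the robot interface; none of these method names come from the paper.
    """
    # 1. Synthesize a video of the task being executed, conditioned on the
    #    current observation and the language description of the task.
    frames = video_diffusion.sample(image=first_frame, text=task_text)

    actions = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # 2. Dense correspondences between consecutive synthesized frames.
        flow = gmflow.estimate_flow(prev, curr)

        # 3. Combine the 2D correspondences with depth to recover the motion of
        #    the manipulated object / end-effector and map it to an action.
        action = controller.action_from_flow(flow, depth, prev, curr)
        actions.append(action)
        controller.execute(action)
    return actions
```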
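
The Dataset Splits row credits an "adaptable frame sampling technique" for the one-day, four-GPU training budget, but the quoted excerpt does not spell out the rule. One plausible reading, shown below purely as an assumption, is to subsample each demonstration video to a fixed number of key frames spaced by accumulated pixel change rather than by a uniform temporal stride.

```python
import numpy as np

def sample_key_frames(video, num_frames=8):
    """Illustrative frame-sampling rule (an assumption, not the paper's method).

    Picks `num_frames` frames so that the accumulated mean pixel change between
    kept frames is roughly equal, instead of sampling at a fixed temporal stride.
    """
    video = np.asarray(video, dtype=np.float32)                    # (T, H, W, C)
    diffs = np.abs(np.diff(video, axis=0)).mean(axis=(1, 2, 3))    # change per step
    cum = np.concatenate([[0.0], np.cumsum(diffs)])                # accumulated change
    targets = np.linspace(0.0, cum[-1], num_frames)                # equal-change quantiles
    indices = np.clip(np.searchsorted(cum, targets), 0, len(video) - 1)
    return video[indices], indices
```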
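
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object, which is how a re-implementation would most likely consume them. The field names and grouping below are assumptions for illustration; the released codebase may organize them differently, and the (T_o, T_p, T_a) horizons and 4096 batch size quoted for a separate run are deliberately left out.

```python
from dataclasses import dataclass

@dataclass
class DiffusionTrainConfig:
    """Hedged restatement of the Experiment Setup row; field names are assumed."""
    dropout: float = 0.0
    num_head_channels: int = 32
    train_timesteps: int = 100       # also used at inference (DDIM-style sampling)
    objective: str = "predict_v"     # v-prediction parameterization
    beta_schedule: str = "cosine"
    loss_fn: str = "l2"
    min_snr_gamma: float = 5.0
    learning_rate: float = 1e-4
    ema_update_every: int = 10       # EMA update interval, in training steps
    ema_decay: float = 0.999

config = DiffusionTrainConfig()      # defaults mirror the values quoted above
```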