Learning to Act from Actionless Videos through Dense Correspondences
Authors: Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, Joshua B. Tenenbaum
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We demonstrate the efficacy of our approach in learning policies on table-top manipulation and navigation tasks. Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day. (...) We describe the baselines and the variants of our proposed method AVDC in Section 4.1. Then, we compare AVDC to its variants and the baselines on simulated robot arm manipulation tasks in Meta-World (Figure 4a) in Section 4.2 and simulated navigation tasks in iTHOR (Figure 4b) in Section 4.3. |
| Researcher Affiliation | Academia | Po-Chen Ko (National Taiwan University); Jiayuan Mao (MIT CSAIL); Yilun Du (MIT CSAIL); Shao-Hua Sun (National Taiwan University); Joshua B. Tenenbaum (MIT BCS, CBMM, CSAIL) |
| Pseudocode | No | The paper describes its methods and architectures using prose and diagrams (e.g., Figure 2 for the overall framework, Figure 3 for network architecture), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day. (...) The code for reproducing our results is included in ./codebase AVDC directory of the attached supplementary .zip file. |
| Open Datasets | Yes | We compare AVDC to a multi-task behavioral cloning (BC) baseline given access to a set of expert actions from all videos (15,216 labeled frame-action pairs in Meta-World and 5,757 in iTHOR)... (Meta-World (Yu et al., 2019)... iTHOR (Kolve et al., 2017)... Visual Pusher tasks (Schmeckpeper et al., 2021; Zakka et al., 2022)... train our video generation model on the Bridge dataset (Ebert et al., 2022)... |
| Dataset Splits | Yes | We collect 5 demonstrations per task per camera position, resulting in total 165 videos. (...) We found that this adaptable frame sampling technique significantly improves the learning efficiency of the video diffusion model, enabling the training to finish within a single day using just 4 GPUs. (...) We thus fine-tuned the diffusion model with 20 human demonstrations collected with our setup. |
| Hardware Specification | Yes | Additionally, we contribute an open-source framework for efficient video modeling, enabling the training of high-fidelity policy models with four GPUs within a single day. (...) We train all models on 4 V100 GPUs with 32GB memory each. (...) We conducted our experiments (inference) on a machine with an RTX 3080Ti as GPU. |
| Software Dependencies | No | The paper mentions software components such as 'GMFlow', 'CLIP-Text encoder', 'U-Net', and techniques like 'Denoising Diffusion Implicit Models (DDIM; Song et al., 2021)', but it does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For all models, we use dropout=0, num head channels=32, train/inference timesteps=100, training objective=predict v, beta schedule=cosine, loss function=l2, min snr gamma=5, learning rate=1e-4, ema update steps=10, ema decay=0.999. (...) Table 7: Comparison of configuration parameters for Meta-World, iTHOR, and Bridge. (...) Specifically, we set (T_o, T_p, T_a) = (2, 16, 8). We used a batch size of 4096 and evaluated the checkpoints at 15k, 25k and 35k training steps. |
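
As noted in the Pseudocode row, the paper describes its pipeline only in prose and figures. For orientation, the sketch below is a hedged, high-level reading of how the described components (a text- and image-conditioned video diffusion model, GMFlow dense correspondences, and a depth map) might be composed into a loop that produces actions; every object and method name here is a placeholder, not the authors' API.

```python
def avdc_rollout(first_frame, depth, task_text, video_diffusion, gmflow, controller):
    """Hedged sketch of the pipeline the paper describes in prose and figures.

    `video_diffusion`, `gmflow`, and `controller` are hypothetical objects
    standing in for the paper's video model, the GMFlow optical-flow estimator,
    and the robot interface; none of these method names come from the paper.
    """
    # 1. Synthesize a video of the task being executed, conditioned on the
    #    current observation and the language description of the task.
    frames = video_diffusion.sample(image=first_frame, text=task_text)

    actions = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # 2. Dense correspondences between consecutive synthesized frames.
        flow = gmflow.estimate_flow(prev, curr)

        # 3. Combine the 2D correspondences with depth to recover the motion of
        #    the manipulated object / end-effector and map it to an action.
        action = controller.action_from_flow(flow, depth, prev, curr)
        actions.append(action)
        controller.execute(action)
    return actions
```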
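
The Dataset Splits row credits an "adaptable frame sampling technique" for the one-day, four-GPU training budget, but the quoted excerpt does not spell out the rule. One plausible reading, shown below purely as an assumption, is to subsample each demonstration video to a fixed number of key frames spaced by accumulated pixel change rather than by a uniform temporal stride.

```python
import numpy as np

def sample_key_frames(video, num_frames=8):
    """Illustrative frame-sampling rule (an assumption, not the paper's method).

    Picks `num_frames` frames so that the accumulated mean pixel change between
    kept frames is roughly equal, instead of sampling at a fixed temporal stride.
    """
    video = np.asarray(video, dtype=np.float32)                    # (T, H, W, C)
    diffs = np.abs(np.diff(video, axis=0)).mean(axis=(1, 2, 3))    # change per step
    cum = np.concatenate([[0.0], np.cumsum(diffs)])                # accumulated change
    targets = np.linspace(0.0, cum[-1], num_frames)                # equal-change quantiles
    indices = np.clip(np.searchsorted(cum, targets), 0, len(video) - 1)
    return video[indices], indices
```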
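
The hyperparameters quoted in the Experiment Setup row can be collected into a single configuration object, which is how a re-implementation would most likely consume them. The field names and grouping below are assumptions for illustration; the released codebase may organize them differently, and the (T_o, T_p, T_a) horizons and 4096 batch size quoted for a separate run are deliberately left out.

```python
from dataclasses import dataclass

@dataclass
class DiffusionTrainConfig:
    """Hedged restatement of the Experiment Setup row; field names are assumed."""
    dropout: float = 0.0
    num_head_channels: int = 32
    train_timesteps: int = 100       # also used at inference (DDIM-style sampling)
    objective: str = "predict_v"     # v-prediction parameterization
    beta_schedule: str = "cosine"
    loss_fn: str = "l2"
    min_snr_gamma: float = 5.0
    learning_rate: float = 1e-4
    ema_update_every: int = 10       # EMA update interval, in training steps
    ema_decay: float = 0.999

config = DiffusionTrainConfig()      # defaults mirror the values quoted above
```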