MaskViT: Masked Visual Pre-Training for Video Prediction
Authors: Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, generates high-resolution videos (256×256) and can be easily adapted to perform goal-conditioned video prediction. Further, we demonstrate the benefits of inference speedup (up to 512×) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge. |
| Researcher Affiliation | Collaboration | Agrim Gupta1, Stephen Tian1, Yunzhi Zhang1, Jiajun Wu1, Roberto Martín-Martín1,2,3, Li Fei-Fei1 1Stanford University, 2UT Austin, 3Salesforce AI |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | We use an open-source implementation of VQ-GAN (https://github.com/CompVis/taming-transformers) for all our experiments. The paper mentions a project page for videos but does not provide concrete access to its own source code for MaskViT. |
| Open Datasets | Yes | Through experiments on several publicly available real-world video prediction datasets (Ebert et al., 2017; Geiger et al., 2013; Dasari et al., 2019) |
| Dataset Splits | No | The paper mentions following evaluation protocols of prior work and using a "test set" for evaluation, but it does not provide explicit details about the train/validation/test dataset splits (e.g., percentages, sample counts, or formal citations for the specific splits used) for its experiments. |
| Hardware Specification | Yes | Our robot setup consists of a Sawyer robot arm with a Logitech C922 PRO consumer webcam for recording video frames at 640×480 resolution...Model inference for real robot control is performed using 8 NVIDIA RTX 3090 GPUs with a batch size of 16 per GPU. |
| Software Dependencies | Yes | We use the PyTorch (Paszke et al., 2019) 1.7 library for implementing MaskViT. |
| Experiment Setup | Yes | Implementation. Our transformer model is a stack of L blocks, where each block consists of two transformer layers with attention restricted to the window size of 1×16×16 (spatial window) and T×4×4 (spatiotemporal window), unless otherwise specified. We use learnable positional embeddings, which are the sum of space and time positional embeddings. See A.1 for architecture details and hyperparameters...Table 5: Training and evaluation hyperparameters...Table 6: Hyperparameters for visual-MPC. |
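
The Experiment Setup row pins down the block structure: each of the L blocks pairs a spatial-window attention layer (1×16×16) with a spatiotemporal-window layer (T×4×4). Below is a minimal PyTorch sketch of that window restriction over a (B, T, H, W, C) latent-token grid. The names `window_attention` and `TwoLayerWindowBlock` are illustrative, and feed-forward sublayers and the learnable positional embeddings are omitted, so this is a reading aid under stated assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


def window_attention(x, window, attn):
    """Restrict self-attention to non-overlapping (t, h, w) windows.

    x: (B, T, H, W, C); window: (wt, wh, ww), assumed to divide (T, H, W).
    """
    B, T, H, W, C = x.shape
    wt, wh, ww = window
    # Partition the token grid into windows; attend fully inside each window.
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
    x, _ = attn(x, x, x)
    # Undo the partition to restore the (B, T, H, W, C) layout.
    x = x.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    x = x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)
    return x


class TwoLayerWindowBlock(nn.Module):
    """One MaskViT-style block: a spatial-window attention layer
    (1 x 16 x 16) followed by a spatiotemporal one (T x 4 x 4)."""

    def __init__(self, dim, heads, T):
        super().__init__()
        self.spatial_win = (1, 16, 16)
        self.st_win = (T, 4, 4)
        self.norm1 = nn.LayerNorm(dim)
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C)
        x = x + window_attention(self.norm1(x), self.spatial_win, self.attn1)
        x = x + window_attention(self.norm2(x), self.st_win, self.attn2)
        return x


if __name__ == "__main__":
    block = TwoLayerWindowBlock(dim=256, heads=8, T=8)
    x = torch.randn(2, 8, 16, 16, 256)  # 8 frames of 16x16 latent tokens
    print(block(x).shape)               # torch.Size([2, 8, 16, 16, 256])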
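
The inference speedup quoted in the Research Type row comes from iterative, non-autoregressive decoding: every masked token is predicted in each forward pass, and only a shrinking, schedule-determined subset stays masked for refinement. The following is a hedged sketch of such a MaskGIT-style loop; `model`, `mask_id`, and the cosine schedule are assumptions for illustration, not the paper's exact procedure.

```python
import math
import torch


@torch.no_grad()
def iterative_decode(model, tokens, mask, mask_id, steps=12):
    """Fill every masked token in `steps` forward passes instead of one
    pass per token (the source of the quoted speedup).

    tokens: (B, N) long tensor of VQ-GAN code indices, holding `mask_id`
    at unknown positions; mask: (B, N) bool, True where unknown.
    model(tokens) is assumed to return per-token logits of shape (B, N, V).
    """
    B, N = tokens.shape
    n_unknown = mask.sum(dim=1)  # initial masked-token count per sample
    for step in range(steps):
        logits = model(tokens)                     # predict all N tokens at once
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        tokens = torch.where(mask, pred, tokens)   # commit current guesses
        # Cosine schedule: fraction of the original unknowns re-masked
        # for the next pass; reaches zero at the final step.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        for b in range(B):
            k = int(frac * n_unknown[b].item())
            c = conf[b].masked_fill(~mask[b], float("inf"))  # unknowns only
            new_mask = torch.zeros_like(mask[b])
            if k > 0:
                # Keep the k least-confident guesses masked for refinement.
                new_mask[c.topk(k, largest=False).indices] = True
            mask[b] = new_mask
        tokens = torch.where(mask, torch.full_like(tokens, mask_id), tokens)
    return tokens
```

Because the loop runs a fixed small number of passes regardless of sequence length, decoding cost scales with `steps` rather than with the number of tokens, which is what makes the large quoted speedup over token-by-token autoregressive generation plausible.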