MaskViT: Masked Visual Pre-Training for Video Prediction

Authors: Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, generates high-resolution videos (256×256) and can be easily adapted to perform goal-conditioned video prediction. Further, we demonstrate the benefits of inference speedup (up to 512×) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge. (Iterative decoding is sketched after this table.)
Researcher Affiliation | Collaboration | Agrim Gupta1, Stephen Tian1, Yunzhi Zhang1, Jiajun Wu1, Roberto Martín-Martín1,2,3, Li Fei-Fei1; 1Stanford University, 2UT Austin, 3Salesforce AI
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We use an open-source implementation of VQ-GAN (https://github.com/CompVis/taming-transformers) for all our experiments. The paper mentions a project page for videos but does not provide concrete access to its own source code for MaskViT.
Open Datasets | Yes | Through experiments on several publicly available real-world video prediction datasets (Ebert et al., 2017; Geiger et al., 2013; Dasari et al., 2019)
Dataset Splits | No | The paper mentions following the evaluation protocols of prior work and using a "test set" for evaluation, but it does not provide explicit details about the train/validation/test dataset splits (e.g., percentages, sample counts, or formal citations for the specific splits used) for its experiments.
Hardware Specification | Yes | Our robot setup consists of a Sawyer robot arm with a Logitech C922 PRO consumer webcam for recording video frames at 640×480 resolution... Model inference for real robot control is performed using 8 NVIDIA RTX 3090 GPUs with a batch size of 16 per GPU.
Software Dependencies | Yes | We use PyTorch (Paszke et al., 2019) 1.7 library for implementing MaskViT.
Experiment Setup | Yes | Implementation. Our transformer model is a stack of L blocks, where each block consists of two transformer layers with attention restricted to the window size of 1×16×16 (spatial window) and T×4×4 (spatiotemporal window), unless otherwise specified. We use learnable positional embeddings, which are the sum of space and time positional embeddings. See A.1 for architecture details and hyperparameters... Table 5: Training and evaluation hyperparameters... Table 6: Hyperparameters for visual-MPC.
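
The Experiment Setup row describes the core architecture concretely enough to sketch. Below is a minimal PyTorch sketch, not the authors' released code: a stack of blocks in which each block applies two transformer layers, the first attending within 1×16×16 spatial windows and the second within T×4×4 spatiotemporal windows, with learnable space and time positional embeddings summed at the input, as stated above. The hidden size, head count, codebook size, and latent token-grid shape are illustrative assumptions rather than values taken from the paper, and the sketch targets a recent PyTorch release (batch_first/norm_first are newer than the PyTorch 1.7 noted under Software Dependencies).

import torch
import torch.nn as nn


def window_attend(x, layer, window):
    """Apply a transformer layer with attention restricted to (wt, wh, ww) windows.

    x: (B, T, H, W, C) grid of latent-token embeddings; window must divide (T, H, W).
    """
    B, T, H, W, C = x.shape
    wt, wh, ww = window
    # Partition the token grid into non-overlapping windows, one sequence per window.
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
    x = layer(x)  # full self-attention, but only among tokens inside a window
    # Undo the window partition.
    x = x.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    return x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)


class MaskViTBlock(nn.Module):
    """Two transformer layers: spatial-window, then spatiotemporal-window attention."""

    def __init__(self, dim, heads, T, H, W):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                batch_first=True, norm_first=True)
        self.spatial = layer()          # window 1 x H x W
        self.spatiotemporal = layer()   # window T x 4 x 4
        self.T, self.H, self.W = T, H, W

    def forward(self, x):  # x: (B, T, H, W, C)
        x = window_attend(x, self.spatial, (1, self.H, self.W))
        x = window_attend(x, self.spatiotemporal, (self.T, 4, 4))
        return x


class MaskViTSketch(nn.Module):
    """Stack of L blocks over VQ codes, with summed space + time positional embeddings."""

    def __init__(self, L=8, dim=768, heads=8, T=16, H=16, W=16, vocab=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab + 1, dim)            # +1 for the [MASK] token id
        self.pos_time = nn.Parameter(torch.zeros(1, T, 1, 1, dim))
        self.pos_space = nn.Parameter(torch.zeros(1, 1, H, W, dim))
        self.blocks = nn.ModuleList(MaskViTBlock(dim, heads, T, H, W) for _ in range(L))
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):  # tokens: (B, T, H, W) integer VQ codes (masked = vocab)
        x = self.embed(tokens) + self.pos_time + self.pos_space
        for block in self.blocks:
            x = block(x)
        return self.head(x)  # per-token logits over the VQ codebook

Restricting attention to windows keeps the cost manageable: under these assumed dimensions a 16-frame, 16×16 token grid would otherwise mean full attention over 4,096 tokens in every layer.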
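
The up-to-512× inference speedup quoted in the Research Type row comes from iterative, non-autoregressive decoding: every masked token is predicted in one parallel forward pass per refinement step, and only the most confident predictions are committed each step, instead of generating one token per forward pass. The sketch below illustrates a generic MaskGIT-style refinement loop under assumed details (number of steps, cosine masking schedule, flat (B, N) token layout); it is not the authors' exact scheduling. Here, model is any masked-token predictor returning (B, N, vocab) logits, for example the sketch above with its (T, H, W) grid flattened.

import math
import torch


@torch.no_grad()
def iterative_decode(model, tokens, mask, steps=12, mask_id=1024):
    """Fill in masked VQ tokens over a fixed number of parallel refinement steps.

    tokens: (B, N) integer codes; positions where mask is True are treated as unknown.
    mask:   (B, N) boolean, True where a token still needs to be predicted.
    """
    for step in range(steps):
        # One parallel forward pass predicts every position at once.
        logits = model(torch.where(mask, torch.full_like(tokens, mask_id), tokens))
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Already-known tokens are never re-masked: give them infinite confidence.
        conf = torch.where(mask, conf, torch.full_like(conf, float("inf")))
        # Commit predictions at the currently masked positions.
        tokens = torch.where(mask, pred, tokens)
        # Cosine schedule: fraction of the grid that stays masked for the next step.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        n_mask = int(frac * mask.shape[1])
        if n_mask == 0 or step == steps - 1:
            mask = torch.zeros_like(mask)
            continue
        # Re-mask the n_mask least confident positions; the "& mask" keeps context
        # tokens (never masked) permanently committed.
        idx = conf.argsort(dim=1)[:, :n_mask]
        new_mask = torch.zeros_like(mask)
        new_mask.scatter_(1, idx, True)
        mask = new_mask & mask
    return tokens

With this loop the number of network calls equals the number of refinement steps rather than the number of generated tokens, which is where the large constant-factor speedup over autoregressive decoding comes from.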