MaskViT: Masked Visual Pre-Training for Video Prediction

Authors: Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, generates high-resolution videos (256×256) and can be easily adapted to perform goal-conditioned video prediction. Further, we demonstrate the benefits of inference speedup (up to 512×) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge. (Iterative decoding is sketched after this table.)
Researcher Affiliation | Collaboration | Agrim Gupta1, Stephen Tian1, Yunzhi Zhang1, Jiajun Wu1, Roberto Martín-Martín1,2,3, Li Fei-Fei1; 1Stanford University, 2UT Austin, 3Salesforce AI
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | We use an open-source implementation of VQ-GAN (https://github.com/CompVis/taming-transformers) for all our experiments. The paper mentions a project page for videos but does not provide concrete access to its own source code for MaskViT.
Open Datasets | Yes | Through experiments on several publicly available real-world video prediction datasets (Ebert et al., 2017; Geiger et al., 2013; Dasari et al., 2019)
Dataset Splits | No | The paper mentions following the evaluation protocols of prior work and using a "test set" for evaluation, but it does not provide explicit details about the train/validation/test dataset splits (e.g., percentages, sample counts, or formal citations for the specific splits used) for its experiments.
Hardware Specification | Yes | Our robot setup consists of a Sawyer robot arm with a Logitech C922 PRO consumer webcam for recording video frames at 640×480 resolution... Model inference for real robot control is performed using 8 NVIDIA RTX 3090 GPUs with a batch size of 16 per GPU.
Software Dependencies | Yes | We use PyTorch (Paszke et al., 2019) 1.7 library for implementing MaskViT.
Experiment Setup | Yes | Implementation. Our transformer model is a stack of L blocks, where each block consists of two transformer layers with attention restricted to the window size of 1×16×16 (spatial window) and T×4×4 (spatiotemporal window), unless otherwise specified. We use learnable positional embeddings, which are the sum of space and time positional embeddings. See A.1 for architecture details and hyperparameters... Table 5: Training and evaluation hyperparameters... Table 6: Hyperparameters for visual-MPC.
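
The Experiment Setup row describes the core architecture concretely enough to sketch. Below is a minimal PyTorch sketch, not the authors' released code: a stack of blocks in which each block applies two transformer layers, the first attending within 1×16×16 spatial windows and the second within T×4×4 spatiotemporal windows, with learnable space and time positional embeddings summed at the input, as stated above. The hidden size, head count, codebook size, and latent token-grid shape are illustrative assumptions rather than values taken from the paper, and the sketch targets a recent PyTorch release (batch_first/norm_first are newer than the PyTorch 1.7 noted under Software Dependencies).

import torch
import torch.nn as nn


def window_attend(x, layer, window):
    """Apply a transformer layer with attention restricted to (wt, wh, ww) windows.

    x: (B, T, H, W, C) grid of latent-token embeddings; window must divide (T, H, W).
    """
    B, T, H, W, C = x.shape
    wt, wh, ww = window
    # Partition the token grid into non-overlapping windows, one sequence per window.
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)
    x = layer(x)  # full self-attention, but only among tokens inside a window
    # Undo the window partition.
    x = x.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    return x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)


class MaskViTBlock(nn.Module):
    """Two transformer layers: spatial-window, then spatiotemporal-window attention."""

    def __init__(self, dim, heads, T, H, W):
        super().__init__()
        def layer():
            return nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, dim_feedforward=4 * dim,
                batch_first=True, norm_first=True)
        self.spatial = layer()          # window 1 x H x W
        self.spatiotemporal = layer()   # window T x 4 x 4
        self.T, self.H, self.W = T, H, W

    def forward(self, x):  # x: (B, T, H, W, C)
        x = window_attend(x, self.spatial, (1, self.H, self.W))
        x = window_attend(x, self.spatiotemporal, (self.T, 4, 4))
        return x


class MaskViTSketch(nn.Module):
    """Stack of L blocks over VQ codes, with summed space + time positional embeddings."""

    def __init__(self, L=8, dim=768, heads=8, T=16, H=16, W=16, vocab=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab + 1, dim)            # +1 for the [MASK] token id
        self.pos_time = nn.Parameter(torch.zeros(1, T, 1, 1, dim))
        self.pos_space = nn.Parameter(torch.zeros(1, 1, H, W, dim))
        self.blocks = nn.ModuleList(MaskViTBlock(dim, heads, T, H, W) for _ in range(L))
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):  # tokens: (B, T, H, W) integer VQ codes (masked = vocab)
        x = self.embed(tokens) + self.pos_time + self.pos_space
        for block in self.blocks:
            x = block(x)
        return self.head(x)  # per-token logits over the VQ codebook

Restricting attention to windows keeps the cost manageable: under these assumed dimensions a 16-frame, 16×16 token grid would otherwise mean full attention over 4,096 tokens in every layer.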
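
The up-to-512× inference speedup quoted in the Research Type row comes from iterative, non-autoregressive decoding: every masked token is predicted in one parallel forward pass per refinement step, and only the most confident predictions are committed each step, instead of generating one token per forward pass. The sketch below illustrates a generic MaskGIT-style refinement loop under assumed details (number of steps, cosine masking schedule, flat (B, N) token layout); it is not the authors' exact scheduling. Here, model is any masked-token predictor returning (B, N, vocab) logits, for example the sketch above with its (T, H, W) grid flattened.

import math
import torch


@torch.no_grad()
def iterative_decode(model, tokens, mask, steps=12, mask_id=1024):
    """Fill in masked VQ tokens over a fixed number of parallel refinement steps.

    tokens: (B, N) integer codes; positions where mask is True are treated as unknown.
    mask:   (B, N) boolean, True where a token still needs to be predicted.
    """
    for step in range(steps):
        # One parallel forward pass predicts every position at once.
        logits = model(torch.where(mask, torch.full_like(tokens, mask_id), tokens))
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Already-known tokens are never re-masked: give them infinite confidence.
        conf = torch.where(mask, conf, torch.full_like(conf, float("inf")))
        # Commit predictions at the currently masked positions.
        tokens = torch.where(mask, pred, tokens)
        # Cosine schedule: fraction of the grid that stays masked for the next step.
        frac = math.cos(math.pi / 2 * (step + 1) / steps)
        n_mask = int(frac * mask.shape[1])
        if n_mask == 0 or step == steps - 1:
            mask = torch.zeros_like(mask)
            continue
        # Re-mask the n_mask least confident positions; the "& mask" keeps context
        # tokens (never masked) permanently committed.
        idx = conf.argsort(dim=1)[:, :n_mask]
        new_mask = torch.zeros_like(mask)
        new_mask.scatter_(1, idx, True)
        mask = new_mask & mask
    return tokens

With this loop the number of network calls equals the number of refinement steps rather than the number of generated tokens, which is where the large constant-factor speedup over autoregressive decoding comes from.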