MIMO Is All You Need: A Strong Multi-in-Multi-Out Baseline for Video Prediction

Authors: Shuliang Ning, Mengcheng Lan, Yanran Li, Chaofeng Chen, Qian Chen, Xunlai Chen, Xiaoguang Han, Shuguang Cui

AAAI 2023

Reproducibility assessment. Each entry below lists the variable, the result, and the supporting LLM response.
Research Type (Experimental): Surprisingly, our empirical studies reveal that a simple MIMO model can outperform the state-of-the-art work by a large margin, much more than expected, especially in dealing with long-term error accumulation. After exploring a number of designs, we propose a new MIMO architecture that extends the pure Transformer with local spatio-temporal blocks and a new multi-output decoder, namely MIMO-VP, to establish a new standard in video prediction. We evaluate our model on four highly competitive benchmarks. Extensive experiments show that our model wins 1st place on all the benchmarks with remarkable performance gains and surpasses the best SISO model in all aspects, including efficiency, quantity, and quality.
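The response describes a multi-in-multi-out (MIMO) predictor that consumes all input frames at once and emits all future frames in a single forward pass, instead of feeding each prediction back in autoregressively. Below is a minimal, hedged PyTorch sketch of that idea; the class name `MIMOPredictor`, the layer sizes, and the learned per-frame queries are illustrative assumptions, not the authors' MIMO-VP implementation (which additionally uses local spatio-temporal blocks and a specialized multi-output decoder).

```python
import torch
import torch.nn as nn

class MIMOPredictor(nn.Module):
    """Minimal MIMO sketch: encode all N input frames jointly and decode
    all M future frames in parallel, so no prediction is re-consumed as
    input. Dimensions are illustrative; positional encodings are omitted
    for brevity. This is NOT the authors' MIMO-VP architecture."""

    def __init__(self, in_frames=10, out_frames=10, hw=64 * 64, d_model=256):
        super().__init__()
        self.embed = nn.Linear(hw, d_model)  # per-frame pixel embedding
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        # One learned query per output frame, decoded in parallel.
        self.queries = nn.Parameter(torch.randn(out_frames, d_model))
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
        self.head = nn.Linear(d_model, hw)  # project back to pixels

    def forward(self, frames):  # frames: (B, N, H*W)
        memory = self.encoder(self.embed(frames))  # encode all inputs jointly
        q = self.queries.unsqueeze(0).expand(frames.size(0), -1, -1)
        out = self.decoder(q, memory)  # all M outputs in one pass
        return self.head(out)  # (B, M, H*W)
```

Because no predicted frame is fed back as input, errors cannot compound across the output horizon, which matches the long-term advantage the response highlights over single-in-single-out (SISO) models.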
Researcher Affiliation (Collaboration): Shuliang Ning (1,2)*, Mengcheng Lan (2)*, Yanran Li (3), Chaofeng Chen (4), Qian Chen (5), Xunlai Chen (5), Xiaoguang Han (2,1), Shuguang Cui (2,1). Affiliations: (1) FNii, CUHKSZ; (2) SSE, CUHKSZ; (3) The University of Edinburgh; (4) Nanyang Technological University; (5) Shenzhen Meteorological Bureau. Emails: shuliangning@link.cuhk.edu.cn; {lanmengchengds, chaofenghust}@gmail.com; yli19@ed.ac.uk; {chenqian, chenxunlai}@weather.sz.gov.cn; {hanxiaoguang, shuguangcui}@cuhk.edu.cn.
Pseudocode (No): The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code (No): The paper states only that the code will be released; no repository is available.
Open Datasets (Yes): Moving MNIST (Srivastava, Mansimov, and Salakhudinov 2015) is a standard synthetic dataset for video prediction. The authors generate the training set following (Guen and Thome 2020) and adopt the widely used testing set of 10,000 sequences provided by (Srivastava, Mansimov, and Salakhudinov 2015). The Human3.6M (Ionescu et al. 2013) dataset is a real-world human pose dataset comprising 17 kinds of human actions and 3.6 million poses.
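For readers reproducing the evaluation data, a hedged loading sketch follows. It assumes the commonly distributed Moving MNIST test file `mnist_test_seq.npy` with axis order (time, sequence, height, width); both the file name and the layout are assumptions about that standard release, not details confirmed by the paper.

```python
import numpy as np

# Hedged sketch: the widely circulated Moving MNIST test set is usually
# shipped as `mnist_test_seq.npy` with shape (20, 10000, 64, 64), i.e.
# 20 frames per sequence and 10,000 sequences. File name and axis order
# are assumptions about that common release.
data = np.load("mnist_test_seq.npy")          # (T, N, H, W)
data = data.transpose(1, 0, 2, 3)             # -> (N, T, H, W)
inputs, targets = data[:, :10], data[:, 10:]  # 10 context frames, 10 to predict
print(inputs.shape, targets.shape)            # (10000, 10, 64, 64) twice
```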
Dataset Splits (No): The paper specifies training and testing sets but does not explicitly detail a validation split, its size or percentage, nor does it cite predefined validation splits. For Moving MNIST it states, "We generate the training set following (Guen and Thome 2020), and adopt the widely used testing set of 10,000 sequences provided by (Srivastava, Mansimov, and Salakhudinov 2015)," and for Human3.6M, "We use subjects S1, S5, S6, S7 and S8 for training, and subjects S9, S11 for testing."
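The Human3.6M subject split quoted above can be expressed as a small helper. The directory layout and the `split_sequences` function below are hypothetical; only the subject IDs (S1, S5, S6, S7, S8 for training; S9, S11 for testing) come from the paper.

```python
from pathlib import Path

# Subject IDs are from the paper; everything else here is a hypothetical
# sketch of how one might apply that split on disk.
TRAIN_SUBJECTS = {"S1", "S5", "S6", "S7", "S8"}
TEST_SUBJECTS = {"S9", "S11"}

def split_sequences(root="Human3.6M"):
    """Group sequence directories by the paper's subject split, assuming a
    hypothetical <root>/<subject>/<action> directory layout."""
    train, test = [], []
    for seq in Path(root).glob("*/*"):
        subject = seq.parent.name
        if subject in TRAIN_SUBJECTS:
            train.append(seq)
        elif subject in TEST_SUBJECTS:
            test.append(seq)
    return train, test
```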
Hardware Specification (Yes): "We implemented the model with PyTorch and conducted experiments on NVIDIA V100 GPUs."
Software Dependencies (No): The paper names PyTorch as the implementation framework but does not specify version numbers for PyTorch or any other critical software dependency.
Experiment Setup (Yes): "We use Adam (Kingma and Ba 2014) optimizer with 0.0005 learning rate and L1+L2 loss to train our model."
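The reported setup (Adam, learning rate 0.0005, combined L1+L2 loss) translates directly into a short training-step sketch. The 1:1 weighting of the two loss terms and the stand-in model are assumptions; the paper names only the optimizer, the learning rate, and the combined loss.

```python
import torch
import torch.nn as nn

# Hedged sketch of the reported setup: Adam at lr 0.0005 with an L1 + L2
# objective. The equal weighting of the two terms is an assumption; the
# paper only says "L1+L2 loss". The linear model is a stand-in, not MIMO-VP.
model = nn.Linear(64 * 64, 64 * 64)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
l1_loss, l2_loss = nn.L1Loss(), nn.MSELoss()

def training_step(inputs, targets):
    """One optimization step on a batch of (input frames, target frames)."""
    optimizer.zero_grad()
    preds = model(inputs)
    loss = l1_loss(preds, targets) + l2_loss(preds, targets)  # L1 + L2
    loss.backward()
    optimizer.step()
    return loss.item()
```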