Predicting Scene Parsing and Motion Dynamics in the Future

Authors: Xiaojie Jin, Huaxin Xiao, Xiaohui Shen, Jimei Yang, Zhe Lin, Yunpeng Chen, Zequn Jie, Jiashi Feng, Shuicheng Yan

NeurIPS 2017

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we propose a novel model to simultaneously predict scene parsing and optical flow in unobserved future video frames. To our best knowledge, this is the first attempt in jointly predicting scene parsing and motion dynamics. By exploiting this mutually beneficial relationship, our model shows significantly better parsing and motion prediction results when compared to well-established baselines and individual prediction models on the large-scale Cityscapes dataset. In addition, we also demonstrate that our model can be used to predict the steering angle of the vehicles, which further verifies the ability of our model to learn latent representations of scene dynamics. Taking Cityscapes [5] as testbed, we conduct extensive experiments to verify the effectiveness of our model in future prediction. Our model significantly improves mIoU of parsing predictions and reduces the endpoint error (EPE) of flow predictions compared to strongly competitive baselines including a warping method based on optical flow, standalone parsing prediction or flow prediction and other state-of-the-art methods [22]. We also present how to predict steering angles using the proposed model. (The mIoU and EPE metrics are sketched after the table.)
Researcher Affiliation | Collaboration | Xiaojie Jin (1), Huaxin Xiao (2), Xiaohui Shen (3), Jimei Yang (3), Zhe Lin (3), Yunpeng Chen (2), Zequn Jie (4), Jiashi Feng (2), Shuicheng Yan (5, 2). Affiliations: (1) NUS Graduate School for Integrative Science and Engineering (NGS), NUS; (2) Department of ECE, NUS; (3) Adobe Research; (4) Tencent AI Lab; (5) Qihoo 360 AI Institute.
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any explicit statement or link indicating that the source code for the described methodology is publicly available.
Open Datasets | Yes | We verify our model on the large scale Cityscapes [5] dataset which contains 2,975/500 train/val video sequences with 19 semantic classes. [5] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. arXiv preprint arXiv:1604.01685, 2016.
Dataset Splits | Yes | We verify our model on the large scale Cityscapes [5] dataset which contains 2,975/500 train/val video sequences with 19 semantic classes. The original frames are firstly downsampled to the resolution of 256×512 to accelerate training. In the flow anticipating network, we assign 19 semantic classes into three object groups which are defined as follows: MOV-OBJ including person, rider, car, truck, bus, train, motorcycle and bicycle, STA-OBJ including road, sidewalk, sky, pole, traffic light and traffic sign and OTH-OBJ including building, wall, fence, terrain and vegetation. We randomly sample 50K/5K frames from the train set for training and validation purpose. (The three-way class grouping is sketched after the table.)
Hardware Specification | Yes | All of our experiments are carried out on NVIDIA Titan X GPUs using the Caffe library.
Software Dependencies | No | The paper mentions the 'Caffe library' but does not specify its version or any other software dependencies with version numbers.
Experiment Setup | Yes | Throughout the experiments, we set the length of the input sequence as 4 frames, i.e. k = 4 in X_{t-k:t-1} and S_{t-k:t-1} (ref. Sec. 3). The original frames are firstly downsampled to the resolution of 256×512 to accelerate training. For data augmentation, we randomly crop a patch with the size of 256×256 and perform random mirror for all networks. (The stated preprocessing is sketched after the table.)
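
The two quantities reported above, mIoU for parsing predictions and endpoint error (EPE) for flow predictions, are standard metrics. The sketch below is not taken from the paper or any released code; it only illustrates, in plain NumPy, how these metrics are conventionally computed (array names, shapes, and the ignore label are assumptions).

    # Illustrative only: conventional mIoU / EPE computation, not the authors' code.
    import numpy as np

    def mean_iou(pred, gt, num_classes=19, ignore_label=255):
        """Mean intersection-over-union over the 19 Cityscapes classes."""
        valid = gt != ignore_label
        pred, gt = pred[valid], gt[valid]
        ious = []
        for c in range(num_classes):
            inter = np.logical_and(pred == c, gt == c).sum()
            union = np.logical_or(pred == c, gt == c).sum()
            if union > 0:  # skip classes absent from both maps
                ious.append(inter / union)
        return float(np.mean(ious))

    def endpoint_error(flow_pred, flow_gt):
        """Mean Euclidean distance between predicted and ground-truth flow vectors, arrays of shape (H, W, 2)."""
        return float(np.mean(np.sqrt(np.sum((flow_pred - flow_gt) ** 2, axis=-1))))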
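
The Dataset Splits row lists how the 19 semantic classes are assigned to three object groups for the flow anticipating network. A minimal sketch of that grouping as a plain Python mapping (the dictionary itself is an assumption; only the class-to-group assignment comes from the paper):

    # Three-way grouping of the 19 Cityscapes classes, as described in the paper.
    OBJECT_GROUPS = {
        "MOV-OBJ": ["person", "rider", "car", "truck", "bus", "train",
                    "motorcycle", "bicycle"],
        "STA-OBJ": ["road", "sidewalk", "sky", "pole", "traffic light",
                    "traffic sign"],
        "OTH-OBJ": ["building", "wall", "fence", "terrain", "vegetation"],
    }

    # Inverse lookup: semantic class name -> object group.
    CLASS_TO_GROUP = {cls: grp for grp, classes in OBJECT_GROUPS.items() for cls in classes}
    assert len(CLASS_TO_GROUP) == 19  # all 19 Cityscapes classes are covered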
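
The Experiment Setup row states the preprocessing: frames are downsampled to 256×512, a 256×256 patch is randomly cropped, and a random mirror is applied; the model then consumes sequences of k = 4 such preprocessed frames. The following is a hedged sketch of the per-frame steps using Pillow and NumPy; it is not the authors' Caffe data layer, and the function name and I/O conventions are assumptions.

    # Illustrative preprocessing sketch, assuming RGB frames on disk (not the authors' Caffe pipeline).
    import numpy as np
    from PIL import Image

    def preprocess(frame_path, rng=np.random):
        # Downsample to 256x512 (PIL expects (width, height)).
        img = Image.open(frame_path).convert("RGB").resize((512, 256), Image.BILINEAR)
        arr = np.asarray(img)
        # Random 256x256 crop; height is already 256, so only the width offset is sampled.
        x0 = rng.randint(0, arr.shape[1] - 256 + 1)
        patch = arr[:, x0:x0 + 256]
        # Random horizontal mirror.
        if rng.rand() < 0.5:
            patch = patch[:, ::-1]
        return patch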