Every Frame Counts: Joint Learning of Video Segmentation and Optical Flow

Authors: Mingyu Ding, Zhe Wang, Bolei Zhou, Jianping Shi, Zhiwu Lu, Ping Luo

Venue: AAAI 2020, pp. 10713-10720

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that the proposed model makes the video semantic segmentation and optical flow estimation benefit from each other and outperforms existing methods under the same settings in both tasks. ... We evaluate our framework for video semantic segmentation on the Cityscapes (Cordts et al. 2016) and CamVid datasets (Brostow, Fauqueur, and Cipolla 2009). We also report our competitive results for optical flow estimation on the KITTI dataset (Geiger, Lenz, and Urtasun 2012). ... The optical flow performance for the KITTI dataset is measured by the average end-point-error (EPE) score (see the EPE sketch after the table). ... Implementation Details: Our framework is not limited to specific CNN architectures. ... Ablation Study: To further evaluate the effectiveness of the proposed components, i.e., the joint learning, the temporally consistent constraint, the occlusion masks, and the unlabeled data, we conduct ablation studies on both the segmentation and optical flow tasks.
Researcher Affiliation | Collaboration | Mingyu Ding (1,3), Zhe Wang (5), Bolei Zhou (4), Jianping Shi (5), Zhiwu Lu (1,2), Ping Luo (3); 1: Gaoling School of Artificial Intelligence, Renmin University of China, Beijing 100872, China; 2: Beijing Key Laboratory of Big Data Management and Analysis Methods, Beijing 100872, China; 3: The University of Hong Kong; 4: The Chinese University of Hong Kong; 5: SenseTime Research
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository.
Open Datasets | Yes | Datasets: We evaluate our framework for video semantic segmentation on the Cityscapes (Cordts et al. 2016) and CamVid datasets (Brostow, Fauqueur, and Cipolla 2009). We also report our competitive results for optical flow estimation on the KITTI dataset (Geiger, Lenz, and Urtasun 2012).
Dataset Splits | Yes | Cityscapes (Cordts et al. 2016) contains 5,000 sparsely labeled snippets collected from 50 cities in different seasons, divided into 2,975, 500, and 1,525 snippets for training, validation and testing. ... CamVid (Brostow, Fauqueur, and Cipolla 2009) ... We follow the same split as in (Kundu, Vineet, and Koltun 2016; Nilsson and Sminchisescu 2018), with 367 training images, 100 validation images and 233 test images.
Hardware Specification | Yes | We take a mini-batch size of 16 on 16 TITAN Xp GPUs with synchronous Batch Normalization. (A SyncBatchNorm sketch follows the table.)
Software Dependencies | No | The paper mentions using PSPNet and FlowNetS as baseline networks and SGD for optimization, but it does not specify versions for any software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA).
Experiment Setup | Yes | The loss weights are set to λ_cons = 10, λ_occ = 0.4 and λ_sm = 0.5 for all experiments. During training, we randomly choose ten pairs of images with Δt ∈ [1, 5] from one snippet, five of which contain images with ground truths. The training images are randomly cropped to 713 × 713. We also perform random scaling, rotation, flipping and other color augmentations. The network is optimized by SGD, where momentum and weight decay are set to 0.9 and 0.0001 respectively. We take a mini-batch size of 16 on 16 TITAN Xp GPUs with synchronous Batch Normalization. We use the poly learning rate policy and set the base learning rate to 0.01 and power to 0.9, as in (Zhao et al. 2017). The number of training iterations is set to 120K. (A training-schedule sketch follows the table.)
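
The report quotes average end-point error (EPE) as the optical flow metric on the KITTI dataset. As a reference for what that score measures, here is a minimal NumPy sketch; the function name `average_epe` and the sparse-ground-truth mask handling are illustrative assumptions, not the paper's evaluation code:

```python
import numpy as np

def average_epe(flow_pred, flow_gt, valid_mask=None):
    """Average end-point error between predicted and ground-truth flow.

    flow_pred, flow_gt: float arrays of shape (H, W, 2) holding (u, v)
    displacements. valid_mask: optional boolean (H, W) mask; KITTI ground
    truth is sparse, so pixels without a label are usually excluded.
    """
    # Per-pixel Euclidean distance between predicted and true flow vectors.
    epe = np.linalg.norm(flow_pred - flow_gt, axis=-1)
    if valid_mask is not None:
        epe = epe[valid_mask]
    return float(epe.mean())
```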
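The hardware row notes synchronous Batch Normalization across 16 GPUs. The paper does not name a framework, so the following is only a sketch of how that setting is typically enabled, assuming PyTorch with distributed data parallelism:

```python
import torch

# Placeholder backbone; the paper's actual networks are PSPNet and FlowNetS.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 3, padding=1),
    torch.nn.BatchNorm2d(64),
    torch.nn.ReLU(inplace=True),
)

# Replace every BatchNorm layer with a variant that aggregates batch
# statistics across all participating GPUs (here, the 16 TITAN Xp cards).
model = torch.nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Under torch.distributed, the model would then be wrapped as:
# model = torch.nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[rank])
```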
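The experiment-setup row fixes the loss weights, the SGD hyper-parameters, and the poly learning rate policy, lr = base_lr * (1 - iter / max_iter) ** power, following Zhao et al. 2017. Below is a minimal sketch tying these quoted numbers together, again assuming PyTorch; the stand-in model and the unit weights on the two task losses are assumptions, since the report does not quote the full loss:

```python
import torch

# Hyper-parameters quoted in the report.
LAMBDA_CONS, LAMBDA_OCC, LAMBDA_SM = 10.0, 0.4, 0.5
BASE_LR, POWER, MAX_ITER = 0.01, 0.9, 120_000

def poly_lr(iteration: int) -> float:
    """Poly policy: lr = base_lr * (1 - iter / max_iter) ** power."""
    return BASE_LR * (1.0 - iteration / MAX_ITER) ** POWER

def joint_loss(l_seg, l_flow, l_cons, l_occ, l_sm):
    # Unit weights on the segmentation and flow task losses are an
    # assumption; only the three lambda values are quoted in the report.
    return (l_seg + l_flow
            + LAMBDA_CONS * l_cons + LAMBDA_OCC * l_occ + LAMBDA_SM * l_sm)

model = torch.nn.Conv2d(3, 19, 1)  # stand-in for the PSPNet/FlowNetS pair
optimizer = torch.optim.SGD(model.parameters(), lr=BASE_LR,
                            momentum=0.9, weight_decay=1e-4)

def step_lr(optimizer: torch.optim.Optimizer, iteration: int) -> None:
    # Refresh the learning rate of every parameter group before each of
    # the 120K training iterations.
    for group in optimizer.param_groups:
        group["lr"] = poly_lr(iteration)
```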