CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Authors: Yang Fu, Linjie Yang, Ding Liu, Thomas S. Huang, Humphrey Shi

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments conducted on the YouTube-VIS dataset validate the effectiveness of the proposed CompFeat. We conduct extensive experiments and ablation study on YouTube-VIS (Yang, Fan, and Xu 2019) to demonstrate the effectiveness of our proposed framework and each of the individual components.
Researcher Affiliation | Collaboration | 1 University of Illinois at Urbana-Champaign, 2 ByteDance Inc, 3 University of Oregon
Pseudocode | No | The paper includes architectural diagrams but no explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper states 'Our model is implemented based on MMDetection (Chen et al. 2019)' but does not provide an explicit statement or link to open-source code for the proposed CompFeat method.
Open Datasets | Yes | YouTube-VIS is the first and largest dataset for video instance segmentation, which is a subset of the YouTube-VOS dataset (Xu et al. 2018). ... We choose MSCOCO (Lin et al. 2014) as external data, which has a large overlap in object categories with YouTube-VIS.
Dataset Splits | Yes | Since only the validation set is available for evaluation, all results reported in this paper are evaluated on the validation set.
Hardware Specification | Yes | Our model is implemented based on MMDetection (Chen et al. 2019) and the whole framework is trained end-to-end in 12 epochs with two NVIDIA 2080TI GPUs.
Software Dependencies | No | The paper states 'Our model is implemented based on MMDetection (Chen et al. 2019)' but does not provide specific version numbers for software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | During training, the initial learning rate is set to 0.0125 and decays with a factor of 10 at epochs 8 and 11. For each input frame, we randomly select three frames from the same video, two used as support frames in the dual attention module and the other used as reference frame in the tracking module. (A sketch of this schedule and sampling rule follows the table.)
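The Experiment Setup row can be made concrete with a short sketch. The snippet below is illustrative only: the function names (`lr_at_epoch`, `sample_training_frames`) and the example video length are assumptions, not the authors' released code; only the 0.0125 initial learning rate, the decay by a factor of 10 at epochs 8 and 11 over 12 epochs, and the three-frame support/reference sampling rule come from the paper.

```python
import random

def lr_at_epoch(epoch, base_lr=0.0125, decay_epochs=(8, 11), gamma=0.1):
    """Stepped schedule: start at 0.0125, multiply by 0.1 at epochs 8 and 11."""
    lr = base_lr
    for step in decay_epochs:
        if epoch >= step:
            lr *= gamma
    return lr

def sample_training_frames(frame_ids, key_frame):
    """For a given key frame, randomly pick three other frames from the same video:
    two support frames (dual attention module) and one reference frame (tracking module)."""
    candidates = [f for f in frame_ids if f != key_frame]
    support_a, support_b, reference = random.sample(candidates, 3)
    return (support_a, support_b), reference

if __name__ == "__main__":
    # 12-epoch schedule as reported in the paper
    print([lr_at_epoch(e) for e in range(12)])
    # Hypothetical video with 20 frames, key frame index 5
    print(sample_training_frames(list(range(20)), key_frame=5))
```

In an MMDetection-based setup such as the one the paper describes, the schedule would more typically be expressed through the training config (optimizer learning rate, step decay epochs, total epochs) rather than hand-rolled code; the sketch above only spells out the reported values.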