Hybrid Instance-Aware Temporal Fusion for Online Video Instance Segmentation

Authors: Xiang Li, Jinglu Wang, Xiao Li, Yan Lu (pp. 1429-1437)

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments have been conducted on popular VIS datasets, i.e., Youtube-VIS-19/21. Our model achieves the best performance among all online VIS methods. We conduct extensive ablation studies on Youtube-VIS-2019 to show the effectiveness of the different components of our method.
Researcher Affiliation | Collaboration | Xiang Li (1, 2), Jinglu Wang (2), Xiao Li (2), Yan Lu (2); 1: Department of Electrical and Computer Engineering, Carnegie Mellon University; 2: Microsoft Research Asia. Contact: xl6@andrew.cmu.edu, {jinglwa, xili11, yanlu}@microsoft.com
Pseudocode | No | No clearly labeled pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper does not contain an explicit statement about releasing the code for the described methodology, nor a link to a code repository.
Open Datasets | Yes | We evaluate our method on two widely used VIS datasets, Youtube-VIS-2019 and Youtube-VIS-2021.
Dataset Splits | Yes | Youtube-VIS-2019 has 40 categories, 4,883 unique video instances, and 131k high-quality manual annotations, with 2,238 training videos, 302 validation videos, and 343 test videos. Youtube-VIS-2021 is an improved version of Youtube-VIS-2019, containing 8,171 unique video instances and 232k high-quality manual annotations, with 2,985 training videos, 421 validation videos, and 453 test videos.
Hardware Specification | No | No specific hardware details (such as GPU/CPU models, memory, or type of computing cluster) used for running the experiments are mentioned in the paper.
Software Dependencies | No | The paper only mentions the "Tensorflow2 framework" without specifying a version number or listing other software dependencies and their versions.
Experiment Setup | Yes | All frames are resized and padded to 641×641 during training and inference. We train our model for 35k iterations with a poly learning rate policy, where the learning rate is multiplied by (1 − iter/iter_max)^0.9 at each iteration, with an initial learning rate of 0.001 for all experiments. The batch size is 32, and an Adam (Kingma and Ba 2014) optimizer with β1 = 0.9, β2 = 0.999, and weight decay = 0 is used. Multi-scale training is adopted to obtain a strong baseline. We select three adjacent frames as reference frames unless otherwise specified.
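The poly learning-rate policy quoted above can be sketched as a small standalone function; the helper name `poly_lr` is illustrative (not from the paper), and only the hyperparameter values (initial LR 0.001, 35k iterations, power 0.9) come from the quoted setup.

```python
def poly_lr(base_lr: float, iteration: int, max_iterations: int, power: float = 0.9) -> float:
    """Poly decay: lr = base_lr * (1 - iteration / max_iterations) ** power."""
    return base_lr * (1.0 - iteration / max_iterations) ** power

# Settings reported in the paper: initial LR 0.001 over 35k iterations.
lr_start = poly_lr(0.001, 0, 35_000)      # full base LR at iteration 0
lr_mid = poly_lr(0.001, 17_500, 35_000)   # partially decayed at the midpoint
lr_end = poly_lr(0.001, 35_000, 35_000)   # decays to 0 at the final iteration
```

The paper's Adam hyperparameters (β1 = 0.9, β2 = 0.999, weight decay = 0) would be passed to whatever framework's Adam implementation is in use; since the report could not pin down the software stack, no framework-specific optimizer call is shown.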