Video Instance Segmentation using Inter-Frame Communication Transformers

Authors: Sukjun Hwang, Miran Heo, Seoung Wug Oh, Seon Joo Kim

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference) while having a considerably fast runtime (89.4 FPS)."
Researcher Affiliation | Collaboration | Sukjun Hwang (Yonsei University), Miran Heo (Yonsei University), Seoung Wug Oh (Adobe Research), Seon Joo Kim (Yonsei University)
Pseudocode | No | The paper describes the model architecture and processes using descriptive text and diagrams (Figure 1), but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | "The code is available at https://github.com/sukjunhwang/IFC."
Open Datasets | Yes | "We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference)... We used detectron2 [33] for our code basis... We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. ...the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p."
Dataset Splits | Yes | "We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference)... We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. The pre-train procedure follows the shortened training schedule of DETR [13], which runs 300 epochs with a decay of the learning rate by a factor of 10 at 200 epochs. Using the pre-trained weights, the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p."
Hardware Specification | Yes | "For fairness, FPS is measured on the same machine, using a single RTX 2080Ti GPU."
Software Dependencies | Yes | "We used detectron2 [33] for our code basis... Measured using flop_count function of fvcore==0.1.5."
Experiment Setup | Yes | "Unless specified, all models for measurements used NE = 3, ND = 3, stride of 1, and ResNet-50. We used AdamW [34] optimizer with initial learning rate of 10^-4 for transformers, and 10^-5 for backbone. We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. The pre-train procedure follows the shortened training schedule of DETR [13], which runs 300 epochs with a decay of the learning rate by a factor of 10 at 200 epochs. Using the pre-trained weights, the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p. For the sampling of each clip, a reference frame index t is randomly chosen. The remaining T - 1 frame indices are then sampled within an interval of 20. The models are trained for 8 epochs, and the learning rate decays by a factor of 10 at the 5th epoch."
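The quoted setup describes a simple step schedule: separate initial learning rates for the transformer (10^-4) and the backbone (10^-5), each decayed by a factor of 10 at a single milestone (epoch 200 of 300 for COCO pre-training, epoch 5 of 8 for VIS fine-tuning). A minimal sketch of that schedule, assuming a plain step decay; `lr_at_epoch` is an illustrative helper, not taken from the released code:

```python
def lr_at_epoch(base_lr: float, epoch: int, decay_epoch: int) -> float:
    """Step schedule: base_lr before decay_epoch, base_lr / 10 afterwards."""
    return base_lr * (0.1 if epoch >= decay_epoch else 1.0)

# COCO pre-training (T = 1): 300 epochs, decay at epoch 200.
pretrain_lrs = [lr_at_epoch(1e-4, e, decay_epoch=200) for e in range(300)]

# VIS fine-tuning (T = 5 clips): 8 epochs, decay at epoch 5,
# with the lower base rate applied to the backbone parameters.
finetune_transformer_lrs = [lr_at_epoch(1e-4, e, decay_epoch=5) for e in range(8)]
finetune_backbone_lrs    = [lr_at_epoch(1e-5, e, decay_epoch=5) for e in range(8)]
```

In a framework like detectron2 (which the paper uses as its code basis), the two base rates would typically be realized as separate optimizer parameter groups, with the milestone handled by a step scheduler.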