Video Instance Segmentation using Inter-Frame Communication Transformers

Authors: Sukjun Hwang, Miran Heo, Seoung Wug Oh, Seon Joo Kim

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference) while having a considerably fast runtime (89.4 FPS)."
Researcher Affiliation | Collaboration | Sukjun Hwang (Yonsei University), Miran Heo (Yonsei University), Seoung Wug Oh (Adobe Research), Seon Joo Kim (Yonsei University)
Pseudocode | No | The paper describes the model architecture and processes using descriptive text and diagrams (Figure 1), but does not include any formal pseudocode or algorithm blocks.
Open Source Code | Yes | "The code is available at https://github.com/sukjunhwang/IFC."
Open Datasets | Yes | "We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference)... We used detectron2 [33] for our code basis... We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. ...the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p."
Dataset Splits | Yes | "We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference)... We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. The pre-train procedure follows the shortened training schedule of DETR [13], which runs 300 epochs with a decay of the learning rate by a factor of 10 at 200 epochs. Using the pre-trained weights, the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p."
Hardware Specification | Yes | "For fairness, FPS is measured on the same machine, using a single RTX 2080Ti GPU."
Software Dependencies | Yes | "We used detectron2 [33] for our code basis... Measured using flop_count function of fvcore==0.1.5."
Experiment Setup | Yes | "Unless specified, all models for measurements used NE = 3, ND = 3, stride of 1, and ResNet-50. We used AdamW [34] optimizer with initial learning rate of 10^-4 for transformers, and 10^-5 for backbone. We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. The pre-train procedure follows the shortened training schedule of DETR [13], which runs 300 epochs with a decay of the learning rate by a factor of 10 at 200 epochs. Using the pre-trained weights, the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p. For the sampling of each clip, a reference frame index t is randomly chosen. The remaining T - 1 frame indices are then sampled within an interval of 20. The models are trained for 8 epochs, and the learning rate decays by a factor of 10 at the 5th epoch."
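The quoted setup describes a simple step schedule: separate initial learning rates for the transformer (10^-4) and the backbone (10^-5), each decayed by a factor of 10 at a single milestone (epoch 200 of 300 for COCO pre-training, epoch 5 of 8 for VIS fine-tuning). A minimal sketch of that schedule, assuming a plain step decay; `lr_at_epoch` is an illustrative helper, not taken from the released code:

```python
def lr_at_epoch(base_lr: float, epoch: int, decay_epoch: int) -> float:
    """Step schedule: base_lr before decay_epoch, base_lr / 10 afterwards."""
    return base_lr * (0.1 if epoch >= decay_epoch else 1.0)

# COCO pre-training (T = 1): 300 epochs, decay at epoch 200.
pretrain_lrs = [lr_at_epoch(1e-4, e, decay_epoch=200) for e in range(300)]

# VIS fine-tuning (T = 5 clips): 8 epochs, decay at epoch 5,
# with the lower base rate applied to the backbone parameters.
finetune_transformer_lrs = [lr_at_epoch(1e-4, e, decay_epoch=5) for e in range(8)]
finetune_backbone_lrs    = [lr_at_epoch(1e-5, e, decay_epoch=5) for e in range(8)]
```

In a framework like detectron2 (which the paper uses as its code basis), the two base rates would typically be realized as separate optimizer parameter groups, with the milestone handled by a step scheduler.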