Video Instance Segmentation using Inter-Frame Communication Transformers
Authors: Sukjun Hwang, Miran Heo, Seoung Wug Oh, Seon Joo Kim
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference) while having a considerably fast runtime (89.4 FPS). |
| Researcher Affiliation | Collaboration | Sukjun Hwang¹, Miran Heo¹, Seoung Wug Oh², Seon Joo Kim¹ — ¹Yonsei University, ²Adobe Research |
| Pseudocode | No | The paper describes the model architecture and processes using descriptive text and diagrams (Figure 1), but does not include any formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/sukjunhwang/IFC. |
| Open Datasets | Yes | We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference)... We used detectron2 [33] for our code basis... We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. ...the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p. |
| Dataset Splits | Yes | We validate our method on the latest benchmark sets and achieved state-of-the-art performance (AP 42.6 on YouTube-VIS 2019 val set using the offline inference)... We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. The pre-train procedure follows the shortened training schedule of DETR [13], which runs 300 epochs with a decay of the learning rate by a factor of 10 at 200 epochs. Using the pre-trained weights, the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p. |
| Hardware Specification | Yes | For fairness, FPS is measured on the same machine, using a single RTX 2080Ti GPU. |
| Software Dependencies | Yes | We used detectron2 [33] for our code basis... Measured using flop_count function of fvcore==0.1.5. |
| Experiment Setup | Yes | Unless specified, all models for measurements used NE = 3, ND = 3, stride of 1, and ResNet-50. We used AdamW [34] optimizer with an initial learning rate of 10⁻⁴ for transformers, and 10⁻⁵ for the backbone. We first pre-train the model for image instance segmentation on COCO [35] by setting our model to T = 1. The pre-train procedure follows the shortened training schedule of DETR [13], which runs 300 epochs with a decay of the learning rate by a factor of 10 at 200 epochs. Using the pre-trained weights, the models are trained on a targeted dataset using the batch size of 16, each clip composed of T = 5 frames downscaled to either 360p or 480p. For the sampling of each clip, a reference frame index t is randomly chosen. The remaining T − 1 frame indices are then sampled within an interval of 20. The models are trained for 8 epochs, with the learning rate decayed by a factor of 10 at the 5th epoch. |
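The quoted schedule is a simple step decay: an initial rate (10⁻⁴ for transformers, 10⁻⁵ for the backbone) cut by a factor of 10 at a fixed epoch (the 5th of 8 during fine-tuning; the 200th of 300 during COCO pre-training). A minimal sketch of that schedule, assuming nothing beyond the quoted hyperparameters (the helper name is ours, not from the paper or its codebase):

```python
def step_decay_lr(base_lr: float, epoch: int,
                  decay_epoch: int = 5, factor: float = 0.1) -> float:
    """Return the learning rate for a given epoch under step decay.

    Multiplies base_lr by `factor` once `epoch` reaches `decay_epoch`.
    Defaults mirror the quoted fine-tuning schedule (8 epochs, decay at
    epoch 5); pre-training uses decay_epoch=200 over 300 epochs.
    """
    return base_lr * factor if epoch >= decay_epoch else base_lr

# Per-epoch rates for the two parameter groups in the quoted setup.
transformer_lrs = [step_decay_lr(1e-4, e) for e in range(8)]
backbone_lrs = [step_decay_lr(1e-5, e) for e in range(8)]
```

In a detectron2/PyTorch setup this would typically be realized with two optimizer parameter groups and a step scheduler rather than a hand-rolled function; the sketch only makes the reported numbers concrete.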