TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation
Authors: Rongkun Zheng, Lu Qi, Xi Chen, Yi Wang, Kun Wang, Yu Qiao, Hengshuang Zhao
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experimental evaluations on four popular and challenging benchmarks, including YouTube-VIS 2019, YouTube-VIS 2021, OVIS, and UVO. Our model shows significant improvement over the baseline solutions, and sets new state-of-the-art records on all benchmarks. |
| Researcher Affiliation | Collaboration | 1 The University of Hong Kong, 2 University of California, Merced, 3 Shanghai Artificial Intelligence Laboratory, 4 SenseTime Research |
| Pseudocode | No | The paper describes its methodology in text and uses figures to illustrate the framework, but it does not include formal pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is available at https://github.com/rkzheng99/TMT-VIS. |
| Open Datasets | Yes | We conduct extensive experimental evaluations on four popular and challenging benchmarks, including YouTube-VIS 2019 and 2021 [47], OVIS [35], and UVO [39]. |
| Dataset Splits | Yes | YouTube-VIS 2019 [47] is the first large-scale dataset for video instance segmentation, with 2.9K videos averaging 4.61s in duration and 27.4 frames in validation videos. YouTube-VIS 2021 [47] includes more challenging longer videos with more complex trajectories, resulting in an average of 39.7 frames in validation videos. Table 3: Ablation study on training with multiple VIS datasets with Mask2Former-VIS (which is abbreviated as M2F) and TMT-VIS and their validation results on various VIS datasets. |
| Hardware Specification | No | The paper mentions model backbones (e.g., ResNet-50, Swin-L) but does not provide specific details about the hardware (GPU, CPU models, memory, etc.) used for running the experiments. |
| Software Dependencies | No | Our method is implemented on top of detectron2 [46]. The paper names detectron2 as its code base but does not specify version numbers for it or for any other software dependencies. |
| Experiment Setup | Yes | Our method is implemented on top of detectron2 [46]. Hyper-parameters regarding the pixel decoder and transformer decoder are consistent with the settings of Mask2Former-VIS [7]. In the Taxonomy Compilation Module, the size of the taxonomy embedding set N_T is set to 10, which matches the maximum instance number per video. ... we set λcls = 2.0 and λtaxo = 0.5. ... During inference, we resize the shorter side of each frame to 360 pixels for ResNet [14] backbones and 480 pixels for Swin [29] backbones. (A hedged configuration sketch of these settings is given below the table.) |
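
The settings quoted in the Experiment Setup row can be collected into a single configuration object. The sketch below is illustrative only: the class and field names (`TMTVISConfig`, `taxonomy_set_size`, `lambda_cls`, `lambda_taxo`, `inference_short_side`) are hypothetical and need not match the keys used in the official TMT-VIS or detectron2 configs; only the numeric values come from the paper.

```python
# Illustrative summary of the TMT-VIS hyper-parameters quoted above.
# Field names are hypothetical; the values are taken from the paper's experiment setup.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class TMTVISConfig:
    # Taxonomy Compilation Module: size of the taxonomy embedding set (N_T),
    # chosen to match the maximum number of instances per video.
    taxonomy_set_size: int = 10
    # Loss weights: lambda_cls for the classification term, lambda_taxo for the taxonomy term.
    lambda_cls: float = 2.0
    lambda_taxo: float = 0.5
    # Shorter-side resize at inference, keyed by backbone family.
    inference_short_side: Dict[str, int] = field(
        default_factory=lambda: {"resnet": 360, "swin": 480}
    )


if __name__ == "__main__":
    cfg = TMTVISConfig()
    print(f"N_T={cfg.taxonomy_set_size}, "
          f"lambda_cls={cfg.lambda_cls}, lambda_taxo={cfg.lambda_taxo}, "
          f"short side (Swin)={cfg.inference_short_side['swin']}px")
```

The remaining decoder hyper-parameters are stated to follow Mask2Former-VIS [7], so they are not repeated in this sketch.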