Learning to Learn Better for Video Object Segmentation
Authors: Meng Lan, Jing Zhang, Lefei Zhang, Dacheng Tao
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on public benchmarks show that our proposed LLB method achieves state-of-the-art performance. |
| Researcher Affiliation | Collaboration | 1 Institute of Artificial Intelligence and School of Computer Science, Wuhan University, China 2 The University of Sydney, Australia 3 JD Explore Academy, China 4 Hubei Luojia Laboratory, China |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/ViTAETransformer/VOS-LLB. |
| Open Datasets | Yes | Our proposed model is evaluated on three benchmark datasets, namely DAVIS 2017 val set (Pont-Tuset et al. 2017), YouTube-VOS val sets of 2018 and 2019 versions (Xu et al. 2018). [...] We first pre-train the model on the synthetic video sequence generated from static image dataset COCO (Lin et al. 2014) and then finetune the model on the DAVIS 2017 and YouTube-VOS 2019 training sets. |
| Dataset Splits | Yes | Our proposed model is evaluated on three benchmark datasets, namely DAVIS 2017 val set (Pont-Tuset et al. 2017), YouTube-VOS val sets of 2018 and 2019 versions (Xu et al. 2018). [...] The whole training process contains 140K iterations with a batch size of 32, where the first 50K iterations is the pre-training stage and the rest 90K iterations is the finetuning stage. [...] During training, the past frames and the current frame are extracted by the backbone to obtain high-level image features M ∈ ℝ^(N×H×W×C) and X ∈ ℝ^(H×W×C) respectively, where N is the number of the past frames, H and W are the height and width, and C is the channel dimension. The DLGM takes the past frame-mask pairs as input and output two target encodings E1 and E2 for the two branches. |
| Hardware Specification | Yes | The model is trained on 4 Nvidia A100 GPUs and tested on a V100 GPU in Pytorch. |
| Software Dependencies | No | The paper mentions "Pytorch" as the framework but does not specify its version number or the versions of any other software dependencies. |
| Experiment Setup | Yes | The whole training process contains 140K iterations with a batch size of 32, where the first 50K iterations is the pre-training stage and the rest 90K iterations is the finetuning stage. [...] The features of stride 16 in both backbone and DLGM are selected as input of the two branches, in which the feature channels are first reduced from 1024 to 512 using an additional convolutional layer, thus the input channel and output channel in both branches are C = 512 and D = 32, respectively. [...] our few-shot learner employs 20 iterations in the first frame with a zero initialization τ₀ = 0 and 3 iterations in each subsequent sample in the memory to update τₜ = Aθ(Mₜ, τₜ₋₁). [...] each frame in the sequence is processed by first cropping a patch that is 5 times larger than the previous estimate of target, while ensuring the maximal size to be equal to the image itself, and then the cropped patch is resized to 832×480. |
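The test-time preprocessing quoted in the Experiment Setup row (crop a patch 5× the previous target estimate, capped at the full image, then resize to 832×480) can be sketched as below. This is a minimal illustration, not the authors' code: the function name, the choice to center the crop on the target box, and the nearest-neighbour resize are all assumptions.

```python
import numpy as np

def crop_and_resize(frame, target_box, scale=5.0, out_size=(832, 480)):
    """Crop a patch `scale` times the previous target box (capped at the
    image size) and resize it to `out_size` (width, height).

    frame: H x W x 3 array; target_box: (x, y, w, h) previous estimate.
    """
    H, W = frame.shape[:2]
    x, y, w, h = target_box
    cw = min(int(w * scale), W)            # cap crop width at image width
    ch = min(int(h * scale), H)            # cap crop height at image height
    cx, cy = x + w / 2, y + h / 2          # assumed: crop centered on target
    x0 = int(min(max(cx - cw / 2, 0), W - cw))
    y0 = int(min(max(cy - ch / 2, 0), H - ch))
    patch = frame[y0:y0 + ch, x0:x0 + cw]
    # nearest-neighbour resize to the fixed inference resolution
    ow, oh = out_size
    ys = (np.arange(oh) * ch / oh).astype(int)
    xs = (np.arange(ow) * cw / ow).astype(int)
    return patch[ys][:, xs]

frame = np.zeros((720, 1280, 3), dtype=np.uint8)
patch = crop_and_resize(frame, (600, 300, 100, 80))
print(patch.shape)  # (480, 832, 3)
```

For a small target the 5× rule keeps surrounding context in view, while the cap guarantees the crop never exceeds the frame, matching the constraint stated in the paper.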