Learning to Learn Better for Video Object Segmentation

Authors: Meng Lan, Jing Zhang, Lefei Zhang, Dacheng Tao

AAAI 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on public benchmarks show that our proposed LLB method achieves state-of-the-art performance.
Researcher Affiliation | Collaboration | 1 Institute of Artificial Intelligence and School of Computer Science, Wuhan University, China; 2 The University of Sydney, Australia; 3 JD Explore Academy, China; 4 Hubei Luojia Laboratory, China
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/ViTAE-Transformer/VOS-LLB.
Open Datasets | Yes | Our proposed model is evaluated on three benchmark datasets, namely DAVIS 2017 val set (Pont-Tuset et al. 2017), YouTube-VOS val sets of 2018 and 2019 versions (Xu et al. 2018). [...] We first pre-train the model on the synthetic video sequence generated from static image dataset COCO (Lin et al. 2014) and then finetune the model on the DAVIS 2017 and YouTube-VOS 2019 training sets.
Dataset Splits | Yes | Our proposed model is evaluated on three benchmark datasets, namely DAVIS 2017 val set (Pont-Tuset et al. 2017), YouTube-VOS val sets of 2018 and 2019 versions (Xu et al. 2018). [...] The whole training process contains 140K iterations with a batch size of 32, where the first 50K iterations is the pre-training stage and the rest 90K iterations is the finetuning stage. [...] During training, the past frames and the current frame are extracted by the backbone to obtain high-level image features M ∈ R^{N×H×W×C} and X ∈ R^{H×W×C} respectively, where N is the number of the past frames, H and W are the height and width, and C is the channel dimension. The DLGM takes the past frame-mask pairs as input and outputs two target encodings E1 and E2 for the two branches.
Hardware Specification | Yes | The model is trained on 4 Nvidia A100 GPUs and tested on a V100 GPU in PyTorch.
Software Dependencies | No | The paper mentions "PyTorch" as the framework but does not specify its version number or any other software dependencies with their respective version numbers.
Experiment Setup | Yes | The whole training process contains 140K iterations with a batch size of 32, where the first 50K iterations is the pre-training stage and the rest 90K iterations is the finetuning stage. [...] The features of stride 16 in both backbone and DLGM are selected as input of the two branches, in which the feature channels are first reduced from 1024 to 512 using an additional convolutional layer, thus the input channel and output channel in both branches are C = 512 and D = 32, respectively. [...] our few-shot learner employs 20 iterations in the first frame with a zero initialization τ_0 = 0 and 3 iterations in each subsequent sample in the memory to update τ_t = A_θ(M_t, τ_{t-1}). [...] each frame in the sequence is processed by first cropping a patch that is 5 times larger than the previous estimate of target, while ensuring the maximal size to be equal to the image itself, and then the cropped patch is resized to 832×480.
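The quoted inner-loop schedule and cropping rule can be sketched in code. The snippet below is a minimal illustration assembled from the quotes above, not the authors' implementation: `learner_update` is a no-op stand-in for the learned optimizer A_θ, and the frame count, helper names, and target-box size used in the demo are hypothetical.

```python
D = 32                    # target-model channel dimension from the quoted setup
IMG_H, IMG_W = 480, 832   # crop resolution quoted above (resized to 832x480)

updates = []  # records the memory size at each inner-loop step, for illustration

def learner_update(tau, memory):
    """Stand-in for the learned few-shot optimizer A_theta(M_t, tau_{t-1}).
    The real update is learned; this placeholder only records the schedule."""
    updates.append(len(memory))
    return tau

def crop_size(prev_target_h, prev_target_w):
    """Crop a patch 5x the previous target estimate, capped at the image size."""
    return (min(5 * prev_target_h, IMG_H), min(5 * prev_target_w, IMG_W))

tau = [0.0] * D           # zero initialization, tau_0 = 0
memory = ["frame_0"]      # first frame-mask pair

for _ in range(20):       # 20 iterations on the first frame
    tau = learner_update(tau, memory)

for t in range(1, 5):     # 4 hypothetical later frames added to the memory
    memory.append(f"frame_{t}")
    for _ in range(3):    # 3 iterations per subsequent memory sample
        tau = learner_update(tau, memory)

print(len(updates))        # 20 + 4 * 3 = 32 learner steps
print(crop_size(100, 100)) # (480, 500): 5x height would exceed the image, so it is capped
```

The cap in `crop_size` mirrors the quoted constraint that the cropped patch never exceeds the image itself.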