You Only Infer Once: Cross-Modal Meta-Transfer for Referring Video Object Segmentation
Authors: Dezhuang Li, Ruoqi Li, Lijun Wang, Yifan Wang, Jinqing Qi, Lu Zhang, Ting Liu, Qingquan Xu, Huchuan Lu (pp. 1297-1305)
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two popular RVOS benchmarks have verified the effectiveness of our method. We first perform an overall comparison with state-of-the-art methods on the RVOS benchmark datasets, followed by the ablative studies to verify our main contributions. |
| Researcher Affiliation | Collaboration | 1 Dalian University of Technology, Dalian, China 2 Meitu Inc., China {Merci, dutlrq77}@mail.dlut.edu.cn, {ljwang, wyfan, jinqing}@dlut.edu.cn, luzhangdut@gmail.com, {lt, xqq}@meitu.com, lhchuan@dlut.edu.cn |
| Pseudocode | No | The paper describes its modules with block diagrams (Figure 2, 3) and mathematical formulations, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link regarding the public availability of its source code. |
| Open Datasets | Yes | We learn our method using the training sets of Refer-YouTube-VOS (Seo, Lee, and Han 2020), Refer-DAVIS2017 (Khoreva, Rohrbach, and Schiele 2018), and RefCOCO (Nagaraja, Morariu, and Davis 2016). |
| Dataset Splits | Yes | Table 1 shows the comparison results on Refer-DAVIS2017 validation set. At each iteration, we randomly sample 4 frames within a temporal window size of 100 from a training video, serving as the input to the network. |
| Hardware Specification | Yes | The proposed method runs at 10 FPS per object on NVIDIA 1080TI GPU, which has a good trade-off between efficiency and accuracy. |
| Software Dependencies | No | The paper mentions several components like ResNet-50, BERT model, Lovasz segmentation loss, and Adam optimizer, but does not specify their versions or the versions of underlying software frameworks (e.g., Python, PyTorch, TensorFlow). |
| Experiment Setup | Yes | We empirically set the hyper-parameter λ in (1) to 0.01. We set the memory size N to 3. At each iteration, we randomly sample 4 frames within a temporal window size of 100 from a training video... The whole network is end-to-end trained using the Lovasz segmentation loss... Adam optimizer... is adopted with a batch size of 4. We first train our network for 70 epochs... The default learning rate is 2e-4 which decays by 0.2 in the 40th epoch. Then the whole network is jointly trained for another 80 epochs. The default learning rate here is 2e-5 which decays by 0.2 in the 25th, 75th epoch. |
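The two-stage learning-rate schedule quoted above can be made concrete with a small sketch. The function below is purely illustrative (the name `learning_rate` and the stage encoding are our own, not from the paper); it reproduces the stated schedule: stage 1 starts at 2e-4 and decays by a factor of 0.2 at epoch 40, while stage 2 starts at 2e-5 and decays by 0.2 at epochs 25 and 75.

```python
def learning_rate(stage: int, epoch: int) -> float:
    """Piecewise-constant LR schedule as described in the paper's setup.

    stage 1 (70 epochs): base LR 2e-4, multiplied by 0.2 from epoch 40 on.
    stage 2 (80 epochs): base LR 2e-5, multiplied by 0.2 at epochs 25 and 75.
    """
    if stage == 1:
        lr = 2e-4
        if epoch >= 40:
            lr *= 0.2
    elif stage == 2:
        lr = 2e-5
        for milestone in (25, 75):
            if epoch >= milestone:
                lr *= 0.2
    else:
        raise ValueError("stage must be 1 or 2")
    return lr
```

In a PyTorch setup this would typically be implemented with a multi-step scheduler attached to the Adam optimizer, but since the paper does not state its framework, the schedule is shown framework-agnostically.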