Video Object of Interest Segmentation
Authors: Siyuan Zhou, Chunru Zhan, Biao Wang, Tiezheng Ge, Yuning Jiang, Li Niu
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Live Videos dataset show the superiority of our proposed method. |
| Researcher Affiliation | Collaboration | 1MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; 2Alibaba Group, Beijing, China |
| Pseudocode | No | The paper describes its methods using prose and diagrams (e.g., Figure 2, Figure 3), but it does not include a dedicated pseudocode block or algorithm listing. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper introduces a new dataset called 'Live Videos' which they constructed, but it does not provide concrete access information such as a direct link, DOI, or specific repository where the dataset can be accessed. |
| Dataset Splits | Yes | We then randomly split the dataset into 1935 training samples and 483 test samples. Each sample is annotated with pixel-level segmentation masks and object labels of all the objects that are relevant to the corresponding target image. |
| Hardware Specification | Yes | The model is trained on 32 Tesla V100 GPUs with distributed parallel. |
| Software Dependencies | Yes | Experiments are conducted with PyTorch-1.7 (Paszke et al. 2019). |
| Experiment Setup | Yes | The dual-path Swin Transformer backbone is a fusion of 2D Swin Transformer (Liu et al. 2021b), 3D Swin Transformer (Liu et al. 2022) with temporal patch size modified to 1, and two Cross Transformer blocks. The tiny version of 2D/3D Swin Transformer is chosen due to GPU memory limitation. The initial token dimension C is 96, so the backbone output dimension is 8C = 768. The Transformer decoder follows the structure in DETR (Carion et al. 2020), which contains 6 decoder layers with the hidden dimension modified to 384. The Transformer decoder decodes n = 10 objects for each frame. We adopt AdamW (Loshchilov and Hutter 2017) optimizer with learning rate 10^-5 for the dual-path Swin Transformer backbone and 10^-4 for the remaining parts. The model is trained for 18 epochs, where the learning rate decays by 10x after 12 epochs. |
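The optimizer configuration quoted above maps cleanly onto standard PyTorch APIs. The sketch below is a minimal, hypothetical reconstruction (not the authors' code): it assumes a `model` whose backbone parameters are named with a `backbone` prefix, and wires up AdamW with the two reported learning rates plus the 10x decay after epoch 12 of 18.

```python
# Hypothetical sketch of the reported training configuration.
# Assumes `model.named_parameters()` exposes the dual-path Swin Transformer
# backbone under a "backbone" prefix; this naming is an assumption, not from the paper.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

def build_optimizer_and_scheduler(model: torch.nn.Module):
    # Two parameter groups: 1e-5 for the backbone, 1e-4 for the remaining parts.
    backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
    other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]

    optimizer = AdamW([
        {"params": backbone_params, "lr": 1e-5},
        {"params": other_params, "lr": 1e-4},
    ])

    # Learning rate decays by 10x after 12 of the 18 training epochs.
    scheduler = MultiStepLR(optimizer, milestones=[12], gamma=0.1)
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per epoch for 18 epochs, matching the reported schedule.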