Video Object of Interest Segmentation
Authors: Siyuan Zhou, Chunru Zhan, Biao Wang, Tiezheng Ge, Yuning Jiang, Li Niu
AAAI 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on Live Videos dataset show the superiority of our proposed method. |
| Researcher Affiliation | Collaboration | 1MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; 2Alibaba Group, Beijing, China |
| Pseudocode | No | The paper describes its methods using prose and diagrams (e.g., Figure 2, Figure 3), but it does not include a dedicated pseudocode block or algorithm listing. |
| Open Source Code | No | The paper does not provide any statement about releasing source code or a link to a code repository for the described methodology. |
| Open Datasets | No | The paper introduces a new dataset called 'Live Videos' which they constructed, but it does not provide concrete access information such as a direct link, DOI, or specific repository where the dataset can be accessed. |
| Dataset Splits | Yes | We then randomly split the dataset into 1935 training samples and 483 test samples. Each sample is annotated with pixel-level segmentation masks and object labels of all the objects that are relevant to the corresponding target image. |
| Hardware Specification | Yes | The model is trained on 32 Tesla V100 GPUs with distributed parallel. |
| Software Dependencies | Yes | Experiments are conducted with PyTorch-1.7 (Paszke et al. 2019). |
| Experiment Setup | Yes | The dual-path Swin Transformer backbone is a fusion of 2D Swin Transformer (Liu et al. 2021b), 3D Swin Transformer (Liu et al. 2022) with temporal patch size modified to 1, and two Cross Transformer blocks. The tiny version of 2D/3D Swin Transformer is chosen due to GPU memory limitation. The initial token dimension C is 96, so the backbone output dimension is 8C = 768. The Transformer decoder follows the structure in DETR (Carion et al. 2020), which contains 6 decoder layers with the hidden dimension modified to 384. The Transformer decoder decodes n = 10 objects for each frame. We adopt AdamW (Loshchilov and Hutter 2017) optimizer with learning rate 10^-5 for the dual-path Swin Transformer backbone and 10^-4 for the remaining parts. The model is trained for 18 epochs, where the learning rate decays by 10x after 12 epochs. |
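The optimizer configuration quoted above maps cleanly onto standard PyTorch APIs. The sketch below is a minimal, hypothetical reconstruction (not the authors' code): it assumes a `model` whose backbone parameters are named with a `backbone` prefix, and wires up AdamW with the two reported learning rates plus the 10x decay after epoch 12 of 18.

```python
# Hypothetical sketch of the reported training configuration.
# Assumes `model.named_parameters()` exposes the dual-path Swin Transformer
# backbone under a "backbone" prefix; this naming is an assumption, not from the paper.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import MultiStepLR

def build_optimizer_and_scheduler(model: torch.nn.Module):
    # Two parameter groups: 1e-5 for the backbone, 1e-4 for the remaining parts.
    backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
    other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]

    optimizer = AdamW([
        {"params": backbone_params, "lr": 1e-5},
        {"params": other_params, "lr": 1e-4},
    ])

    # Learning rate decays by 10x after 12 of the 18 training epochs.
    scheduler = MultiStepLR(optimizer, milestones=[12], gamma=0.1)
    return optimizer, scheduler
```

In a training loop, `scheduler.step()` would be called once per epoch for 18 epochs, matching the reported schedule.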