Towards Global Video Scene Segmentation with Context-Aware Transformer

Authors: Yang Yang, Yurui Huang, Weili Guo, Baohua Xu, Dingyin Xia

AAAI 2023

Reproducibility checklist (each entry gives the variable, the result, and the supporting LLM response):
Research Type: Experimental
    "Our empirical analyses show that CAT can achieve state-of-the-art performance when conducting the scene segmentation task on the MovieNet dataset, e.g., offering 2.15 improvements on AP."
Researcher Affiliation: Collaboration
    Yang Yang1,2,3*, Yurui Huang1, Weili Guo1, Baohua Xu4, Dingyin Xia4; 1Nanjing University of Science and Technology; 2MIIT Key Lab. of Pattern Analysis and Machine Intelligence, NUAA; 3State Key Lab. for Novel Software Technology, NJU; 4HUAWEI CBG Edu AI Lab
Pseudocode: No
    The paper describes its methods in detail with textual explanations and figures, but it does not include a dedicated pseudocode or algorithm block.
Open Source Code: Yes
    "Code is available at https://github.com/njustkmg/CAT."
Open Datasets: Yes
    "Dataset. Considering the availability and scale of video segmentation datasets, we adopt the MovieNet dataset following all current state-of-the-art methods (Rao et al. 2020; Chen et al. 2021; Wu et al. 2022b; Mun et al. 2022)." The MovieNet (Huang et al. 2020) dataset provides 1,100 movies, 318 of which are annotated with scene boundaries.
Dataset Splits: Yes
    "The whole annotation set is split into Train, Validation, and Test sets with a ratio of 10:2:3 at the video level following (Huang et al. 2020); the scene boundaries are annotated at the shot level."
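The 10:2:3 video-level split described above can be sketched as follows; the helper name and the movie-ID format are illustrative assumptions, not taken from the CAT or MovieNet codebases.

```python
# Hypothetical sketch of a 10:2:3 video-level train/val/test split.
def split_videos(video_ids, ratio=(10, 2, 3)):
    """Partition a list of video IDs into train/val/test by the given ratio."""
    total = sum(ratio)
    n = len(video_ids)
    n_train = n * ratio[0] // total
    n_val = n * ratio[1] // total
    train = video_ids[:n_train]
    val = video_ids[n_train:n_train + n_val]
    test = video_ids[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Example with the 318 annotated MovieNet videos -> 212 / 42 / 64 videos.
train, val, test = split_videos([f"movie_{i:04d}" for i in range(318)])
```

Note that the split is done at the video level, while scene boundaries are labeled per shot, so all shots of a movie end up in the same partition.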
Hardware Specification: No
    The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory specifications) used to run the experiments.
Software Dependencies: No
    The paper does not list specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, or specific library versions).
Experiment Setup: Yes
    "For the CAT framework, we choose a 2-layer Transformer network with 8 heads, i.e., N = 8, as the encoder network architecture. ... For the pre-training stage, we cross-validate the number of neighbor shots among L = {1, 3, 5, 7} / P = {13, 15, 17, 19}, and L = 5 / P = 17 is selected for its good performance and computational efficiency. The optimization method is Adaptive Moment Estimation (Adam), and the learning rate is searched in {0.5, 0.1, 0.05, 0.01, 0.005, 0.001} to find the best setting for each task. Finally, we set the learning rate to 0.001. The hyper-parameters are µ = 0.3 and τ = 0.1."
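A minimal PyTorch sketch of the reported encoder setup (2-layer Transformer, 8 attention heads, Adam with learning rate 0.001). The feature dimension d_model, the batch size, and the use of `nn.TransformerEncoder` are assumptions for illustration; they are not taken from the paper or the CAT repository.

```python
import torch
import torch.nn as nn

d_model = 512  # assumed shot-feature dimension (not stated in this excerpt)

# 2-layer Transformer encoder with 8 heads, as reported.
encoder_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)

# Adam optimizer with the selected learning rate of 0.001.
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

# A batch of 4 windows of P = 17 shots each (L = 5 neighbor shots, mu = 0.3,
# and tau = 0.1 enter the pre-training losses defined in the paper).
shots = torch.randn(4, 17, d_model)
out = encoder(shots)  # same shape as the input: (4, 17, 512)
```

The learning-rate grid {0.5, 0.1, 0.05, 0.01, 0.005, 0.001} would simply be swept by re-instantiating the optimizer with each candidate `lr` and validating on the held-out split.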