Unsupervised Temporal Video Grounding with Deep Semantic Clustering

Authors: Daizong Liu, Xiaoye Qu, Yinzhen Wang, Xing Di, Kai Zou, Yu Cheng, Zichuan Xu, Pan Zhou

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate the effectiveness of our DSCNet, we conduct experiments on both ActivityNet Captions and Charades-STA datasets. The results demonstrate that DSCNet achieves competitive performance, and even outperforms most weakly-supervised approaches.
Researcher Affiliation | Collaboration | Daizong Liu1,2, Xiaoye Qu2, Yinzhen Wang3, Xing Di4, Kai Zou4, Yu Cheng5, Zichuan Xu6, Pan Zhou1*; 1The Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology; 2School of Electronic Information and Communication, Huazhong University of Science and Technology; 3School of Computer Science and Technology, Huazhong University of Science and Technology; 4Protago Labs Inc; 5Microsoft Research; 6Dalian University of Technology
Pseudocode | Yes | Algorithm 1: Iterative learning process of the video module.
Input: all semantic cluster centers C of the whole query set; video feature F.
1: Initialize pseudo labels based on C and F;
2: for iteration l = 1 to L do
3:   for neck i = 1 to N_e do
4:     Execute the specific attention branch with C_i = {c_j^i}_{j=1}^{N_c} to obtain F̂_v = {f̂_v^t}_{t=1}^{T} and A_spe;
5:     Execute the foreground attention branch to obtain A_fore;
6:     Generate training samples from the pseudo labels, and calculate the overall loss L_v for backpropagation;
7:     Generate the new feature F̂_v, and use it to update the pseudo labels;
8:   end for
9: end for
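For orientation, a minimal PyTorch-style sketch of that loop follows. The neck modules, fore_branch, assign_pseudo_labels, and overall_loss are hypothetical placeholders standing in for the paper's specific/foreground attention branches, pseudo-label assignment, and loss L_v; this is not the authors' code.

def train_video_module(video_feat, cluster_centers, necks, fore_branch,
                       optimizer, num_iters, assign_pseudo_labels, overall_loss):
    # Initialize pseudo labels from the cluster centers C and raw video features F
    pseudo_labels = assign_pseudo_labels(video_feat, cluster_centers)
    for _ in range(num_iters):                         # iteration l = 1 .. L
        for i, neck in enumerate(necks):               # neck i = 1 .. N_e
            # Specific attention branch: enhanced features F_hat and attention A_spe
            feat_hat, a_spe = neck(video_feat, cluster_centers[i])
            a_fore = fore_branch(feat_hat)             # foreground attention A_fore
            # Training samples come from the current pseudo labels; L_v drives the update
            loss = overall_loss(feat_hat, a_spe, a_fore, pseudo_labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # Regenerate features and refresh the pseudo labels with them
            pseudo_labels = assign_pseudo_labels(feat_hat.detach(), cluster_centers)
    return pseudo_labels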
Open Source Code | No | The paper does not provide an explicit statement about releasing its source code or a link to a code repository.
Open Datasets | Yes | To validate the effectiveness of our DSCNet, we conduct experiments on both ActivityNet Captions and Charades-STA datasets. The ActivityNet Captions dataset is built from the ActivityNet v1.3 dataset (Caba Heilbron et al. 2015) for dense video captioning. The Charades-STA dataset is built from the Charades dataset (Sigurdsson et al. 2016) and transformed into the video temporal grounding task by (Gao et al. 2017).
Dataset Splits | Yes | ActivityNet Captions: this dataset is built from the ActivityNet v1.3 dataset (Caba Heilbron et al. 2015) for dense video captioning. It contains 20,000 YouTube videos with 100,000 queries. We follow the public split of the dataset, which contains a training set and two validation sets, val_1 and val_2.
Hardware Specification | No | The paper does not provide specific hardware details such as the GPU or CPU models used for experiments.
Software Dependencies | No | The paper mentions the GloVe model, the C3D network, and the Adam optimizer, but does not specify software dependencies with version numbers.
Experiment Setup | Yes | In order to make a fair comparison with previous works, we utilize C3D to extract video features and GloVe to obtain word embeddings. As some videos are too long, we set the length of video feature sequences to 128 for Charades-STA and 256 for ActivityNet Captions. We fix the query length to 10 in Charades-STA and 20 in ActivityNet Captions. We set the neck number N_e to 4 for Charades-STA and 8 for ActivityNet Captions, and set the cluster number N_c to 16. The LSTM layers in the language encoder and decoder both use a 2-layer architecture with a hidden size of 512. The dimension of the joint embedding space d_e is set to 1024. We utilize the Adam optimizer with an initial learning rate of 0.0001 for the language module and 0.0005 for the video module. The hyper-parameters θ, τ1, τ2, τ3 are set to 1.0, 0.0001, 0.0001, 0.5; λ is set to 0.5; and αw, βw, αv, βv in Eq. (3) and (15) are all set to 0.5. The inference threshold is set to 0.8 on ActivityNet and 0.9 on Charades-STA.
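For convenience, the quoted settings can be gathered into a single configuration sketch; the dictionary and its key names below are hypothetical, and only the numeric values are taken from the setup above.

# Hypothetical key names; values are the settings reported in the paper.
DSCNET_CONFIG = {
    "charades_sta": {
        "video_seq_len": 128, "query_len": 10, "num_necks": 4,
        "inference_threshold": 0.9,
    },
    "activitynet_captions": {
        "video_seq_len": 256, "query_len": 20, "num_necks": 8,
        "inference_threshold": 0.8,
    },
    "shared": {
        "num_clusters": 16,              # N_c
        "lstm_layers": 2, "lstm_hidden": 512,
        "joint_embed_dim": 1024,         # d_e
        "optimizer": "Adam",
        "lr_language_module": 1e-4, "lr_video_module": 5e-4,
        "theta": 1.0, "tau1": 1e-4, "tau2": 1e-4, "tau3": 0.5,
        "lambda": 0.5,
        "alpha_w": 0.5, "beta_w": 0.5, "alpha_v": 0.5, "beta_v": 0.5,
    },
}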