Unsupervised Domain Adaptative Temporal Sentence Localization with Mutual Information Maximization

Authors: Daizong Liu, Xiang Fang, Xiaoye Qu, Jianfeng Dong, He Yan, Yang Yang, Pan Zhou, Yu Cheng

AAAI 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Three sets of migration experiments show that our model achieves competitive performance compared to existing methods. ... We conduct the UDA experiments on three widely used TSL datasets (ActivityNet Captions, Charades-STA, and TACoS). Extensive results show that our proposed model performs much better than existing approaches.
Researcher Affiliation | Collaboration | (1) Wangxuan Institute of Computer Technology, Peking University; (2) Nanyang Technological University; (3) Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology; (4) College of Computer Science and Technology, Zhejiang Gongshang University; (5) Protagolabs Inc.; (6) Meta Platforms Inc.; (7) Department of Computer Science and Engineering, The Chinese University of Hong Kong
Pseudocode | No | The paper describes its algorithms and procedures in prose and equations, but does not include any clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper does not contain an explicit statement about the release of source code or a link to a code repository for the described methodology.
Open Datasets | Yes | For fair comparison with existing TSL works, we utilize the same ActivityNet Captions (Caba Heilbron et al. 2015), TACoS (Regneri et al. 2013), and Charades-STA (Sigurdsson et al. 2016) datasets for evaluation.
Dataset Splits | Yes | Specifically, ActivityNet Captions contains 20000 untrimmed videos with 100000 descriptions from YouTube. Following the public split, we use 37417, 17505, and 17031 sentence-video pairs for training, validation, and testing. TACoS contains 127 videos collected from cooking scenarios. We also follow the public split, which includes 10146, 4589, and 4083 query-segment pairs for training, validation, and testing. As for Charades-STA, there are 12408 and 3720 moment-query pairs in the training and testing sets, respectively. (These splits are summarized in the first sketch below the table.)
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU or GPU models, or memory) used for running the experiments.
Software Dependencies | No | The paper mentions using a 'pre-trained C3D network (Tran et al. 2015)', a 'VGG (Simonyan and Zisserman 2014) model', 'GloVe (Pennington, Socher, and Manning 2014) embedding', and the 'Adam optimizer', but it does not specify version numbers for any of these software components or libraries.
Experiment Setup | Yes | As for video encoding, following previous works (Zhang et al. 2020b; Wang et al. 2022), we apply the pre-trained C3D (Tran et al. 2015) model to encode the videos on ActivityNet Captions and TACoS, and the VGG (Simonyan and Zisserman 2014) model on Charades-STA. ... We uniformly downsample the video feature sequences to Nv = 200 for the ActivityNet Captions and TACoS datasets and Nv = 64 for the Charades-STA dataset. As for sentence encoding, we set the length of word feature sequences to Nq = 20 and utilize GloVe embedding (Pennington, Socher, and Manning 2014) to embed each word into 300-dimensional features. The dimension d is set to 512. We train our model for 100 epochs with an early stopping strategy. Parameter optimization is performed by the Adam optimizer with a learning rate of 0.0005 and a linear decay rate of 1.0. (These hyperparameters are collected in the second sketch below the table.)
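
The public train/validation/test splits quoted in the "Dataset Splits" row can be summarized in one place. This is a minimal sketch for reference only; the dictionary layout and all names are our own illustrative choices, since the paper releases no code or configuration files.

```python
# Public splits quoted from the paper, expressed as a plain dictionary.
# The structure is illustrative, not taken from any released code.
PUBLIC_SPLITS = {
    "ActivityNet Captions": {"train": 37417, "val": 17505, "test": 17031},  # sentence-video pairs
    "TACoS": {"train": 10146, "val": 4589, "test": 4083},                   # query-segment pairs
    "Charades-STA": {"train": 12408, "val": None, "test": 3720},            # no public validation split
}

if __name__ == "__main__":
    for dataset, split in PUBLIC_SPLITS.items():
        total = sum(n for n in split.values() if n is not None)
        print(f"{dataset}: {split} (total: {total} pairs)")
```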
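Similarly, the hyperparameters quoted in the "Experiment Setup" row can be collected into a single configuration object. The PyTorch-style sketch below is an assumption-laden illustration: the model is a stand-in placeholder (the authors' architecture is not public), every identifier is hypothetical, and the scheduler encodes just one plausible reading of "linear decay rate of 1.0" (a linear schedule toward zero over the 100 epochs).

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

# Hyperparameters quoted from the paper; the container and variable names
# are illustrative assumptions, since no official code is released.
CONFIG = {
    "Nv": {"ActivityNet Captions": 200, "TACoS": 200, "Charades-STA": 64},  # video feature length
    "Nq": 20,         # word feature sequence length
    "word_dim": 300,  # GloVe embedding dimension
    "d": 512,         # shared hidden dimension
    "epochs": 100,    # trained with an early stopping strategy
    "lr": 5e-4,       # Adam learning rate
    "linear_decay": 1.0,
}

# Stand-in module: the paper's actual model is not public.
model = torch.nn.Linear(CONFIG["word_dim"], CONFIG["d"])
optimizer = Adam(model.parameters(), lr=CONFIG["lr"])

# One plausible reading of "linear decay rate of 1.0": scale the learning
# rate linearly toward zero across the training epochs.
scheduler = LambdaLR(
    optimizer,
    lr_lambda=lambda epoch: max(0.0, 1.0 - CONFIG["linear_decay"] * epoch / CONFIG["epochs"]),
)
```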