Memory-Guided Semantic Learning Network for Temporal Sentence Grounding

Authors: Daizong Liu, Xiaoye Qu, Xing Di, Yu Cheng, Zichuan Xu, Pan Zhou

AAAI 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on three benchmarks show the superiority of our method in terms of both effectiveness and efficiency, substantially improving accuracy not only on the entire dataset but also on rare cases.
Researcher Affiliation | Collaboration | 1. The Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology; 2. School of Electronic Information and Communication, Huazhong University of Science and Technology; 3. Protago Labs Inc; 4. Microsoft Research; 5. Dalian University of Technology. Contact: {dzliu, xiaoye, panzhou}@hust.edu.cn, xing.di@protagolabs.com, yu.cheng@microsoft.com, z.xu@dlut.edu.cn
Pseudocode | No | The paper describes its method in prose and mathematical equations but does not include explicit pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide explicit statements or links for open-source code.
Open Datasets | Yes | ActivityNet Captions (Krishna et al. 2017) contains 20,000 untrimmed videos... TACoS (Regneri et al. 2013) is widely used for the TSG task... Charades-STA is built on the Charades dataset (Sigurdsson et al. 2016)...
Dataset Splits | Yes | ActivityNet Captions (Krishna et al. 2017) contains 20,000 untrimmed videos with 100,000 descriptions from YouTube. The videos are 2 minutes long on average, and the annotated clips vary widely in length, ranging from several seconds to over 3 minutes. Following the public split, we use 37,417, 17,505, and 17,031 sentence-video pairs for training, validation, and testing, respectively. TACoS (Regneri et al. 2013) is widely used for the TSG task and contains 127 videos. The TACoS videos are collected from cooking scenarios and thus lack diversity; they are around 7 minutes long on average. We use the same split as (Gao et al. 2017), which includes 10,146, 4,589, and 4,083 query-segment pairs for training, validation, and testing. (These splits are restated in the first sketch after this table.)
Hardware Specification | Yes | All the experiments are implemented on a single NVIDIA TITAN XP GPU.
Software Dependencies | No | We utilize 112 × 112 pixel frames as input, and apply C3D (Tran et al. 2015) to encode the videos on ActivityNet Captions and TACoS, and I3D (Carreira and Zisserman 2017) on Charades-STA. For sentence encoding, we set the length of word feature sequences to 20 and utilize GloVe embeddings (Pennington, Socher, and Manning 2014) to embed each word into 300-dimensional features. The hidden state dimension of the BiLSTM networks is set to 512. ... During training, we use an Adam optimizer with a learning rate of 0.0001.
Experiment Setup | Yes | Implementation Details: We utilize 112 × 112 pixel frames as input, and apply C3D (Tran et al. 2015) to encode the videos on ActivityNet Captions and TACoS, and I3D (Carreira and Zisserman 2017) on Charades-STA. We set the length of video feature sequences to 200 for the ActivityNet Captions and TACoS datasets, and 64 for the Charades-STA dataset. For sentence encoding, we set the length of word feature sequences to 20 and utilize GloVe embeddings (Pennington, Socher, and Manning 2014) to embed each word into 300-dimensional features. The hidden state dimension of the BiLSTM networks is set to 512. The numbers of memory items (L_V, L_Q) are set to (1024, 1024), (512, 512), and (512, 512) for the three datasets, respectively; we empirically find that further increasing the number of memory items yields no additional performance gain. The balance weights of L are λ1 = λ2 = λ3 = 1.0. During training, we use an Adam optimizer with a learning rate of 0.0001. The model is trained for 50 epochs with a batch size of 128 to guarantee convergence. (A hedged configuration sketch based on these details follows this table.)
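
The split sizes quoted in the Dataset Splits row can be restated as a small lookup for bookkeeping. This is purely illustrative: the dictionary layout and names are ours, not from the paper, and Charades-STA is omitted because the quoted excerpt does not report its split counts.

```python
# Hypothetical bookkeeping of the pair counts quoted above (structure and
# names are ours, not from the paper). Charades-STA is omitted because the
# quoted excerpt does not state its split sizes.
DATASET_SPLITS = {
    "ActivityNet Captions": {"train": 37_417, "val": 17_505, "test": 17_031},
    "TACoS": {"train": 10_146, "val": 4_589, "test": 4_083},
}

for name, split in DATASET_SPLITS.items():
    total = sum(split.values())
    print(f"{name}: {split} (total {total:,} pairs)")
```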
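The implementation details in the last two rows translate naturally into a training configuration. The sketch below is a minimal PyTorch rendering, assuming standard torch.nn and torch.optim APIs: only the hyperparameters (GloVe 300-d embeddings, BiLSTM hidden size 512, sequence lengths, memory item counts, Adam with learning rate 1e-4, batch size 128, 50 epochs) come from the paper, while the QueryEncoder module, vocabulary size, and dataset keys are our own illustrative stand-ins, not the authors' MGSL-Net.

```python
# A minimal sketch of the quoted experiment setup, assuming PyTorch.
# Hyperparameters mirror the paper; the encoder is an illustrative
# placeholder, not the authors' released model.
import torch
import torch.nn as nn

WORD_DIM = 300        # GloVe embedding dimension
HIDDEN_DIM = 512      # BiLSTM hidden state dimension
MAX_WORDS = 20        # word feature sequence length
MAX_CLIPS = {"activitynet": 200, "tacos": 200, "charades_sta": 64}
MEMORY_ITEMS = {"activitynet": (1024, 1024),   # (L_V, L_Q) per dataset
                "tacos": (512, 512),
                "charades_sta": (512, 512)}

class QueryEncoder(nn.Module):
    """GloVe-style word features fed through a BiLSTM, as described above."""
    def __init__(self, vocab_size: int):
        super().__init__()
        # In the paper, embedding weights would be initialized from GloVe.
        self.embed = nn.Embedding(vocab_size, WORD_DIM)
        self.bilstm = nn.LSTM(WORD_DIM, HIDDEN_DIM,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids: torch.Tensor) -> torch.Tensor:
        feats, _ = self.bilstm(self.embed(word_ids))
        return feats  # (batch, MAX_WORDS, 2 * HIDDEN_DIM)

model = QueryEncoder(vocab_size=10_000)  # vocab size is our assumption
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
BATCH_SIZE, NUM_EPOCHS = 128, 50
# Loss balance weights from the paper: L = λ1·L1 + λ2·L2 + λ3·L3
LAMBDA_1 = LAMBDA_2 = LAMBDA_3 = 1.0
```

The video side is left out of the sketch on the common assumption for this task that C3D/I3D clip features are pre-extracted offline; at these sizes, training on a single TITAN XP, as the paper reports, is plausible.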