Weakly-Supervised Video Moment Retrieval via Semantic Completion Network

Authors: Zhijie Lin, Zhou Zhao, Zhu Zhang, Qi Wang, Huasheng Liu (pp. 11539-11546)

AAAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on the Activity Captions and Charades-STA demonstrate the effectiveness of our proposed method.
Researcher Affiliation | Collaboration | Zhijie Lin (1), Zhou Zhao (1), Zhu Zhang (1), Qi Wang (2), Huasheng Liu (2); (1) College of Computer Science, Zhejiang University, Hangzhou, China; (2) Alibaba Inc., China
Pseudocode | No | The paper describes algorithms in text and diagrams but does not include formal pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide any link or statement about open-sourcing its code.
Open Datasets | Yes | Experiments on the Activity Captions and Charades-STA demonstrate the effectiveness of our proposed method. ... Activity Captions. The Activity Captions (Caba Heilbron et al. 2015) dataset ... Charades-STA. The Charades-STA dataset is released in (Gao et al. 2017) for moment retrieval...
Dataset Splits | Yes | The released Activity Captions dataset comprises 17,031 description-moment pairs for training. Since the caption annotations of the test data of Activity Captions are not publicly available, we take val 1 as the validation set and val 2 as the test data.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU/CPU models or memory.
Software Dependencies | No | The paper mentions software components such as word2vec, NLTK, pretrained GloVe, the Adam optimizer, and the Transformer, but does not specify their version numbers.
Experiment Setup | Yes | Model Settings. At each time step of the video, we score nk candidate proposals of multiple scales. We set nk to 6 with ratios of [0.167, 0.333, 0.500, 0.667, 0.834, 1.0] for Activity Captions, and to 4 with ratios of [0.167, 0.250, 0.333, 0.500] for Charades-STA. We then set the decay hyperparameter λ1 to 0.5, λ2 to 2000, the number of selected proposals K to 4, and the balance hyperparameter β to 0.1. Also, we mask one-third of the words in a sentence and replace them with a special token for semantic completion. Note that nouns and verbs are more likely to be masked. Moreover, for both the Transformer encoder and the Transformer decoder, the dimension of the hidden state is set to 256 and the number of layers is set to 3. During training, we adopt the Adam optimizer with learning rate 0.0002 to minimize the multi-task loss. The learning rate increases linearly to the maximum over a warm-up of 400 steps and then decreases based on the number of updates (Vaswani et al. 2017).
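The warm-up schedule cited above (Vaswani et al. 2017) is the inverse-square-root ("Noam") schedule. A minimal sketch, under the assumption that the schedule is scaled so its peak equals the stated maximum learning rate of 0.0002 (the summary gives only the rate and the 400-step warm-up, not the exact scaling):

```python
def warmup_lr(step, max_lr=2e-4, warmup=400):
    """Inverse-square-root learning-rate schedule (Vaswani et al. 2017).

    Rises linearly to max_lr over `warmup` steps, then decays
    proportionally to step**-0.5. The peak-equals-max_lr scaling is an
    assumption; the paper states only lr=0.0002 and warmup=400.
    """
    step = max(step, 1)  # guard against division by zero at step 0
    return max_lr * min(step / warmup, (warmup / step) ** 0.5)
```

For example, halfway through warm-up the rate is half the maximum (`warmup_lr(200)` → 1e-4), it peaks at step 400, and four warm-up periods later it has decayed back to half (`warmup_lr(1600)` → 1e-4).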