Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining

Authors: Minghang Zheng, Yanjie Huang, Qingchao Chen, Yang Liu

AAAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment (variable, result, and the LLM's supporting response):

Research Type: Experimental
LLM Response: "Experiments on two datasets show the effectiveness of our method. Code can be found at https://github.com/minghangz/cnm." From the Experiments section: "To test the effectiveness of our proposed method, we perform experiments on two publicly available datasets, ActivityNet Captions (Caba Heilbron et al. 2015; Krishna et al. 2017) and Charades-STA (Gao et al. 2017)."

Researcher Affiliation: Academia
LLM Response: "(1) Wangxuan Institute of Computer Technology, Peking University; (2) National Institute of Health Data Science, Peking University; (3) Beijing Institute for General Artificial Intelligence"

Pseudocode: No
LLM Response: The paper describes the method using mathematical equations and diagrams, but does not provide structured pseudocode or algorithm blocks.

Open Source Code: Yes
LLM Response: "Code can be found at https://github.com/minghangz/cnm."

Open Datasets: Yes
LLM Response: "ActivityNet Captions dataset is released in (Krishna et al. 2017), which is made up of 19,290 videos with 37,417/17,505/17,031 moments of interest (MoIs) in the train/val_1/val_2 split." "Charades-STA (Gao et al. 2017) dataset contains 12,408/3,720 video-query pairs."

Dataset Splits: Yes
LLM Response: "ActivityNet Captions dataset is released in (Krishna et al. 2017), which is made up of 19,290 videos with 37,417/17,505/17,031 moments of interest (MoIs) in the train/val_1/val_2 split. We adopt standard splits, and follow the common practice of the previous works SCN (Lin et al. 2020) and RTBPN (Zhang et al. 2020b), using the val_1 split for validation and the val_2 split for testing."

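For reference, the quoted split protocol can be written down compactly. The mapping below is a hypothetical convenience that only restates the figures quoted above; neither the structure nor the key names come from the paper or the released code.

```python
# Split protocol as quoted above (figures from the paper; structure is hypothetical).
SPLITS = {
    "activitynet_captions": {   # 19,290 videos in total
        "train": 37417,         # moments of interest (MoIs)
        "val_1": 17505,         # used for validation, following SCN and RTBPN
        "val_2": 17031,         # used for testing
    },
    "charades_sta": {
        "train": 12408,         # video-query pairs
        "test": 3720,
    },
}
```
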
Hardware Specification: Yes
LLM Response: "On one NVIDIA TITAN X, we can achieve the speed of 55.8 ms per video on the ActivityNet Captions dataset, while the speed of SCN is 124 ms per video."

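Per-video latency figures such as the 55.8 ms quoted above are conventionally measured by timing forward passes with the GPU synchronized before and after. The sketch below is a generic measurement routine, not the authors' benchmark; `model` and `videos` are placeholders.

```python
# Generic per-video GPU latency measurement (illustrative; not the paper's benchmark).
import time
import torch

@torch.no_grad()
def ms_per_video(model: torch.nn.Module, videos, device: str = "cuda") -> float:
    model.eval().to(device)
    torch.cuda.synchronize(device)   # make sure no prior work skews the clock
    start = time.perf_counter()
    for clip in videos:              # `videos`: placeholder iterable of input tensors
        model(clip.to(device))
    torch.cuda.synchronize(device)   # wait for all kernels before stopping the timer
    return (time.perf_counter() - start) * 1000.0 / len(videos)
```
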
Software Dependencies: No
LLM Response: The paper mentions using the Adam optimizer but does not provide specific version numbers for software dependencies such as programming languages, frameworks, or libraries.

Experiment Setup: Yes
LLM Response: "Model Settings. For the transformers in the mask generator and the mask-conditioned reconstructor, the dimension of the hidden state is 256, the number of attention heads is 4, and the number of layers is 3. During training, we use the Adam (Kingma and Ba 2015) optimizer with the learning rate set to 0.0004. The hyperparameters β1 and β2 are set to 0.1 and 0.15, respectively, for both datasets. α is set to 5 for ActivityNet Captions and 5.5 for Charades-STA. Due to the shorter ground-truth length on Charades-STA, we limit the maximum width of the prediction to 0.45 (multiplied by Eq. (3))."

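The quoted settings map onto a few lines of standard PyTorch. The sketch below is a minimal illustration under that assumption, not the authors' implementation: every name in it (`build_encoder`, `MAX_PRED_WIDTH`, the dataset keys) is invented, and only the numeric values are taken from the response above.

```python
# Minimal sketch of the reported settings (illustrative; not the authors' code).
import torch
import torch.nn as nn

D_MODEL, N_HEADS, N_LAYERS = 256, 4, 3  # hidden size / attention heads / layers

def build_encoder() -> nn.TransformerEncoder:
    # Hypothetical builder for a transformer matching the reported dimensions.
    layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=N_HEADS)
    return nn.TransformerEncoder(layer, num_layers=N_LAYERS)

mask_generator = build_encoder()
reconstructor = build_encoder()  # the mask-conditioned reconstructor uses the same settings

optimizer = torch.optim.Adam(
    list(mask_generator.parameters()) + list(reconstructor.parameters()),
    lr=4e-4,  # learning rate from the paper
)

# Method hyperparameters as quoted above; their exact role in the loss is
# defined in the paper, not here.
BETA_1, BETA_2 = 0.1, 0.15                         # both datasets
ALPHA = {"activitynet": 5.0, "charades_sta": 5.5}
MAX_PRED_WIDTH = 0.45                              # Charades-STA only (cap applied via Eq. (3))
```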