Weakly Supervised Video Moment Localization with Contrastive Negative Sample Mining
Authors: Minghang Zheng, Yanjie Huang, Qingchao Chen, Yang Liu
AAAI 2022, pp. 3517-3525 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on two datasets show the effectiveness of our method. Code can be found at https://github.com/minghangz/cnm. ... To test the effectiveness of our proposed method, we perform experiments on two publicly available datasets, ActivityNet Captions (Caba Heilbron et al. 2015; Krishna et al. 2017) and Charades-STA (Gao et al. 2017). |
| Researcher Affiliation | Academia | Wangxuan Institute of Computer Technology, Peking University; National Institute of Health Data Science, Peking University; Beijing Institute for General Artificial Intelligence |
| Pseudocode | No | The paper describes the method using mathematical equations and diagrams, but does not provide structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code can be found at https://github.com/minghangz/cnm. |
| Open Datasets | Yes | ActivityNet Captions. The ActivityNet Captions dataset is released in (Krishna et al. 2017) and is made up of 19,290 videos with 37,417/17,505/17,031 moments of interest (MoIs) in the train/val_1/val_2 splits. Charades-STA. The Charades-STA (Gao et al. 2017) dataset contains 12,408/3,720 video-query pairs. |
| Dataset Splits | Yes | ActivityNet Captions. The ActivityNet Captions dataset is released in (Krishna et al. 2017) and is made up of 19,290 videos with 37,417/17,505/17,031 moments of interest (MoIs) in the train/val_1/val_2 splits. We adopt the standard splits and, following the common practice of the previous works SCN (Lin et al. 2020) and RTBPN (Zhang et al. 2020b), use the val_1 split for validation and the val_2 split for testing. |
| Hardware Specification | Yes | On one NVIDIA TITAN X, we can achieve the speed of 55.8 ms per video on the ActivityNet Captions dataset, while the speed of SCN is 124 ms per video. |
| Software Dependencies | No | The paper mentions using the Adam optimizer but does not provide specific version numbers for software dependencies such as programming languages, frameworks, or libraries. |
| Experiment Setup | Yes | Model Settings. For the transformer in the mask generator and the mask-conditioned reconstructor, the dimension of their hidden state is 256, the number of attention heads is 4, and the number of layers is 3. During training, we use the Adam optimizer (Kingma and Ba 2015) with the learning rate set to 0.0004. The hyperparameters β1, β2 are set to 0.1 and 0.15 respectively for both datasets. α is set to 5 for ActivityNet Captions and 5.5 for Charades-STA. Due to the shorter ground-truth lengths on Charades-STA, we limit the maximum width of the prediction to 0.45 (applied to the width in Eq. (3)). |
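
The reported model settings translate into a small configuration sketch, shown below. This is a hedged illustration only: the builder names, the grouping of parameters into a single optimizer, and the interpretation of β1, β2, α, and the width cap are assumptions made for clarity; the authors' actual implementation is in the linked repository (https://github.com/minghangz/cnm).

```python
import torch
import torch.nn as nn

# Reported transformer settings for the mask generator and the
# mask-conditioned reconstructor (values taken from the paper).
HIDDEN_DIM = 256   # hidden state dimension
NUM_HEADS = 4      # attention heads
NUM_LAYERS = 3     # transformer layers

def build_transformer() -> nn.TransformerEncoder:
    """Small transformer encoder matching the reported settings."""
    layer = nn.TransformerEncoderLayer(
        d_model=HIDDEN_DIM, nhead=NUM_HEADS, batch_first=True
    )
    return nn.TransformerEncoder(layer, num_layers=NUM_LAYERS)

# Placeholder modules; the real model wraps these with projection and
# reconstruction heads not shown here.
mask_generator = build_transformer()
reconstructor = build_transformer()

# Adam optimizer with the reported learning rate of 0.0004.
optimizer = torch.optim.Adam(
    list(mask_generator.parameters()) + list(reconstructor.parameters()),
    lr=4e-4,
)

# Reported hyperparameters; their exact role (loss weights, Gaussian mask
# width scale) is an assumption in this sketch.
beta1, beta2 = 0.1, 0.15
alpha = 5.0            # 5.5 for Charades-STA
max_pred_width = 0.45  # Charades-STA only, cap on the predicted width in Eq. (3)
```

In this sketch the two transformers share the same size, which matches the single set of hidden-state/head/layer values the paper reports; whether they share weights or further heads is not specified above and is left open here.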