Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos
Authors: Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, Wenwu Zhu
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three public datasets demonstrate that our proposed model outperforms the state-of-the-arts with clear margins, illustrating the ability of SCDM to better associate and localize relevant video contents for temporal sentence grounding. |
| Researcher Affiliation | Collaboration | Yitian Yuan, Tsinghua-Berkeley Shenzhen Institute, Tsinghua University (yyt18@mails.tsinghua.edu.cn); Lin Ma, Tencent AI Lab (forest.linma@gmail.com); Jingwen Wang, Tencent AI Lab (jaywongjaywong@gmail.com); Wei Liu, Tencent AI Lab (wl2223@columbia.edu); Wenwu Zhu, Tsinghua University (wwzhu@tsinghua.edu.cn) |
| Pseudocode | No | The paper describes its method using mathematical equations and descriptions but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | Our code for this paper is available at https://github.com/yytzsy/SCDM . |
| Open Datasets | Yes | We validate the performance of our proposed model on three public datasets for the TSG task: TACoS [24], Charades-STA [10], and ActivityNet Captions [17]. |
| Dataset Splits | No | The paper mentions 'training' and 'testing' but does not explicitly detail the partitioning of datasets into distinct training, validation, and test splits with specific percentages or sample counts. |
| Hardware Specification | Yes | The methods with released codes are run with one Nvidia TITAN XP GPU. |
| Software Dependencies | No | The paper mentions using specific features and models (C3D, I3D, GloVe, Bi-directional GRU) but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | For the design of temporal convolutional layers, 6 layers with {32, 16, 8, 4, 2, 1} temporal dimensions, 6 layers with {512, 256, 128, 64, 32, 16} temporal dimensions, and 8 layers with {512, 256, 128, 64, 32, 16, 8, 4} temporal dimensions are set for Charades-STA, TACoS, and ActivityNet Captions, respectively. ... Hidden dimension of the sentence Bi-directional GRU, dimension of the multimodal fused features df, and the filter number dh for temporal convolution operations are all set as 512 in this paper. The trade-off parameters of the two loss terms λ and η are set as 100 and 10, respectively. |
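The per-dataset temporal dimensions quoted above are consistent with a stack of stride-2 temporal convolutions that halve the feature-map length at each layer. The sketch below is a minimal pure-Python illustration of that halving schedule, not the authors' implementation; the initial input lengths (64 and 1024) are assumptions inferred from the reported pyramids, not values stated in the quoted text.

```python
def pyramid_lengths(input_len, num_layers, stride=2):
    """Temporal dimension after each of `num_layers` stride-`stride`
    convolution layers, assuming each layer halves the length."""
    lengths = []
    t = input_len
    for _ in range(num_layers):
        t = t // stride
        lengths.append(t)
    return lengths

# Charades-STA: 6 layers from an assumed input length of 64
print(pyramid_lengths(64, 6))    # [32, 16, 8, 4, 2, 1]
# TACoS: 6 layers from an assumed input length of 1024
print(pyramid_lengths(1024, 6))  # [512, 256, 128, 64, 32, 16]
# ActivityNet Captions: 8 layers from an assumed input length of 1024
print(pyramid_lengths(1024, 8))  # [512, 256, 128, 64, 32, 16, 8, 4]
```

Each layer's output length doubles as the number of candidate temporal anchors at that scale, which is why the three datasets need different depths to cover their moment-length distributions.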