Semantic Conditioned Dynamic Modulation for Temporal Sentence Grounding in Videos

Authors: Yitian Yuan, Lin Ma, Jingwen Wang, Wei Liu, Wenwu Zhu

NeurIPS 2019

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Extensive experiments on three public datasets demonstrate that our proposed model outperforms the state-of-the-arts with clear margins, illustrating the ability of SCDM to better associate and localize relevant video contents for temporal sentence grounding." |
| Researcher Affiliation | Collaboration | Yitian Yuan (Tsinghua-Berkeley Shenzhen Institute, Tsinghua University, yyt18@mails.tsinghua.edu.cn); Lin Ma (Tencent AI Lab, forest.linma@gmail.com); Jingwen Wang (Tencent AI Lab, jaywongjaywong@gmail.com); Wei Liu (Tencent AI Lab, wl2223@columbia.edu); Wenwu Zhu (Tsinghua University, wwzhu@tsinghua.edu.cn) |
| Pseudocode | No | The paper describes its method with mathematical equations and prose but does not provide structured pseudocode or an algorithm block. |
| Open Source Code | Yes | "Our code for this paper is available at https://github.com/yytzsy/SCDM." |
| Open Datasets | Yes | "We validate the performance of our proposed model on three public datasets for the TSG task: TACoS [24], Charades-STA [10], and ActivityNet Captions [17]." |
| Dataset Splits | No | The paper mentions "training" and "testing" but does not explicitly detail how the datasets are partitioned into training, validation, and test splits (no percentages or sample counts are given). |
| Hardware Specification | Yes | "The methods with released codes are run with one Nvidia TITAN XP GPU." |
| Software Dependencies | No | The paper names specific features and models (C3D, I3D, GloVe, bi-directional GRU) but does not give version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | "For the design of temporal convolutional layers, 6 layers with {32, 16, 8, 4, 2, 1} temporal dimensions, 6 layers with {512, 256, 128, 64, 32, 16} temporal dimensions, and 8 layers with {512, 256, 128, 64, 32, 16, 8, 4} temporal dimensions are set for Charades-STA, TACoS, and ActivityNet Captions, respectively. ... Hidden dimension of the sentence Bi-directional GRU, dimension of the multimodal fused features df, and the filter number dh for temporal convolution operations are all set as 512 in this paper. The trade-off parameters of the two loss terms λ and η are set as 100 and 10, respectively." |
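The Experiment Setup row quotes per-dataset temporal-convolution pyramids in which each layer halves the temporal dimension of the previous one. A minimal Python sketch (not the authors' released code; `temporal_pyramid`, `CONFIGS`, and `HPARAMS` are hypothetical names) reconstructing those reported settings:

```python
def temporal_pyramid(start_dim, num_layers):
    """Temporal dimensions of a conv pyramid that halves each layer."""
    return [start_dim // (2 ** i) for i in range(num_layers)]

# Per-dataset layer counts and starting temporal dimensions, as quoted
# in the paper's experiment setup.
CONFIGS = {
    "Charades-STA":         {"start": 32,  "layers": 6},
    "TACoS":                {"start": 512, "layers": 6},
    "ActivityNet Captions": {"start": 512, "layers": 8},
}

# Shared hyperparameters reported in the paper: GRU hidden size, fused
# feature dimension d_f, filter number d_h, and loss weights lambda, eta.
HPARAMS = {"gru_hidden": 512, "d_f": 512, "d_h": 512,
           "lambda": 100, "eta": 10}

for name, cfg in CONFIGS.items():
    print(name, temporal_pyramid(cfg["start"], cfg["layers"]))
```

Running the loop reproduces the three dimension lists quoted above, e.g. `[32, 16, 8, 4, 2, 1]` for Charades-STA.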