Localizing Natural Language in Videos

Authors: Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, Jiebo Luo

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently against the state-of-the-art approaches.
Researcher Affiliation | Collaboration | Jingyuan Chen (1), Lin Ma (2), Xinpeng Chen (2), Zequn Jie (2), Jiebo Luo (3); (1) Alibaba Group, (2) Tencent AI Lab, (3) University of Rochester
Pseudocode | No | The paper describes its methods using text and mathematical equations, but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link to the source code for their proposed L-Net model.
Open Datasets | Yes | We evaluate the proposed L-Net on two public video localization datasets (TACoS (Gao et al. 2017) and DiDeMo (Hendricks et al. 2017))... https://github.com/jiyanggao/TALL; https://github.com/LisaAnne/LocalizingMoments.
Dataset Splits | Yes | We follow the same split as in (Gao et al. 2017), which has 10146, 4589, and 4083 video-sentence pairs for training, validation, and testing respectively. ... We use the same split provided by (Hendricks et al. 2017) for a fair comparison, which has 33008, 4180, and 4022 video-sentence pairs for training, validation, and testing respectively. (These counts are captured in the sanity-check sketch below the table.)
Hardware Specification | Yes | All the experiments are conducted on a Tesla M40 GPU.
Software Dependencies | No | The paper mentions tools like Stanford CoreNLP, GloVe (for word embeddings), and the Adam optimizer, but does not provide specific version numbers for software dependencies like Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | The hidden state dimension D of all layers (including the video, sentence, and interaction GRUs) is set to 75. The mini-batch size is set to 32 for TACoS and 64 for DiDeMo. We use the Adam (Kingma and Ba 2014) optimizer with β1 = 0.5 and β2 = 0.999. The initial learning rate is set to 0.001. We train the network for 200 iterations, and the learning rate is gradually decayed over time. We use bi-directional GRUs of 3 layers to encode videos and sentences. Dropout (Srivastava et al. 2014) rates of 0.3 and 0.5 are utilized. (A hedged configuration sketch follows the table.)
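
The reported split sizes are the only reproducibility anchor for the data pipeline, so a re-implementation can sanity-check them directly. The dictionary and helper below are hypothetical and not from the paper or any released code; only the counts come from the quoted split description.

```python
# Hypothetical sanity check of the reported video-sentence pair splits.
# EXPECTED_SPLITS and check_split are illustrative names; the counts are
# the ones quoted from the paper.
EXPECTED_SPLITS = {
    "tacos":  {"train": 10146, "val": 4589, "test": 4083},
    "didemo": {"train": 33008, "val": 4180, "test": 4022},
}

def check_split(dataset: str, split: str, pairs: list) -> None:
    """Raise if a loaded split's size differs from the reported count."""
    expected = EXPECTED_SPLITS[dataset][split]
    if len(pairs) != expected:
        raise ValueError(
            f"{dataset}/{split}: got {len(pairs)} pairs, expected {expected}"
        )
```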
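The quoted hyperparameters map onto a straightforward encoder/optimizer configuration. Below is a minimal PyTorch sketch, assuming a placeholder input feature dimension `input_dim`; it illustrates the stated settings (hidden size 75, 3-layer bi-directional GRU, dropout, Adam with β1 = 0.5, β2 = 0.999, initial learning rate 0.001) and is not the authors' actual L-Net code, which was not released.

```python
import torch
import torch.nn as nn

input_dim = 300   # hypothetical feature dimension; not specified here
hidden_dim = 75   # "hidden state dimension D of all layers is set to 75"

# 3-layer bi-directional GRU encoder with inter-layer dropout of 0.3
# (the paper also reports a 0.5 dropout rate, applied elsewhere).
encoder = nn.GRU(
    input_size=input_dim,
    hidden_size=hidden_dim,
    num_layers=3,
    bidirectional=True,
    dropout=0.3,
    batch_first=True,
)

# Adam with the reported betas and initial learning rate.
optimizer = torch.optim.Adam(
    encoder.parameters(), lr=1e-3, betas=(0.5, 0.999)
)

# The paper only says the learning rate is "gradually decayed over time";
# an exponential schedule is assumed here purely for illustration.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
```

Per the quoted setup, training would use mini-batches of 32 (TACoS) or 64 (DiDeMo) video-sentence pairs; the exact decay schedule and where each dropout rate applies are left unspecified in the paper.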