Localizing Natural Language in Videos
Authors: Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, Jiebo Luo (pp. 8175–8182)
AAAI 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently against the state-of-the-art approaches. |
| Researcher Affiliation | Collaboration | Jingyuan Chen (Alibaba Group), Lin Ma, Xinpeng Chen, Zequn Jie (Tencent AI Lab), Jiebo Luo (University of Rochester) |
| Pseudocode | No | The paper describes its methods using text and mathematical equations, but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide an explicit statement or link to the source code for their proposed L-Net model. |
| Open Datasets | Yes | We evaluate the proposed L-Net on two public video localization datasets (TACoS (Gao et al. 2017) and DiDeMo (Hendricks et al. 2017))... https://github.com/jiyanggao/TALL; https://github.com/LisaAnne/LocalizingMoments |
| Dataset Splits | Yes | We follow the same split as in (Gao et al. 2017), which has 10146, 4589, and 4083 video-sentence pairs for training, validation, and testing respectively. ... We use the same split provided by (Hendricks et al. 2017) for a fair comparison, which has 33008, 4180, and 4022 video-sentence pairs for training, validation, and testing respectively. |
| Hardware Specification | Yes | All the experiments are conducted on a Tesla M40 GPU. |
| Software Dependencies | No | The paper mentions tools like Stanford CoreNLP, GloVe (for word embeddings), and the Adam optimizer, but does not provide specific version numbers for software dependencies such as Python, PyTorch/TensorFlow, or CUDA. |
| Experiment Setup | Yes | The hidden state dimension D of all layers (including the video, sentence, and interaction GRUs) is set to 75. The mini-batch size is set to 32 for TACoS and 64 for DiDeMo. We use the Adam (Kingma and Ba 2014) optimizer with β1 = 0.5 and β2 = 0.999. The initial learning rate is set to 0.001. We train the network for 200 iterations, and the learning rate is gradually decayed over time. We use bi-directional GRUs of 3 layers to encode videos and sentences. Dropout (Srivastava et al. 2014) rates of 0.3 and 0.5 are used. (A hedged configuration sketch follows the table.) |
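
The experiment-setup row above maps directly onto a standard deep-learning training configuration. Below is a minimal PyTorch sketch of those hyperparameters. The paper does not release code, so only the numeric values come from the text; the module layout, the input feature size, the assignment of the two dropout rates, and the exact decay schedule are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

# Hyperparameter values quoted from the paper's experiment setup.
HIDDEN_DIM = 75              # hidden state dimension D of all GRU layers
NUM_LAYERS = 3               # bi-directional GRUs of 3 layers
DROPOUT_RATES = (0.3, 0.5)   # both rates are reported; which rate applies
                             # where is not specified, so the use of 0.3
                             # below is an assumption
INPUT_DIM = 500              # placeholder feature size, not from the paper

# Bi-directional 3-layer GRU of the kind the paper uses to encode the
# video and sentence streams.
video_encoder = nn.GRU(
    input_size=INPUT_DIM,
    hidden_size=HIDDEN_DIM,
    num_layers=NUM_LAYERS,
    bidirectional=True,
    dropout=DROPOUT_RATES[0],
    batch_first=True,
)

# Adam with the reported betas and initial learning rate.
optimizer = torch.optim.Adam(
    video_encoder.parameters(),  # in practice, all L-Net parameters
    lr=1e-3,
    betas=(0.5, 0.999),
)

# "The learning rate is gradually decayed over time": the exact schedule
# is unspecified, so exponential decay is one plausible stand-in.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

# Reported mini-batch sizes per dataset.
BATCH_SIZE = {"TACoS": 32, "DiDeMo": 64}
```

Note the non-default β1 = 0.5 (PyTorch's default is 0.9), which the paper states explicitly and which is reflected in the `betas` argument above.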