Localizing Natural Language in Videos

Authors: Jingyuan Chen, Lin Ma, Xinpeng Chen, Zequn Jie, Jiebo Luo

AAAI 2019 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments conducted on the public TACoS and DiDeMo datasets demonstrate that our proposed model performs effectively and efficiently against the state-of-the-art approaches.
Researcher Affiliation | Collaboration | Jingyuan Chen (1), Lin Ma (2), Xinpeng Chen (2), Zequn Jie (2), Jiebo Luo (3); (1) Alibaba Group, (2) Tencent AI Lab, (3) University of Rochester
Pseudocode | No | The paper describes its methods using text and mathematical equations, but does not include structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not provide an explicit statement or link to the source code for their proposed L-Net model.
Open Datasets | Yes | We evaluate the proposed L-Net on two public video localization datasets (TACoS (Gao et al. 2017) and DiDeMo (Hendricks et al. 2017))... https://github.com/jiyanggao/TALL; https://github.com/LisaAnne/LocalizingMoments.
Dataset Splits | Yes | We follow the same split as in (Gao et al. 2017), which has 10146, 4589, and 4083 video-sentence pairs for training, validation, and testing respectively. ... We use the same split provided by (Hendricks et al. 2017) for a fair comparison, which has 33008, 4180, and 4022 video-sentence pairs for training, validation, and testing respectively. (These counts are captured in the sanity-check sketch below the table.)
Hardware Specification | Yes | All the experiments are conducted on a Tesla M40 GPU.
Software Dependencies | No | The paper mentions tools like Stanford CoreNLP, GloVe (for word embeddings), and the Adam optimizer, but does not provide specific version numbers for software dependencies like Python, PyTorch/TensorFlow, or CUDA.
Experiment Setup | Yes | The hidden state dimension D of all layers (including the video, sentence, and interaction GRUs) is set to 75. The mini-batch size is set to 32 for TACoS and 64 for DiDeMo. We use the Adam (Kingma and Ba 2014) optimizer with β1 = 0.5 and β2 = 0.999. The initial learning rate is set to 0.001. We train the network for 200 iterations, and the learning rate is gradually decayed over time. We use bi-directional GRUs of 3 layers to encode videos and sentences. Dropout (Srivastava et al. 2014) rates of 0.3 and 0.5 are utilized. (A hedged configuration sketch follows the table.)
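
The reported split sizes are the only reproducibility anchor for the data pipeline, so a re-implementation can sanity-check them directly. The dictionary and helper below are hypothetical and not from the paper or any released code; only the counts come from the quoted split description.

```python
# Hypothetical sanity check of the reported video-sentence pair splits.
# EXPECTED_SPLITS and check_split are illustrative names; the counts are
# the ones quoted from the paper.
EXPECTED_SPLITS = {
    "tacos":  {"train": 10146, "val": 4589, "test": 4083},
    "didemo": {"train": 33008, "val": 4180, "test": 4022},
}

def check_split(dataset: str, split: str, pairs: list) -> None:
    """Raise if a loaded split's size differs from the reported count."""
    expected = EXPECTED_SPLITS[dataset][split]
    if len(pairs) != expected:
        raise ValueError(
            f"{dataset}/{split}: got {len(pairs)} pairs, expected {expected}"
        )
```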
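The quoted hyperparameters map onto a straightforward encoder/optimizer configuration. Below is a minimal PyTorch sketch, assuming a placeholder input feature dimension `input_dim`; it illustrates the stated settings (hidden size 75, 3-layer bi-directional GRU, dropout, Adam with β1 = 0.5, β2 = 0.999, initial learning rate 0.001) and is not the authors' actual L-Net code, which was not released.

```python
import torch
import torch.nn as nn

input_dim = 300   # hypothetical feature dimension; not specified here
hidden_dim = 75   # "hidden state dimension D of all layers is set to 75"

# 3-layer bi-directional GRU encoder with inter-layer dropout of 0.3
# (the paper also reports a 0.5 dropout rate, applied elsewhere).
encoder = nn.GRU(
    input_size=input_dim,
    hidden_size=hidden_dim,
    num_layers=3,
    bidirectional=True,
    dropout=0.3,
    batch_first=True,
)

# Adam with the reported betas and initial learning rate.
optimizer = torch.optim.Adam(
    encoder.parameters(), lr=1e-3, betas=(0.5, 0.999)
)

# The paper only says the learning rate is "gradually decayed over time";
# an exponential schedule is assumed here purely for illustration.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)
```

Per the quoted setup, training would use mini-batches of 32 (TACoS) or 64 (DiDeMo) video-sentence pairs; the exact decay schedule and where each dropout rate applies are left unspecified in the paper.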