Multi-Speaker Video Dialog with Frame-Level Temporal Localization
Authors: Qiang Wang, Pin Jiang, Zhiyi Guo, Yahong Han, Zhou Zhao (pp. 12200-12207)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach for both the Multi-Speaker Video Dialog without frame-level temporal localization (MSVD w/o TL) task and the MSVD-TL task. The experimental results further demonstrate that MSVD-TL enhances the applicability of video dialog in real life. |
| Researcher Affiliation | Academia | Qiang Wang (1), Pin Jiang (1), Zhiyi Guo (1), Yahong Han (1), Zhou Zhao (2); (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) College of Computer Science, Zhejiang University, Hangzhou, China; {qiangw, jpin, guo_zhiyi, yahong}@tju.edu.cn, zhaozhou@zju.edu.cn |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | No explicit statement about the availability of open-source code for the described methodology or a link to a code repository was found. |
| Open Datasets | Yes | To evaluate this task, we extend the Twitch-FIFA dataset (Pasunuru and Bansal 2018) which provides collected soccer game videos along with multiple users live chat conversations about the game. |
| Dataset Splits | Yes | There are 49 game videos in total, divided into 33 videos for training, 8 for validation, and 8 for testing. Each video is several hours long, which provides a large amount of data. After processing, there are 10,510 samples in the training set, 2,153 in the validation set, and 2,780 in the test set. |
| Hardware Specification | Yes | The experimental hardware environment is an NVIDIA GTX 1080 Ti GPU. |
| Software Dependencies | No | The paper mentions software components such as Inception-v3, GloVe, LSTM, and the Adam optimizer but does not provide specific version numbers for any of them (e.g., "GloVe (Pennington, Socher, and Manning 2014)", "All RNNs in our model are bidirectional single-layer Long short-term memory networks (LSTM)"). |
| Experiment Setup | Yes | During the training phase, the embedding size of words is 100. All RNNs in our model are bidirectional single-layer Long Short-Term Memory networks (LSTM)... The size of hidden states in RNNs is 256; concatenating the two directions of each bidirectional LSTM therefore yields 512-dimensional frame features for the video fragment, word features for the chat history, and word features for the following response. We rely on the Adam (Kingma and Ba 2014) algorithm to update all parameters in our model with a learning rate of 10^-5. The experimental hardware environment is an NVIDIA GTX 1080 Ti GPU. During training, the batch size is set to 16 and the model is trained for 30,000 iterations. (A code sketch of this configuration follows the table.) |
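
The setup and split rows above are concrete enough to reconstruct as code. Below is a minimal PyTorch sketch of the reported configuration, assuming the paper's stated hyperparameters; since no code was released, the class name `BiLSTMEncoder`, the three-encoder layout, and the 2048-dimensional Inception-v3 frame features are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the reported training configuration (assumed, not the
# authors' code). `BiLSTMEncoder` and the 2048-d Inception-v3 frame-feature
# size are illustrative assumptions.
import torch
import torch.nn as nn

# Dataset statistics reported for the extended Twitch-FIFA splits.
SPLITS = {  # split -> (videos, processed samples)
    "train": (33, 10510),
    "val": (8, 2153),
    "test": (8, 2780),
}

EMBED_DIM = 100               # GloVe word-embedding size
HIDDEN_DIM = 256              # LSTM hidden size per direction
FEATURE_DIM = 2 * HIDDEN_DIM  # 512-d bidirectional features

class BiLSTMEncoder(nn.Module):
    """Bidirectional single-layer LSTM, as described for all RNNs in the model."""

    def __init__(self, input_dim: int):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, HIDDEN_DIM, num_layers=1,
                           bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(x)  # (batch, time, 2 * HIDDEN_DIM) = (batch, time, 512)
        return out

# One encoder per input stream: video frames, chat history, candidate response.
frame_encoder = BiLSTMEncoder(input_dim=2048)        # Inception-v3 pooled features (assumed 2048-d)
history_encoder = BiLSTMEncoder(input_dim=EMBED_DIM)
response_encoder = BiLSTMEncoder(input_dim=EMBED_DIM)

params = (list(frame_encoder.parameters())
          + list(history_encoder.parameters())
          + list(response_encoder.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-5)  # Adam with learning rate 10^-5

BATCH_SIZE = 16
NUM_ITERATIONS = 30_000  # reported training length
```

The 512-dimensional features quoted in the setup row fall directly out of this sketch: two LSTM directions times a hidden size of 256.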