Multi-Speaker Video Dialog with Frame-Level Temporal Localization
Authors: Qiang Wang, Pin Jiang, Zhiyi Guo, Yahong Han, Zhou Zhao (pp. 12200-12207)
AAAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach for both the Multi-Speaker Video Dialog without frame-level temporal localization (MSVD w/o TL) task and the MSVD-TL task. The experimental results further demonstrate that MSVD-TL enhances the applicability of video dialog in real life. |
| Researcher Affiliation | Academia | Qiang Wang (1), Pin Jiang (1), Zhiyi Guo (1), Yahong Han (1), Zhou Zhao (2); (1) College of Intelligence and Computing, Tianjin University, Tianjin, China; (2) College of Computer Science, Zhejiang University, Hangzhou, China; {qiangw, jpin, guo_zhiyi, yahong}@tju.edu.cn, zhaozhou@zju.edu.cn |
| Pseudocode | No | No structured pseudocode or algorithm blocks were found in the paper. |
| Open Source Code | No | No explicit statement about the availability of open-source code for the described methodology or a link to a code repository was found. |
| Open Datasets | Yes | To evaluate this task, we extend the Twitch-FIFA dataset (Pasunuru and Bansal 2018) which provides collected soccer game videos along with multiple users live chat conversations about the game. |
| Dataset Splits | Yes | There are 49 game videos in total, divided into 33 videos for training, 8 for validation, and 8 for testing. Each video is several hours long, which provides a large amount of data. After processing, there are 10,510 samples in the training set, 2,153 in the validation set, and 2,780 in the test set. |
| Hardware Specification | Yes | The experimental hardware environment is an NVIDIA GTX 1080 Ti GPU. |
| Software Dependencies | No | The paper mentions software components such as Inception-v3, GloVe, LSTM, and the Adam optimizer but does not provide specific version numbers for any of them (e.g., "GloVe (Pennington, Socher, and Manning 2014)", "All RNNs in our model are bidirectional single-layer Long short-term memory networks (LSTM)"). |
| Experiment Setup | Yes | During the training phase, the embedding size of words is 100. All RNNs in our model are bidirectional single-layer Long Short-Term Memory networks (LSTM)... The size of hidden states in RNNs is 256; concatenating the two directions of each bidirectional LSTM therefore yields 512-dimensional frame features for the video fragment, word features for the chat history, and word features for the following response. We rely on the Adam (Kingma and Ba 2014) algorithm to update all parameters in our model with a learning rate of 10^-5. The experimental hardware environment is an NVIDIA GTX 1080 Ti GPU. During training, the batch size is set to 16 and the model is trained for 30,000 iterations. (A code sketch of this configuration follows the table.) |
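
The setup and split rows above are concrete enough to reconstruct as code. Below is a minimal PyTorch sketch of the reported configuration, assuming the paper's stated hyperparameters; since no code was released, the class name `BiLSTMEncoder`, the three-encoder layout, and the 2048-dimensional Inception-v3 frame features are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the reported training configuration (assumed, not the
# authors' code). `BiLSTMEncoder` and the 2048-d Inception-v3 frame-feature
# size are illustrative assumptions.
import torch
import torch.nn as nn

# Dataset statistics reported for the extended Twitch-FIFA splits.
SPLITS = {  # split -> (videos, processed samples)
    "train": (33, 10510),
    "val": (8, 2153),
    "test": (8, 2780),
}

EMBED_DIM = 100               # GloVe word-embedding size
HIDDEN_DIM = 256              # LSTM hidden size per direction
FEATURE_DIM = 2 * HIDDEN_DIM  # 512-d bidirectional features

class BiLSTMEncoder(nn.Module):
    """Bidirectional single-layer LSTM, as described for all RNNs in the model."""

    def __init__(self, input_dim: int):
        super().__init__()
        self.rnn = nn.LSTM(input_dim, HIDDEN_DIM, num_layers=1,
                           bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(x)  # (batch, time, 2 * HIDDEN_DIM) = (batch, time, 512)
        return out

# One encoder per input stream: video frames, chat history, candidate response.
frame_encoder = BiLSTMEncoder(input_dim=2048)        # Inception-v3 pooled features (assumed 2048-d)
history_encoder = BiLSTMEncoder(input_dim=EMBED_DIM)
response_encoder = BiLSTMEncoder(input_dim=EMBED_DIM)

params = (list(frame_encoder.parameters())
          + list(history_encoder.parameters())
          + list(response_encoder.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-5)  # Adam with learning rate 10^-5

BATCH_SIZE = 16
NUM_ITERATIONS = 30_000  # reported training length
```

The 512-dimensional features quoted in the setup row fall directly out of this sketch: two LSTM directions times a hidden size of 256.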