Video Question Answering on Screencast Tutorials

Authors: Wentian Zhao, Seokhwan Kim, Ning Xu, Hailin Jin

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The experimental results demonstrate that our proposed models significantly improve the question answering performances by incorporating multi-modal contexts and domain knowledge."
Researcher Affiliation | Industry | Wentian Zhao (1), Seokhwan Kim (2), Ning Xu (1), and Hailin Jin (1). (1) Adobe Research, (2) Amazon Alexa AI. wezhao@adobe.com, seokhwk@amazon.com, {nxu, hljin}@adobe.com
Pseudocode | Yes | "Algorithm 1 describes the details of our visual cue recognition method."
Open Source Code | No | The paper states: "To download and learn more about our dataset, please see https://sites.google.com/view/pstuts-vqa/home." This link is for the dataset, not for the source code of the methodology. No other explicit statement or link for a code release is provided.
Open Datasets | Yes | "To address the proposed task, we introduce a video question answering dataset collected from screencast tutorials for an image editing software. ... To download and learn more about our dataset, please see https://sites.google.com/view/pstuts-vqa/home."
Dataset Splits | Yes | "Finally, we have 17,768 triples which were randomly divided into training, development, and test sets in Table 2."

Table 2 (Videos: # videos, lengths; QAs: # sents, # triples):

Set   | # videos | lengths | # sents | # triples
Train | 54       | 238m    | 2,660   | 12,874
Dev   | 11       | 49m     | 519     | 2,524
Test  | 11       | 46m     | 485     | 2,370
Total | 76       | 333m    | 3,664   | 17,768
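The per-split counts in Table 2 are internally consistent. A minimal Python sketch of that arithmetic check (the dictionary layout is ours, not from the paper; the numbers are copied from Table 2):

```python
# Split statistics copied from Table 2 of the paper.
splits = {
    "Train": {"videos": 54, "sents": 2660, "triples": 12874},
    "Dev":   {"videos": 11, "sents": 519,  "triples": 2524},
    "Test":  {"videos": 11, "sents": 485,  "triples": 2370},
}

# Per-split counts should add up to the published totals.
assert sum(s["videos"] for s in splits.values()) == 76
assert sum(s["sents"] for s in splits.values()) == 3664
assert sum(s["triples"] for s in splits.values()) == 17768
```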
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions software such as spaCy, Kaldi, YOLO, ResNet, fastText, the Adam optimizer, DeepWalk, and skip-gram, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "All the models have the word embeddings initialized with the 300-dimensional pretrained fastText [Bojanowski et al., 2017] vectors on Common Crawl dataset. The convolutional layer in the question and transcript encoders learned 100 maps for each of three different filter sizes {3, 4, 5}. And we set the hidden layer dimensions for GRU to 300. For the matching component, we used dot product as a scoring function. ... The models were trained with Adam optimizer [Kingma and Ba, 2014] by minimizing the negative log likelihood loss. For training, we used mini-batch size of 128 and applied dropout on every intermediate layer with the rate of 0.5 for regularization."
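The quoted setup fully pins down the encoder and matching hyperparameters, so it maps onto a fairly standard retrieval-style QA model. Below is a minimal PyTorch sketch of that configuration (300-dim fastText embeddings, a CNN encoder with 100 maps per filter size {3, 4, 5}, a GRU with hidden size 300, dot-product matching, dropout 0.5). The module names, the GRU-over-candidate-answers layout, and the training comments are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """CNN text encoder: 100 feature maps for each filter size {3, 4, 5},
    max-pooled over time and concatenated into a 300-dim vector."""
    def __init__(self, emb_dim=300, n_maps=100, filter_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_maps, k) for k in filter_sizes
        )
        self.dropout = nn.Dropout(0.5)  # dropout rate stated in the paper

    def forward(self, emb):               # emb: (batch, seq_len, emb_dim)
        x = emb.transpose(1, 2)           # (batch, emb_dim, seq_len)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.dropout(torch.cat(pooled, dim=1))  # (batch, 300)

class AnswerMatcher(nn.Module):
    """Scores candidate answers against the encoded question with a dot
    product, the scoring function the paper specifies for matching.
    Encoding candidates with a GRU is an assumption of this sketch."""
    def __init__(self, fasttext_vectors):  # pretrained embedding matrix
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(fasttext_vectors, freeze=False)
        self.q_enc = ConvEncoder()
        self.a_gru = nn.GRU(300, 300, batch_first=True)  # hidden size 300

    def forward(self, question, answers):  # answers: (batch, n_cand, ans_len)
        q = self.q_enc(self.embed(question))             # (batch, 300)
        b, n, l = answers.shape
        a_emb = self.embed(answers.reshape(b * n, l))
        _, h = self.a_gru(a_emb)                         # h: (1, b*n, 300)
        a = h.squeeze(0).view(b, n, 300)
        return torch.einsum("bd,bnd->bn", q, a)          # dot-product scores

# Training, per the paper: Adam, negative log likelihood loss, batch size 128.
# model = AnswerMatcher(fasttext_vectors)
# optim = torch.optim.Adam(model.parameters())
# loss = F.nll_loss(F.log_softmax(model(q_batch, a_batch), dim=1), gold_idx)
```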