Video Question Answering on Screencast Tutorials

Authors: Wentian Zhao, Seokhwan Kim, Ning Xu, Hailin Jin

IJCAI 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "The experimental results demonstrate that our proposed models significantly improve the question answering performances by incorporating multi-modal contexts and domain knowledge."
Researcher Affiliation | Industry | Wentian Zhao (1), Seokhwan Kim (2), Ning Xu (1), and Hailin Jin (1). (1) Adobe Research, (2) Amazon Alexa AI. wezhao@adobe.com, seokhwk@amazon.com, {nxu, hljin}@adobe.com
Pseudocode | Yes | "Algorithm 1 describes the details of our visual cue recognition method."
Open Source Code | No | The paper states: "To download and learn more about our dataset, please see https://sites.google.com/view/pstuts-vqa/home." This link is for the dataset, not for the source code of the methodology. No other explicit statement or link for a code release is provided.
Open Datasets | Yes | "To address the proposed task, we introduce a video question answering dataset collected from screencast tutorials for an image editing software. ... To download and learn more about our dataset, please see https://sites.google.com/view/pstuts-vqa/home."
Dataset Splits | Yes | "Finally, we have 17,768 triples which were randomly divided into training, development, and test sets in Table 2."

Table 2 (Videos: # videos, lengths; QAs: # sents, # triples):

Set   | # videos | lengths | # sents | # triples
Train | 54       | 238m    | 2,660   | 12,874
Dev   | 11       | 49m     | 519     | 2,524
Test  | 11       | 46m     | 485     | 2,370
Total | 76       | 333m    | 3,664   | 17,768
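The per-split counts in Table 2 are internally consistent. A minimal Python sketch of that arithmetic check (the dictionary layout is ours, not from the paper; the numbers are copied from Table 2):

```python
# Split statistics copied from Table 2 of the paper.
splits = {
    "Train": {"videos": 54, "sents": 2660, "triples": 12874},
    "Dev":   {"videos": 11, "sents": 519,  "triples": 2524},
    "Test":  {"videos": 11, "sents": 485,  "triples": 2370},
}

# Per-split counts should add up to the published totals.
assert sum(s["videos"] for s in splits.values()) == 76
assert sum(s["sents"] for s in splits.values()) == 3664
assert sum(s["triples"] for s in splits.values()) == 17768
```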
Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments.
Software Dependencies | No | The paper mentions software such as spaCy, Kaldi, YOLO, ResNet, fastText, the Adam optimizer, DeepWalk, and skip-gram, but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | "All the models have the word embeddings initialized with the 300-dimensional pretrained fastText [Bojanowski et al., 2017] vectors on Common Crawl dataset. The convolutional layer in the question and transcript encoders learned 100 maps for each of three different filter sizes {3, 4, 5}. And we set the hidden layer dimensions for GRU to 300. For the matching component, we used dot product as a scoring function. ... The models were trained with Adam optimizer [Kingma and Ba, 2014] by minimizing the negative log likelihood loss. For training, we used mini-batch size of 128 and applied dropout on every intermediate layer with the rate of 0.5 for regularization."
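The quoted setup fully pins down the encoder and matching hyperparameters, so it maps onto a fairly standard retrieval-style QA model. Below is a minimal PyTorch sketch of that configuration (300-dim fastText embeddings, a CNN encoder with 100 maps per filter size {3, 4, 5}, a GRU with hidden size 300, dot-product matching, dropout 0.5). The module names, the GRU-over-candidate-answers layout, and the training comments are our assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """CNN text encoder: 100 feature maps for each filter size {3, 4, 5},
    max-pooled over time and concatenated into a 300-dim vector."""
    def __init__(self, emb_dim=300, n_maps=100, filter_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_maps, k) for k in filter_sizes
        )
        self.dropout = nn.Dropout(0.5)  # dropout rate stated in the paper

    def forward(self, emb):               # emb: (batch, seq_len, emb_dim)
        x = emb.transpose(1, 2)           # (batch, emb_dim, seq_len)
        pooled = [F.relu(c(x)).max(dim=2).values for c in self.convs]
        return self.dropout(torch.cat(pooled, dim=1))  # (batch, 300)

class AnswerMatcher(nn.Module):
    """Scores candidate answers against the encoded question with a dot
    product, the scoring function the paper specifies for matching.
    Encoding candidates with a GRU is an assumption of this sketch."""
    def __init__(self, fasttext_vectors):  # pretrained embedding matrix
        super().__init__()
        self.embed = nn.Embedding.from_pretrained(fasttext_vectors, freeze=False)
        self.q_enc = ConvEncoder()
        self.a_gru = nn.GRU(300, 300, batch_first=True)  # hidden size 300

    def forward(self, question, answers):  # answers: (batch, n_cand, ans_len)
        q = self.q_enc(self.embed(question))             # (batch, 300)
        b, n, l = answers.shape
        a_emb = self.embed(answers.reshape(b * n, l))
        _, h = self.a_gru(a_emb)                         # h: (1, b*n, 300)
        a = h.squeeze(0).view(b, n, 300)
        return torch.einsum("bd,bnd->bn", q, a)          # dot-product scores

# Training, per the paper: Adam, negative log likelihood loss, batch size 128.
# model = AnswerMatcher(fasttext_vectors)
# optim = torch.optim.Adam(model.parameters())
# loss = F.nll_loss(F.log_softmax(model(q_batch, a_batch), dim=1), gold_idx)
```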