Video Question Answering on Screencast Tutorials
Authors: Wentian Zhao, Seokhwan Kim, Ning Xu, Hailin Jin
IJCAI 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results demonstrate that our proposed models significantly improve the question answering performances by incorporating multi-modal contexts and domain knowledge. |
| Researcher Affiliation | Industry | Wentian Zhao¹, Seokhwan Kim², Ning Xu¹, and Hailin Jin¹ (¹Adobe Research, ²Amazon Alexa AI); wezhao@adobe.com, seokhwk@amazon.com, {nxu, hljin}@adobe.com |
| Pseudocode | Yes | Algorithm 1 describes the details of our visual cue recognition method. |
| Open Source Code | No | The paper states: 'To download and learn more about our dataset, please see https://sites.google.com/view/pstuts-vqa/home.' This link is for the dataset, not for the source code of the methodology. No other explicit statement or link for code release is provided. |
| Open Datasets | Yes | To address the proposed task, we introduce a video question answering dataset collected from screencast tutorials for an image editing software. ... To download and learn more about our dataset, please see https://sites.google.com/view/pstuts-vqa/home. |
| Dataset Splits | Yes | Finally, we have 17,768 triples which were randomly divided into training, development, and test sets in Table 2. ... Table 2 reports, per set: Train 54 videos / 238m / 2,660 sents / 12,874 triples; Dev 11 videos / 49m / 519 sents / 2,524 triples; Test 11 videos / 46m / 485 sents / 2,370 triples; Total 76 videos / 333m / 3,664 sents / 17,768 triples. (A split sketch follows the table below.) |
| Hardware Specification | No | The paper does not provide specific hardware details such as exact GPU/CPU models, processor types, or memory amounts used for running experiments. |
| Software Dependencies | No | The paper mentions software like 'spaCy', 'Kaldi', 'YOLO', 'ResNet', 'fastText', 'Adam optimizer', 'DeepWalk', and 'skip-gram', but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | All the models have the word embeddings initialized with the 300-dimensional pretrained fastText [Bojanowski et al., 2017] vectors on Common Crawl dataset. The convolutional layer in the question and transcript encoders learned 100 maps for each of three different filter sizes {3, 4, 5}. And we set the hidden layer dimensions for GRU to 300. For the matching component, we used dot product as a scoring function. ... The models were trained with Adam optimizer [Kingma and Ba, 2014] by minimizing the negative log likelihood loss. For training, we used mini-batch size of 128 and applied dropout on every intermediate layer with the rate of 0.5 for regularization. (A hedged configuration sketch follows the table below.) |
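The split reported in the Dataset Splits row can be illustrated with a short sketch. This is not the authors' released code; it only mirrors the 54/11/11 video counts from Table 2, and the function and variable names (`split_videos`, `video_ids`, the fixed seed) are hypothetical.

```python
import random

# Minimal sketch of a video-level train/dev/test split matching the
# 54/11/11 video counts reported in Table 2. Illustrative only; the
# paper does not release its splitting script.

def split_videos(video_ids, n_train=54, n_dev=11, n_test=11, seed=0):
    """Randomly partition video IDs into train/dev/test sets."""
    assert len(video_ids) == n_train + n_dev + n_test
    rng = random.Random(seed)  # hypothetical fixed seed for repeatability
    shuffled = video_ids[:]
    rng.shuffle(shuffled)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])

train_ids, dev_ids, test_ids = split_videos([f"video_{i}" for i in range(76)])
```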
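The quoted Experiment Setup row pins down most of the text-encoder hyperparameters: 300-dimensional fastText-initialized embeddings, 100 feature maps per filter size in {3, 4, 5}, dot-product matching, Adam, negative log likelihood loss, mini-batch size 128, and dropout 0.5. The following PyTorch sketch wires those numbers together under stated assumptions: it is an illustrative CNN sentence encoder, not the paper's implementation; the class name `CNNTextEncoder`, the vocabulary size, and the in-batch-negatives training step are assumptions, and the 300-dimensional GRU transcript encoder is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hedged sketch of the quoted text-encoder hyperparameters. Illustrative
# only; names and the in-batch negative-sampling scheme are assumptions.

class CNNTextEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=300, n_maps=100,
                 filter_sizes=(3, 4, 5), dropout=0.5):
        super().__init__()
        # In the paper this embedding is initialized from fastText vectors.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_maps, k) for k in filter_sizes])
        self.dropout = nn.Dropout(dropout)

    def forward(self, token_ids):                        # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)    # (batch, emb, seq)
        # Max-pool each feature map over time, then concatenate:
        # 3 filter sizes x 100 maps -> a 300-d sentence vector.
        feats = [F.relu(conv(x)).max(dim=2).values for conv in self.convs]
        return self.dropout(torch.cat(feats, dim=1))     # (batch, 300)

# One training step with dot-product scoring and NLL loss over in-batch
# candidates (an assumed scheme), using Adam and a mini-batch of 128.
encoder = CNNTextEncoder(vocab_size=10000)
optimizer = torch.optim.Adam(encoder.parameters())
optimizer.zero_grad()
q = encoder(torch.randint(0, 10000, (128, 20)))   # question batch
a = encoder(torch.randint(0, 10000, (128, 20)))   # candidate-answer batch
scores = q @ a.t()                                # dot-product scoring
loss = F.cross_entropy(scores, torch.arange(128)) # log-softmax + NLL
loss.backward()
optimizer.step()
```

Note that three filter sizes times 100 maps yields a 300-dimensional sentence vector, matching the 300-dimensional GRU hidden size the quote specifies for the transcript encoder.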