Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Authors: Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran

IJCAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-arts in these tasks. Analysis into the model's behavior indicates that object-oriented reasoning is a reliable, interpretable and efficient approach to Video QA. ... 4 Experiments; 4.1 Datasets; 4.2 Comparison Against SOTAs; 4.3 Ablation Studies
Researcher Affiliation | Academia | Long Hoang Dang, Thao Minh Le, Vuong Le and Truyen Tran, Applied Artificial Intelligence Institute, Deakin University, Australia; {hldang,lethao,vuong.le,truyen.tran}@deakin.edu.au
Pseudocode | No | The paper describes its methods through text and diagrams (Figures 2 and 3), but does not contain structured pseudocode or algorithm blocks labeled as such.
Open Source Code | No | The paper mentions ...
Open Datasets | Yes | We evaluate our proposed HOSTR on the three public video QA benchmarks, namely, TGIF-QA [Jang et al., 2017], MSVD-QA [Xu et al., 2017] and MSRVTT-QA [Xu et al., 2017].
Dataset Splits | Yes | MSVD-QA consists of 50,505 QA pairs annotated from 1,970 short video clips. The dataset covers five question types: What, Who, How, When, and Where, with 61% of the QA pairs used for training, 13% for validation and 26% for testing. MSRVTT-QA contains 10K real videos (65% for training, 5% for validation, and 30% for testing). (A quick arithmetic check of these splits appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, clock speeds, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using 'Faster R-CNN' and 'GloVe embedding' but does not provide specific version numbers for these or any other software dependencies, which are necessary for reproducibility. (An environment-logging sketch appears after the table.)
Experiment Setup | Yes | The number of object sequences per video is 40 for MSVD-QA and MSRVTT-QA, and 50 for TGIF-QA. We embed question words into 300-D vectors and initialize them with GloVe during training. Default settings are with 6 GCN layers for each OSTR unit. The feature dimension d is set to 512 in all sub-networks. We use cross-entropy as the loss function to train the model end to end for all tasks except counting, where mean squared error is used. (A configuration sketch follows the table.)
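
As a sanity check on the Dataset Splits row, the reported percentages can be turned into approximate example counts with simple arithmetic. A minimal sketch in Python; the counts are estimates, since the paper reports rounded percentages and the official splits may differ slightly:

```python
# Approximate split sizes implied by the reported percentages.
# These are estimates only: the paper rounds the percentages,
# so the official split counts may differ by a small margin.

def split_sizes(total: int, fractions: dict[str, float]) -> dict[str, int]:
    """Distribute `total` items according to rounded fractions."""
    return {name: round(total * frac) for name, frac in fractions.items()}

# MSVD-QA: 50,505 QA pairs at 61% / 13% / 26% (from the paper).
msvd = split_sizes(50_505, {"train": 0.61, "val": 0.13, "test": 0.26})
print(msvd)  # -> {'train': 30808, 'val': 6566, 'test': 13131} (approx.)
```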
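For the missing Software Dependencies, a re-implementation can at least record its own environment so results remain attributable. A minimal sketch, assuming a typical PyTorch stack (torch, plus torchvision for a Faster R-CNN detector); the paper itself names no versions, so these package choices are assumptions:

```python
# Log library versions so a re-run can be tied to a concrete environment.
# The torch/torchvision choice is an assumption about a typical
# re-implementation, not the authors' documented stack.
import sys
import torch
import torchvision

print("python     :", sys.version.split()[0])
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("cuda       :", torch.version.cuda)
```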
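The Experiment Setup row fixes most hyperparameters. Below is a minimal sketch of how they might be gathered in a re-implementation; the names HOSTRConfig and make_loss are hypothetical, and only the numeric values quoted above come from the paper:

```python
from dataclasses import dataclass, field

import torch.nn as nn

@dataclass
class HOSTRConfig:                      # hypothetical name
    word_embed_dim: int = 300           # GloVe-initialized word vectors
    feature_dim: int = 512              # d, shared by all sub-networks
    gcn_layers: int = 6                 # default per OSTR unit
    # Object sequences per video, as reported in the paper.
    num_objects: dict = field(default_factory=lambda: {
        "MSVD-QA": 40, "MSRVTT-QA": 40, "TGIF-QA": 50,
    })

def make_loss(task: str) -> nn.Module:
    """Cross-entropy for all tasks except counting, which uses MSE."""
    return nn.MSELoss() if task == "count" else nn.CrossEntropyLoss()
```

The two-branch loss mirrors the quoted setup: counting is scored with mean squared error, while every other task is trained as classification with cross-entropy.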