Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering

Authors: Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran

IJCAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-arts in these tasks. Analysis into the model's behavior indicates that object-oriented reasoning is a reliable, interpretable and efficient approach to Video QA. ... 4 Experiments; 4.1 Datasets; 4.2 Comparison Against SOTAs; 4.3 Ablation Studies
Researcher Affiliation | Academia | Long Hoang Dang, Thao Minh Le, Vuong Le and Truyen Tran, Applied Artificial Intelligence Institute, Deakin University, Australia; {hldang,lethao,vuong.le,truyen.tran}@deakin.edu.au
Pseudocode | No | The paper describes its methods through text and diagrams (Figures 2 and 3), but does not contain structured pseudocode or algorithm blocks labeled as such.
Open Source Code | No | The paper mentions ...
Open Datasets | Yes | We evaluate our proposed HOSTR on the three public video QA benchmarks, namely, TGIF-QA [Jang et al., 2017], MSVD-QA [Xu et al., 2017] and MSRVTT-QA [Xu et al., 2017].
Dataset Splits | Yes | MSVD-QA consists of 50,505 QA pairs annotated from 1,970 short video clips. The dataset covers five question types: What, Who, How, When, and Where, with 61% of the QA pairs used for training, 13% for validation and 26% for testing. MSRVTT-QA contains 10K real videos (65% for training, 5% for validation, and 30% for testing). (A quick arithmetic check of these splits appears after the table.)
Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, clock speeds, memory amounts, or other machine specifications) used for running its experiments.
Software Dependencies | No | The paper mentions using 'Faster R-CNN' and 'GloVe embedding' but does not provide specific version numbers for these or any other software dependencies, which are necessary for reproducibility. (An environment-logging sketch appears after the table.)
Experiment Setup | Yes | The number of object sequences per video is 40 for MSVD-QA and MSRVTT-QA, and 50 for TGIF-QA. We embed question words into 300-D vectors and initialize them with GloVe during training. Default settings are with 6 GCN layers for each OSTR unit. The feature dimension d is set to 512 in all sub-networks. We use cross-entropy as the loss function to train the model end to end for all tasks except counting, where mean squared error is used. (A configuration sketch follows the table.)
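
As a sanity check on the Dataset Splits row, the reported percentages can be turned into approximate example counts with simple arithmetic. A minimal sketch in Python; the counts are estimates, since the paper reports rounded percentages and the official splits may differ slightly:

```python
# Approximate split sizes implied by the reported percentages.
# These are estimates only: the paper rounds the percentages,
# so the official split counts may differ by a small margin.

def split_sizes(total: int, fractions: dict[str, float]) -> dict[str, int]:
    """Distribute `total` items according to rounded fractions."""
    return {name: round(total * frac) for name, frac in fractions.items()}

# MSVD-QA: 50,505 QA pairs at 61% / 13% / 26% (from the paper).
msvd = split_sizes(50_505, {"train": 0.61, "val": 0.13, "test": 0.26})
print(msvd)  # -> {'train': 30808, 'val': 6566, 'test': 13131} (approx.)
```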
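For the missing Software Dependencies, a re-implementation can at least record its own environment so results remain attributable. A minimal sketch, assuming a typical PyTorch stack (torch, plus torchvision for a Faster R-CNN detector); the paper itself names no versions, so these package choices are assumptions:

```python
# Log library versions so a re-run can be tied to a concrete environment.
# The torch/torchvision choice is an assumption about a typical
# re-implementation, not the authors' documented stack.
import sys
import torch
import torchvision

print("python     :", sys.version.split()[0])
print("torch      :", torch.__version__)
print("torchvision:", torchvision.__version__)
print("cuda       :", torch.version.cuda)
```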
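The Experiment Setup row fixes most hyperparameters. Below is a minimal sketch of how they might be gathered in a re-implementation; the names HOSTRConfig and make_loss are hypothetical, and only the numeric values quoted above come from the paper:

```python
from dataclasses import dataclass, field

import torch.nn as nn

@dataclass
class HOSTRConfig:                      # hypothetical name
    word_embed_dim: int = 300           # GloVe-initialized word vectors
    feature_dim: int = 512              # d, shared by all sub-networks
    gcn_layers: int = 6                 # default per OSTR unit
    # Object sequences per video, as reported in the paper.
    num_objects: dict = field(default_factory=lambda: {
        "MSVD-QA": 40, "MSRVTT-QA": 40, "TGIF-QA": 50,
    })

def make_loss(task: str) -> nn.Module:
    """Cross-entropy for all tasks except counting, which uses MSE."""
    return nn.MSELoss() if task == "count" else nn.CrossEntropyLoss()
```

The two-branch loss mirrors the quoted setup: counting is scored with mean squared error, while every other task is trained as classification with cross-entropy.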