Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering
Authors: Long Hoang Dang, Thao Minh Le, Vuong Le, Truyen Tran
IJCAI 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-arts in these tasks. Analysis into the model's behavior indicates that object-oriented reasoning is a reliable, interpretable and efficient approach to Video QA. ... 4 Experiments 4.1 Datasets 4.2 Comparison Against SOTAs 4.3 Ablation Studies |
| Researcher Affiliation | Academia | Long Hoang Dang, Thao Minh Le, Vuong Le and Truyen Tran, Applied Artificial Intelligence Institute, Deakin University, Australia {hldang,lethao,vuong.le,truyen.tran}@deakin.edu.au |
| Pseudocode | No | The paper describes methods through text and diagrams (Figures 2 and 3), but does not contain structured pseudocode or algorithm blocks labeled as such. |
| Open Source Code | No | The paper does not provide a link to, or statement about, a publicly available implementation of the method. |
| Open Datasets | Yes | We evaluate our proposed HOSTR on the three public video QA benchmarks, namely, TGIF-QA [Jang et al., 2017], MSVD-QA [Xu et al., 2017] and MSRVTT-QA [Xu et al., 2017]. |
| Dataset Splits | Yes | MSVD-QA consists of 50,505 QA pairs annotated from 1,970 short video clips. The dataset covers five question types: What, Who, How, When, and Where, with 61% of the QA pairs used for training, 13% for validation and 26% for testing. MSRVTT-QA contains 10K real videos (65% for training, 5% for validation, and 30% for testing). A quick arithmetic check of the MSVD-QA split sizes is sketched below the table. |
| Hardware Specification | No | The paper does not provide specific hardware details (exact GPU/CPU models, processor types with speeds, memory amounts, or detailed computer specifications) used for running its experiments. |
| Software Dependencies | No | The paper mentions using 'Faster R-CNN' and 'GloVe embedding' but does not provide specific version numbers for these or any other software dependencies, which are necessary for reproducibility. |
| Experiment Setup | Yes | The number of object sequences per video is 40 for MSVD-QA and MSRVTT-QA, and 50 for TGIF-QA. Question words are embedded into 300-D vectors initialized with GloVe during training. Default settings use 6 GCN layers for each OSTR unit. The feature dimension d is set to 512 in all sub-networks. Cross-entropy is used as the loss function to train the model end-to-end for all tasks except counting, where Mean Squared Error is used. (These settings are collected in the configuration sketch after the table.) |
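
The split percentages quoted for MSVD-QA imply approximate partition sizes. A minimal Python check, assuming the percentages apply uniformly to all 50,505 QA pairs:

```python
# Approximate MSVD-QA partition sizes from the quoted percentages.
total_qa_pairs = 50_505
splits = {"train": 0.61, "val": 0.13, "test": 0.26}

for name, frac in splits.items():
    print(f"{name}: ~{round(total_qa_pairs * frac):,} QA pairs")
# train: ~30,808 | val: ~6,566 | test: ~13,131
```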
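
Since no official code is available, the experiment setup can only be reconstructed as a hedged sketch. The PyTorch-style snippet below collects the reported settings in one place; the names `CONFIG` and `build_loss` are illustrative, not taken from the authors:

```python
import torch.nn as nn

# Hypothetical configuration mirroring the reported experiment setup.
CONFIG = {
    "num_object_sequences": {"MSVD-QA": 40, "MSRVTT-QA": 40, "TGIF-QA": 50},
    "word_embedding_dim": 300,   # initialized from GloVe
    "num_gcn_layers": 6,         # per OSTR unit
    "feature_dim": 512,          # d, shared by all sub-networks
}

def build_loss(task: str) -> nn.Module:
    """Cross-entropy for all tasks except counting, which uses MSE."""
    return nn.MSELoss() if task == "count" else nn.CrossEntropyLoss()
```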