(2.5+1)D Spatio-Temporal Scene Graphs for Video Question Answering
Authors: Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To demonstrate the effectiveness of our approach, we present experiments on the NExT-QA and AVSD-QA datasets. Our results show that our proposed (2.5+1)D representation leads to faster training and inference, while our hierarchical model showcases superior performance on the video QA task versus the state of the art. In this section, we provide experiments demonstrating the empirical benefits of our proposed representation and inference pipeline. |
| Researcher Affiliation | Industry | Anoop Cherian, Chiori Hori, Tim K. Marks, Jonathan Le Roux Mitsubishi Electric Research Labs (MERL), Cambridge, MA {cherian, chori, tmarks, leroux}@merl.com |
| Pseudocode | Yes | Algorithm 1: Identifying common ancestors for merging |
| Open Source Code | No | The paper mentions using code provided by other authors (e.g., 'We use the code provided by the authors of (Xiao et al. 2021)', 'we used an implementation that is shared by the authors of (Geng et al. 2021)'), but does not explicitly state that their own source code for the described methodology is publicly available or provide a link to it. |
| Open Datasets | Yes | We used two recent video QA datasets for evaluating our task, namely NExT-QA (Xiao et al. 2021) and AVSD-QA (Alamri et al. 2019a). |
| Dataset Splits | Yes | NExT-QA Dataset ... consists of 3,870 training, 570 validation, and 1,000 test videos. The dataset provides 34,132, 4,996, and 8,564 multiple choice questions in the training, validation, and test sets respectively... AVSD-QA ... to use 7,985, 1,863, and 1,968 clips for training, validation, and test. |
| Hardware Specification | Yes | our experiments show that the time taken for every training iteration in this case slows down 4-fold (from 1.5 s per iteration to 6 s on a single RTX6000 GPU). |
| Software Dependencies | No | The paper mentions specific models and frameworks (e.g., 'Faster R-CNN', 'MiDaS model', 'I3D action recognition neural network', 'BERT features') but does not provide version numbers for these or any other ancillary software components (e.g., Python, PyTorch versions). |
| Experiment Setup | Yes | For NExT-QA, we used a learning rate of 5e-5 as suggested in the paper with a batch size of 64 and trained for 50 epochs, while AVSD-QA used a learning rate of 1e-3 and a batch size of 100, and trained for 20 epochs. ... For the Transformer, we used a 4-headed attention for NExT-QA, and a 2-headed attention for AVSD-QA. |
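For quick reference, the quoted hyperparameters can be gathered into a single configuration sketch. This is a minimal illustration in Python; the dictionary layout and key names are our own assumptions, and only the numeric values (learning rates, batch sizes, epoch counts, attention heads) come from the setup reported in the table above.

```python
# Training hyperparameters as reported for the two benchmarks.
# The structure and key names below are illustrative assumptions;
# only the numeric values are taken from the paper's stated setup.
TRAIN_CONFIG = {
    "NExT-QA": {
        "learning_rate": 5e-5,  # "as suggested in the paper"
        "batch_size": 64,
        "epochs": 50,
        "attention_heads": 4,   # Transformer attention heads
    },
    "AVSD-QA": {
        "learning_rate": 1e-3,
        "batch_size": 100,
        "epochs": 20,
        "attention_heads": 2,
    },
}

if __name__ == "__main__":
    # Print each dataset's reported configuration.
    for dataset, cfg in TRAIN_CONFIG.items():
        print(dataset, cfg)
```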