Glance and Focus: Memory Prompting for Multi-Event Video Question Answering
Authors: Ziyi Bai, Ruiping Wang, Xilin Chen
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on four Multi-Event Video QA benchmarks including STAR, EgoTaskQA, AGQA, and NExT-QA. Our proposed model achieves state-of-the-art results, surpassing current large models in various challenging reasoning tasks. |
| Researcher Affiliation | Academia | 1Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China 2University of Chinese Academy of Sciences, Beijing, 100049, China |
| Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block. |
| Open Source Code | Yes | The code and models are available at https://github.com/ByZ0e/Glance-Focus. |
| Open Datasets | Yes | We conduct extensive experiments on four Multi-Event Video QA benchmarks including STAR, EgoTaskQA, AGQA, and NExT-QA. |
| Dataset Splits | Yes | For each benchmark, we follow standard protocols outlined by prior works for data processing, metrics, and settings to ensure fair comparisons. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA GeForce RTX 3090Ti GPU. |
| Software Dependencies | No | The paper mentions software components like S3D, C3D, Faster-RCNN, CLIP, Transformer, RoBERTa, and Adam optimizer, but does not provide specific version numbers for these or other ancillary software dependencies. |
| Experiment Setup | Yes | We employ a standard 2-layer, 8-head Transformer Encoder-Decoder with hidden size D of 512 as the backbone for our Glance-Focus model. ... For training details, we use dropout of 0.1, and initialize model weights using Xavier init [13]. Adam optimizer [20] is used with a learning rate of 5e-6 to optimize model parameters. (A configuration sketch follows the table.) |
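
For reference, below is a minimal sketch of the backbone configuration quoted in the Experiment Setup row: a 2-layer, 8-head Transformer encoder-decoder with hidden size 512, dropout 0.1, Xavier-initialized weights, and Adam with a learning rate of 5e-6. This is not the authors' released code (see their repository for that); the feed-forward width and `batch_first` setting are assumptions not stated in the excerpt, and the memory-prompting logic of Glance-Focus is omitted.

```python
import torch
import torch.nn as nn

D = 512  # hidden size quoted in the paper

# Backbone only: 2 encoder and 2 decoder layers, 8 attention heads, dropout 0.1.
# dim_feedforward=2048 is an assumption; the excerpt does not specify it.
backbone = nn.Transformer(
    d_model=D,
    nhead=8,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=2048,
    dropout=0.1,
    batch_first=True,
)

# Xavier initialization of weight matrices, as described in the training details.
for p in backbone.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

# Adam optimizer with the quoted learning rate of 5e-6.
optimizer = torch.optim.Adam(backbone.parameters(), lr=5e-6)
```
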