Glance and Focus: Memory Prompting for Multi-Event Video Question Answering

Authors: Ziyi Bai, Ruiping Wang, Xilin Chen

NeurIPS 2023

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on four Multi-Event Video QA benchmarks, including STAR, EgoTaskQA, AGQA, and NExT-QA. Our proposed model achieves state-of-the-art results, surpassing current large models on various challenging reasoning tasks. |
| Researcher Affiliation | Academia | 1. Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China; 2. University of Chinese Academy of Sciences, Beijing, 100049, China |
| Pseudocode | No | The paper does not contain a pseudocode block or a clearly labeled algorithm block. |
| Open Source Code | Yes | The code and models are available at https://github.com/ByZ0e/Glance-Focus. |
| Open Datasets | Yes | We conduct extensive experiments on four Multi-Event Video QA benchmarks, including STAR, EgoTaskQA, AGQA, and NExT-QA. |
| Dataset Splits | Yes | For each benchmark, we follow the standard protocols outlined by prior works for data processing, metrics, and settings to ensure fair comparisons. |
| Hardware Specification | Yes | All experiments are conducted on an NVIDIA GeForce RTX 3090 Ti GPU. |
| Software Dependencies | No | The paper mentions software components such as S3D, C3D, Faster R-CNN, CLIP, Transformer, RoBERTa, and the Adam optimizer, but does not provide version numbers for these or other ancillary software dependencies. |
| Experiment Setup | Yes | We employ a standard 2-layer, 8-head Transformer encoder-decoder with hidden size D of 512 as the backbone for our Glance-Focus model. ... For training details, we use a dropout of 0.1 and initialize model weights using Xavier init [13]. The Adam optimizer [20] is used with a learning rate of 5e-6 to optimize model parameters. |
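The reported setup (2-layer, 8-head Transformer encoder-decoder, hidden size 512, dropout 0.1, Xavier init, Adam with lr 5e-6) can be sketched in PyTorch. This is a minimal illustration of the stated hyperparameters, not the authors' implementation; the feed-forward width and the dummy input sizes are assumptions, since the paper excerpt does not specify them.

```python
import torch
import torch.nn as nn

# Sketch of the backbone described in the Experiment Setup row:
# 2-layer, 8-head Transformer encoder-decoder with hidden size D = 512,
# dropout 0.1, Xavier-initialized weights, Adam with learning rate 5e-6.
D = 512

backbone = nn.Transformer(
    d_model=D,
    nhead=8,
    num_encoder_layers=2,
    num_decoder_layers=2,
    dim_feedforward=2048,  # assumed; the excerpt does not state the FFN size
    dropout=0.1,
    batch_first=True,
)

# Xavier initialization for all weight matrices, per the training details
for p in backbone.parameters():
    if p.dim() > 1:
        nn.init.xavier_uniform_(p)

optimizer = torch.optim.Adam(backbone.parameters(), lr=5e-6)

# Sanity check with dummy inputs (batch of 2; 32 video-frame features as the
# encoder input, 8 query embeddings as the decoder input -- sizes are illustrative)
src = torch.randn(2, 32, D)
tgt = torch.randn(2, 8, D)
out = backbone(src, tgt)
print(out.shape)  # torch.Size([2, 8, 512])
```

With `batch_first=True`, the decoder output keeps the query-sequence shape, so each of the 8 decoder queries yields one 512-dimensional representation per sample.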