Streaming Long Video Understanding with Large Language Models

Authors: Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through extensive experiments, our model achieves superior performance and higher efficiency on long video benchmarks, showcasing precise temporal comprehension for detailed question answering.
Researcher Affiliation | Collaboration | Rui Qian (1), Xiaoyi Dong (1,2), Pan Zhang (2), Yuhang Zang (2), Shuangrui Ding (1), Dahua Lin (1,2,3), Jiaqi Wang (2); (1) The Chinese University of Hong Kong, (2) Shanghai AI Laboratory, (3) HKGAI under InnoHK
Pseudocode | No | The paper does not contain a figure, block, or section explicitly labeled "Pseudocode," "Algorithm," or "Algorithm X," nor does it present structured steps for a method or procedure formatted like code.
Open Source Code | No | Additionally, the code, model checkpoint, and data will be publicly released.
Open Datasets | Yes | We evaluate our model on long video QA datasets and present the statistics on the temporal duration of individual datasets in Table 1. Among them, NExT-QA [79], NExT-GQA [80] and Video-ChatGPT [50] encompass minute-long videos with thousands of frames. EgoSchema [51] contains over 5K three-minute videos with multiple-choice questions. Each question has a long temporal certificate, requiring more than 100 seconds within a video to produce a correct answer. MovieChat-1K [66] and MovieNet-QA [68] consist of around ten-minute-long or even hour-long movies, posing significant challenges for the model to comprehend the visual contents across such long time spans.
Dataset Splits | Yes | NExT-QA. In Table 4, we perform zero-shot evaluation on the validation split of NExT-QA [79] covering 5K multiple-choice questions. (A minimal evaluation sketch follows the table below.)
Hardware Specification | Yes | The whole training is conducted on 32 A100 (80G) GPUs for around 2.5 days.
Software Dependencies | No | The paper mentions specific models like "Phi-2 2.7B" and "Vicuna-7B," and components like "CLIP ViT-L/14" and "AdamW optimizer," but it does not provide specific version numbers for underlying software frameworks or libraries like Python, PyTorch, or CUDA, which are typically considered key software dependencies for replication.
Experiment Setup | Yes | In the first training stage, we initially freeze Phi-2 and only tune the MLP projector on 790K caption pairs, including 558K image caption data from CC3M [64] and 232K short video caption data from WebVid-2.5M [6]. Following LLaVA [46, 45], we use the AdamW optimizer [47] with global batch size 256 and initial learning rate 1 × 10^-3 with cosine decay to train 1 epoch for modality alignment. Subsequently, we jointly train Phi-2 and the MLP projector on 763K QA pairs, including 625K image QA pairs [18, 26, 33, 37, 46, 52, 53, 63, 65], 40K text conversations [1] and 98K video QA pairs [10], with global batch size 128 and initial learning rate 2 × 10^-5 with cosine decay. (A configuration sketch follows the table below.)
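
To make the two-stage recipe in the Experiment Setup row concrete, the Python sketch below wires the stated hyperparameters (AdamW with cosine decay; global batch size 256 at learning rate 1e-3 for stage 1, global batch size 128 at 2e-5 for stage 2) into standard PyTorch optimizer and scheduler objects. The phi2 and mlp_projector modules, their dimensions, and the step counts are placeholders of our own; the authors' training code has not been released.

import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Stand-ins for the real components; dimensions are illustrative, not taken from the paper.
phi2 = nn.Linear(2560, 2560)                      # placeholder for the Phi-2 2.7B language model
mlp_projector = nn.Sequential(nn.Linear(1024, 2560), nn.GELU(), nn.Linear(2560, 2560))

def make_stage(params, lr, total_steps):
    """AdamW with cosine decay of the learning rate over one training stage."""
    optimizer = AdamW(params, lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler

# Stage 1: modality alignment. Phi-2 is frozen; only the MLP projector is trained
# on 790K caption pairs for 1 epoch with global batch size 256 and initial LR 1e-3.
for p in phi2.parameters():
    p.requires_grad = False
stage1_steps = 790_000 // 256
opt1, sched1 = make_stage(mlp_projector.parameters(), lr=1e-3, total_steps=stage1_steps)

# Stage 2: Phi-2 and the projector are tuned jointly on 763K QA pairs with global
# batch size 128 and initial LR 2e-5. The step count assumes a single pass over the
# data; the quoted setup does not state the stage-2 epoch count.
for p in phi2.parameters():
    p.requires_grad = True
stage2_steps = 763_000 // 128
joint_params = list(phi2.parameters()) + list(mlp_projector.parameters())
opt2, sched2 = make_stage(joint_params, lr=2e-5, total_steps=stage2_steps)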
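
Similarly, for the Dataset Splits row, the sketch below shows one plausible way to score zero-shot multiple-choice accuracy over the NExT-QA validation split (5K questions, five candidate answers each). The sample format and the predict_option callable are assumptions for illustration, not the paper's evaluation API.

def evaluate_nextqa_val(samples, predict_option):
    """Zero-shot multiple-choice accuracy.

    samples: iterable of dicts with keys 'video', 'question', 'options', 'answer_idx'
    predict_option: callable (video, question, options) -> predicted option index
    """
    correct, total = 0, 0
    for s in samples:
        pred_idx = predict_option(s["video"], s["question"], s["options"])
        correct += int(pred_idx == s["answer_idx"])
        total += 1
    return correct / max(total, 1)

# Tiny usage example with a trivial stand-in predictor that always picks option 0.
dummy = [{"video": None, "question": "What happens?", "options": list("ABCDE"), "answer_idx": 0}]
print(evaluate_nextqa_val(dummy, lambda video, question, options: 0))  # -> 1.0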