Streaming Long Video Understanding with Large Language Models
Authors: Rui Qian, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Shuangrui Ding, Dahua Lin, Jiaqi Wang
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive experiments, our model achieves superior performance and higher efficiency on long video benchmarks, showcasing precise temporal comprehension for detailed question answering. |
| Researcher Affiliation | Collaboration | Rui Qian¹, Xiaoyi Dong¹,², Pan Zhang², Yuhang Zang², Shuangrui Ding¹, Dahua Lin¹,²,³, Jiaqi Wang² — ¹The Chinese University of Hong Kong, ²Shanghai AI Laboratory, ³HKGAI under InnoHK |
| Pseudocode | No | The paper does not contain a figure, block, or section explicitly labeled "Pseudocode," "Algorithm," or "Algorithm X," nor does it present structured steps for a method or procedure formatted like code. |
| Open Source Code | No | The paper only promises a future release: "Additionally, the code, model checkpoint, and data will be publicly released." |
| Open Datasets | Yes | We evaluate our model on long video QA datasets and present the statistics on the temporal duration of individual datasets in Table 1. Among them, Next-QA [79], Next-GQA [80] and Video-ChatGPT [50] encompass minute-long videos with thousands of frames. EgoSchema [51] contains over 5K three-minute videos with multiple-choice questions. Each question has a long temporal certificate, requiring more than 100 seconds within a video to produce a correct answer. MovieChat-1K [66] and MovieNet-QA [68] consist of around ten-minute-long or even hour-long movies, posing significant challenges for the model to comprehend the visual contents across such long time spans. |
| Dataset Splits | Yes | Next-QA. In Table 4, we perform zero-shot evaluation on the validation split of Next-QA [79] covering 5K multiple-choice questions. |
| Hardware Specification | Yes | The whole training is conducted on 32 A100 (80G) GPUs for around 2.5 days. |
| Software Dependencies | No | The paper mentions specific models like "Phi-2 2.7B" and "Vicuna-7B," and components like "CLIP ViT-L/14" and "AdamW optimizer," but it does not provide specific version numbers for underlying software frameworks or libraries like Python, PyTorch, or CUDA, which are typically considered key software dependencies for replication. |
| Experiment Setup | Yes | In the first training stage, we initially freeze Phi-2, and only tune the MLP projector on 790K caption pairs, including 558K image caption data from CC3M [64] and 232K short video caption data from WebVid-2.5M [6]. Following LLaVA [46, 45], we use the AdamW optimizer [47] with global batch size 256, initial learning rate 1 × 10⁻³ with cosine decay to train 1 epoch for modality alignment. Subsequently, we jointly train Phi-2 and the MLP projector on 763K QA pairs, including 625K image QA pairs [18, 26, 33, 37, 46, 52, 53, 63, 65], 40K text conversations [1] and 98K video QA pairs [10], with global batch size 128, initial learning rate 2 × 10⁻⁵ with cosine decay. |
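
The two-stage schedule quoted in the "Experiment Setup" row can be made concrete with a small optimizer sketch. This is a minimal, illustrative configuration only: the `projector` and `llm` modules are hypothetical stand-ins (the paper trains an MLP projector and Phi-2, not linear layers), the feature dimensions and per-epoch step counts are assumptions, and only the optimizer choice (AdamW), learning rates, global batch sizes, and cosine decay follow the paper's description.

```python
# Sketch of the reported two-stage training schedule (assumptions noted inline).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def make_stage_optimizer(params, lr, total_steps):
    """AdamW with cosine decay over the stage, as described in the paper."""
    optimizer = AdamW(params, lr=lr)
    scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler


# Hypothetical stand-ins for the MLP projector and the Phi-2 LLM.
projector = torch.nn.Linear(1024, 2560)   # assumed dimensions
llm = torch.nn.Linear(2560, 2560)         # placeholder, not the real Phi-2

# Stage 1: freeze the LLM, tune only the projector on ~790K caption pairs.
for p in llm.parameters():
    p.requires_grad = False
stage1_steps = 790_000 // 256             # global batch size 256, 1 epoch
opt1, sched1 = make_stage_optimizer(projector.parameters(),
                                    lr=1e-3, total_steps=stage1_steps)

# Stage 2: jointly tune the LLM and projector on ~763K QA pairs.
for p in llm.parameters():
    p.requires_grad = True
stage2_steps = 763_000 // 128             # global batch size 128, 1 epoch assumed
opt2, sched2 = make_stage_optimizer(
    list(llm.parameters()) + list(projector.parameters()),
    lr=2e-5, total_steps=stage2_steps)

# Per optimization step inside each stage's training loop, one would call:
#   loss.backward(); optimizer.step(); scheduler.step(); optimizer.zero_grad()
```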