TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment

Authors: Wei Li, Hehe Fan, Yongkang Wong, Mohan S. Kankanhalli, Yi Yang

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments, including zero-shot evaluation and finetuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video modality with LLMs.
Researcher Affiliation | Academia | 1 ReLER Lab, CCAI, Zhejiang University, China; 2 The State Key Laboratory of Brain-Machine Intelligence, Zhejiang University, China; 3 School of Computing, National University of Singapore, Singapore
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | https://github.com/dhg-wei/TOPA
Open Datasets | Yes | We introduce TextVid, a textual video dataset automatically generated by advanced LLMs. The TextVid dataset comprises 721K diverse Tideos along with associated high-quality annotations, which include detailed Tideo descriptions and a variety of question-answer pairs. [...] We evaluate TOPA on NExT-QA [72], STAR [70], TVQA [24], the recent challenging EgoSchema [39] and MVBench [29] benchmarks with the zero-shot setting.
Dataset Splits | Yes | NExT-QA [72] is a multi-choice video QA benchmark for causal and temporal reasoning, including 5,440 natural videos. The average length of video is 44 seconds. We report results on the NExT-QA validation set, which contains 570 videos and 5,000 multiple-choice questions.
Hardware Specification | Yes | TOPA-Llama2-7B and TOPA-Llama3-8B are trained on four 40G-A100 GPUs in one day. TOPA-Llama2-13B is trained in two days.
Software Dependencies | No | The paper mentions specific models (Llama2-7B, Llama2-13B, Llama3-8B, LLaMA-Adapter, CLIP-ViT-L) and an optimizer (AdamW) but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or TensorFlow.
Experiment Setup | Yes | We include the training details in Table 12. The actual learning rate is calculated by base lr × Effective Batchsize / 256.

Table 12: Training hyper-parameters (pre-training stage).

Model | Training Dataset | Epochs | Effective Batchsize (bs × #GPUs × grad accu) | base lr | Optimizer
TOPA-Llama2-7B | TextVid 721K | 20 | 18 × 4 × 4 | 5e-3 | AdamW, weight decay 0.1, warm up 1 epoch
TOPA-Llama2-13B | TextVid 721K | 20 | 4 × 4 × 8 | 8e-3 | AdamW, weight decay 0.1, warm up 1 epoch
TOPA-Llama3-8B | TextVid 721K | 20 | 14 × 4 × 8 | 5e-3 | AdamW, weight decay 0.1, warm up 1 epoch
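To make the scaling rule concrete, below is a minimal Python sketch that reproduces the actual learning rates implied by Table 12. The (bs, #GPUs, grad accu) tuples and base learning rates are quoted from the table; the function names and script form are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch (assumed, not from the TOPA repository) of the linear
# learning-rate scaling rule quoted above:
#   actual lr = base lr * Effective Batchsize / 256
# Hyper-parameter values are taken from Table 12.

def effective_batchsize(per_gpu_bs: int, num_gpus: int, grad_accum: int) -> int:
    """Effective batch size = per-GPU batch size x #GPUs x grad-accumulation steps."""
    return per_gpu_bs * num_gpus * grad_accum

def actual_lr(base_lr: float, eff_bs: int) -> float:
    """Scale the base learning rate linearly with the effective batch size."""
    return base_lr * eff_bs / 256

configs = {
    # model: (per-GPU bs, #GPUs, grad accu, base lr) -- from Table 12
    "TOPA-Llama2-7B":  (18, 4, 4, 5e-3),
    "TOPA-Llama2-13B": (4, 4, 8, 8e-3),
    "TOPA-Llama3-8B":  (14, 4, 8, 5e-3),
}

for model, (bs, gpus, accum, base_lr) in configs.items():
    eff_bs = effective_batchsize(bs, gpus, accum)
    print(f"{model}: effective bs = {eff_bs}, actual lr = {actual_lr(base_lr, eff_bs):.2e}")

# Expected output:
#   TOPA-Llama2-7B: effective bs = 288, actual lr = 5.62e-03
#   TOPA-Llama2-13B: effective bs = 128, actual lr = 4.00e-03
#   TOPA-Llama3-8B: effective bs = 448, actual lr = 8.75e-03
```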