TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment
Authors: Wei Li, Hehe Fan, Yongkang Wong, Mohan S. Kankanhalli, Yi Yang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, including zero-shot evaluation and finetuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video modality with LLMs. |
| Researcher Affiliation | Academia | 1 ReLER Lab, CCAI, Zhejiang University, China; 2 The State Key Laboratory of Brain-Machine Intelligence, Zhejiang University, China; 3 School of Computing, National University of Singapore, Singapore |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | https://github.com/dhg-wei/TOPA |
| Open Datasets | Yes | We introduce TextVid, a textual video dataset automatically generated by advanced LLMs. The TextVid dataset comprises 721K diverse Tideos along with associated high-quality annotations, which include detailed Tideo descriptions and a variety of question-answer pairs. [...] We evaluate TOPA on NExT-QA [72], STAR [70], TVQA [24], the recent challenging EgoSchema [39], and MVBench [29] benchmarks under the zero-shot setting. |
| Dataset Splits | Yes | NExT-QA [72] is a multi-choice video QA benchmark for causal and temporal reasoning, including 5,440 natural videos. The average video length is 44 seconds. We report results on the NExT-QA validation set, which contains 570 videos and 5,000 multiple-choice questions. |
| Hardware Specification | Yes | TOPA-Llama2-7B and TOPA-Llama3-8B are trained on four 40G-A100 GPUs in one day. TOPA-Llama2-13B is trained in two days. |
| Software Dependencies | No | The paper mentions specific models (Llama2-7B, Llama2-13B, Llama3-8B, LLaMA-Adapter, CLIP-ViT-L) and an optimizer (AdamW) but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We include the training details in Table 12. The actual learning rate is calculated as base lr × Effective Batchsize / 256 (a short sketch of this scaling rule follows the table). Table 12 (Training hyper-parameters): pre-training is on TextVid 721K for 20 epochs with AdamW (weight decay 0.1, 1-epoch warm-up), and Effective Batchsize is given as bs × #GPUs × grad accu. TOPA-Llama2-7B: 18×4×4, base lr 5e-3; TOPA-Llama2-13B: 4×4×8, base lr 8e-3; TOPA-Llama3-8B: 14×4×8, base lr 5e-3. |
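As a quick illustration of the linear scaling rule quoted in the Experiment Setup row, the following Python snippet is a minimal sketch, not taken from the TOPA repository; the function and variable names (`scaled_lr`, `per_gpu_bs`, `grad_accum`) are illustrative assumptions.

```python
# Minimal sketch of the learning-rate scaling rule quoted above:
#   actual_lr = base_lr * effective_batchsize / 256
# where effective batch size = per-GPU batch size * #GPUs * gradient accumulation steps.
# Names below are illustrative, not from the TOPA codebase.

def scaled_lr(base_lr: float, per_gpu_bs: int, num_gpus: int, grad_accum: int) -> float:
    """Return the actual learning rate for a given effective batch size."""
    effective_batchsize = per_gpu_bs * num_gpus * grad_accum
    return base_lr * effective_batchsize / 256


# Example with the TOPA-Llama2-7B setting from Table 12 (bs 18, 4 GPUs, grad accu 4):
# effective batch size = 18 * 4 * 4 = 288, so actual lr = 5e-3 * 288 / 256 ≈ 5.6e-3.
print(scaled_lr(5e-3, 18, 4, 4))
```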