Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment
Authors: Wei Li, Hehe Fan, Yongkang Wong, Mohan S. Kankanhalli, Yi Yang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments, including zero-shot evaluation and finetuning on various video understanding tasks, demonstrate that TOPA is an effective and efficient framework for aligning video modality with LLMs. |
| Researcher Affiliation | Academia | 1 ReLER Lab, CCAI, Zhejiang University, China 2 The State Key Laboratory of Brain-Machine Intelligence, Zhejiang University, China 3 School of Computing, National University of Singapore, Singapore |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | https://github.com/dhg-wei/TOPA |
| Open Datasets | Yes | We introduce TextVid, a textual video dataset automatically generated by advanced LLMs. TextVid dataset comprises 721K diverse Tideos along with associated high-quality annotations, which include detailed Tideo descriptions and a variety of question-answer pairs. [...] We evaluate TOPA on NExT-QA [72], STAR [70], TVQA [24], recent challenging EgoSchema [39] and MVBench [29] benchmarks with the zero-shot setting. |
| Dataset Splits | Yes | NExT-QA [72] is a multi-choice video QA benchmark for causal and temporal reasoning, including 5,440 natural videos. The average length of video is 44 seconds. We report results on NExT-QA validation set, which contains 570 videos and 5,000 multiple-choice questions. |
| Hardware Specification | Yes | TOPA-Llama2-7B and TOPA-Llama3-8B are trained on four 40G-A100 GPUs in one day. TOPA-Llama2-13B is trained in two days. |
| Software Dependencies | No | The paper mentions specific models (Llama2-7B, Llama2-13B, Llama3-8B, Llama-adapter, CLIP-ViT-L) and an optimizer (AdamW) but does not provide specific version numbers for underlying software dependencies like Python, PyTorch, or TensorFlow. |
| Experiment Setup | Yes | We include the training details in Table 12. The actual learning rate is calculated by base lr × Effective Batchsize / 256. Table 12: Training hyper-parameters (columns: Model; Training Dataset; Epoch; Effective Batchsize (bs, #GPUs, grad accu); base lr; Optimizer). Pre-training: TOPA-Llama2-7B, TextVid 721K, 20 epochs, 18x4x4, base lr 5e-3; TOPA-Llama2-13B, 4x4x8, base lr 8e-3; TOPA-Llama3-8B, 14x4x8, base lr 5e-3. Optimizer: AdamW, weight decay 0.1, warm up 1 epoch. |
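
The learning-rate rule quoted in the Experiment Setup row can be worked through numerically. The following is a minimal sketch, not the authors' code: it assumes that "bs" in the Table 12 header means the per-GPU batch size, so that the effective batch size is the product bs × #GPUs × grad accu. The class name `PretrainConfig` and its field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PretrainConfig:
    """One pre-training row from the quoted Table 12 (field names are hypothetical)."""
    name: str
    per_gpu_batch: int  # "bs" in the table header; assumed to be per-GPU
    num_gpus: int       # "#GPUs"
    grad_accum: int     # "grad accu" (gradient accumulation steps)
    base_lr: float      # "base lr"

    @property
    def effective_batch_size(self) -> int:
        # Assumption: effective batch size = bs * #GPUs * grad accu.
        return self.per_gpu_batch * self.num_gpus * self.grad_accum

    @property
    def actual_lr(self) -> float:
        # Rule quoted in the table: base lr * Effective Batchsize / 256.
        return self.base_lr * self.effective_batch_size / 256


# Values copied from the Experiment Setup row.
CONFIGS = [
    PretrainConfig("TOPA-Llama2-7B", 18, 4, 4, 5e-3),
    PretrainConfig("TOPA-Llama2-13B", 4, 4, 8, 8e-3),
    PretrainConfig("TOPA-Llama3-8B", 14, 4, 8, 5e-3),
]

if __name__ == "__main__":
    for cfg in CONFIGS:
        print(f"{cfg.name}: effective batch size {cfg.effective_batch_size}, "
              f"actual lr {cfg.actual_lr:.6f}")
```

Under that reading, the effective batch sizes come out to 288, 128, and 448, giving actual learning rates of roughly 0.005625, 0.004, and 0.00875 respectively. If "bs" instead denotes a global batch size, these numbers would differ.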