Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition

Authors: Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts existing state-of-the-art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, where we show great potential in extending it to a wider range of video understanding scenarios.
Researcher Affiliation | Academia | National University of Singapore, Singapore; Nanyang Technological University, Singapore; Harbin Institute of Technology (Shenzhen), China.
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures).
Open Source Code | Yes | The project is open-sourced at https://haofei.vip/VoT.
Open Datasets | Yes | For the fine-tuning setting, we adopt 6 benchmarks characterizing complex video QA where advanced video abilities, e.g., explanation, causality, foresight and imagination, are required: VLEP (Lei et al., 2020), STAR (Wu et al., 2021), IntentQA (Li et al., 2023b), Social-IQ (Zadeh et al., 2019), Causal-VidQA (Li et al., 2022a) and NExT-QA (Xiao et al., 2021). For the zero-shot setting, we further consider the MSR-VTT (Xu et al., 2016) and ActivityNet (Heilbron et al., 2015) datasets.
Dataset Splits | Yes | All datasets come with their own splits, and we follow prior practice without modification.
Hardware Specification | Yes | All training is conducted on 16 NVIDIA A100 GPUs.
Software Dependencies | No | The paper mentions models and components such as Vicuna-7B (v1.5), ViT-L/14, Q-Former, and LLaMA (for the tokenizer), but does not provide version numbers for general software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries required for reproducibility.
Experiment Setup | Yes | For each video, we uniformly sample frames at a rate of 8 fps for fine-grained reasoning.
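
As a concrete illustration of this setup, the sketch below shows one way to uniformly sample frames from a video at a fixed rate of 8 fps. This is not the authors' released code; it assumes OpenCV (cv2) is available, and the function name sample_frames and its arguments are illustrative only.

    import cv2

    def sample_frames(video_path: str, target_fps: float = 8.0):
        """Uniformly sample frames at roughly target_fps frames per second."""
        cap = cv2.VideoCapture(video_path)
        native_fps = cap.get(cv2.CAP_PROP_FPS) or target_fps  # fall back if FPS metadata is missing
        step = max(1, round(native_fps / target_fps))  # keep every step-th frame
        frames, idx = [], 0
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            if idx % step == 0:
                frames.append(frame)
            idx += 1
        cap.release()
        return frames

For a 30 fps video, step works out to 4, so roughly 8 frames are kept per second of footage; the sampled frames would then feed the perception stage of the step-by-step reasoning pipeline.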