Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Authors: Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts existing state-of-the-art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, where we show great potential in extending it to a wider range of video understanding scenarios. |
| Researcher Affiliation | Academia | 1National University of Singapore, Singapore 2Nanyang Technological University, Singapore 3Harbin Institute of Technology (Shenzhen), China. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | The project is open at https://haofei.vip/VoT. |
| Open Datasets | Yes | For the fine-tuning setting, we adopt 6 benchmarks characterizing complex video QA where advanced video abilities, e.g., explanation, causality, foresight and imagination, are required: VLEP (Lei et al., 2020), STAR (Wu et al., 2021), IntentQA (Li et al., 2023b), Social-IQ (Zadeh et al., 2019), Causal-VidQA (Li et al., 2022a) and NExT-QA (Xiao et al., 2021). For the zero-shot setting, we further consider using the MSR-VTT (Xu et al., 2016) and ActivityNet (Heilbron et al., 2015) datasets. |
| Dataset Splits | Yes | All datasets come with their own splitting, and we follow the prior practice without modification. |
| Hardware Specification | Yes | All trainings are conducted on 16 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions models and components such as 'Vicuna-7B (v1.5)', 'ViT-L/14', 'Q-Former', and 'LLaMA' (as the tokenizer), but does not provide specific version numbers for general software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries required for reproducibility. |
| Experiment Setup | Yes | For each video, we uniformly sample frames at a rate of 8 fps for fine-grained reasoning. |
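The paper only states the 8 fps sampling rate, not the implementation. A minimal sketch of uniform frame sampling under that setting (the function name and the use of frame indices are assumptions, not from the paper):

```python
def sample_frames(total_frames: int, video_fps: float, target_fps: float = 8.0) -> list[int]:
    """Uniformly pick frame indices so the sampled sequence
    corresponds to roughly target_fps of the original video.
    Hypothetical helper, not the authors' code."""
    step = video_fps / target_fps          # e.g. 30 fps video -> every 3.75th frame
    n = int(total_frames / step)           # number of frames to keep
    return [min(total_frames - 1, round(i * step)) for i in range(n)]

# A 10-second clip recorded at 30 fps (300 frames) yields 80 sampled frames at 8 fps.
indices = sample_frames(total_frames=300, video_fps=30.0)
```

The returned indices would then be used to extract the corresponding frames with a video decoding library before feeding them to the model.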