Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition
Authors: Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments across various complex video QA benchmarks demonstrate that our overall framework strikingly boosts existing state-of-the-art. To our knowledge, this is the first attempt at successfully implementing the CoT technique for achieving human-level video reasoning, where we show great potential in extending it to a wider range of video understanding scenarios. |
| Researcher Affiliation | Academia | 1National University of Singapore, Singapore 2Nanyang Technological University, Singapore 3Harbin Institute of Technology (Shenzhen), China. |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | The project is open at https://haofei.vip/VoT. |
| Open Datasets | Yes | For the fine-tuning setting, we adopt 6 benchmarks characterizing complex video QA where advanced video abilities, e.g., explanation, causality, foresight and imagination, are required: VLEP (Lei et al., 2020), STAR (Wu et al., 2021), IntentQA (Li et al., 2023b), Social-IQ (Zadeh et al., 2019), Causal-VidQA (Li et al., 2022a) and NExT-QA (Xiao et al., 2021). For the zero-shot setting, we further consider using the MSR-VTT (Xu et al., 2016) and ActivityNet (Heilbron et al., 2015) datasets. |
| Dataset Splits | Yes | All datasets come with their own splitting, and we follow the prior practice without modification. |
| Hardware Specification | Yes | All trainings are conducted on 16 NVIDIA A100 GPUs. |
| Software Dependencies | No | The paper mentions models and components such as 'Vicuna-7B (v1.5)', 'ViT-L/14', 'Q-Former', and 'LLaMA' (as the tokenizer), but does not provide specific version numbers for general software dependencies such as programming languages, deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries required for reproducibility. |
| Experiment Setup | Yes | For each video, we uniformly sample frames at a rate of 8 fps for fine-grained reasoning. |
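The paper only states the 8 fps sampling rate, not the implementation. A minimal sketch of uniform frame sampling under that setting (the function name and the use of frame indices are assumptions, not from the paper):

```python
def sample_frames(total_frames: int, video_fps: float, target_fps: float = 8.0) -> list[int]:
    """Uniformly pick frame indices so the sampled sequence
    corresponds to roughly target_fps of the original video.
    Hypothetical helper, not the authors' code."""
    step = video_fps / target_fps          # e.g. 30 fps video -> every 3.75th frame
    n = int(total_frames / step)           # number of frames to keep
    return [min(total_frames - 1, round(i * step)) for i in range(n)]

# A 10-second clip recorded at 30 fps (300 frames) yields 80 sampled frames at 8 fps.
indices = sample_frames(total_frames=300, video_fps=30.0)
```

The returned indices would then be used to extract the corresponding frames with a video decoding library before feeding them to the model.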