MECD: Unlocking Multi-Event Causal Discovery in Video Reasoning

Authors: Tieyuan Chen, Huabin Liu, Tianyao He, Yihang Chen, Chaofan Gan, Xiao Ma, Cheng Zhong, Yang Zhang, Yingxue Wang, Hui Lin, Weiyao Lin

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments validate the effectiveness of our framework in providing causal relationships in multi-event videos, outperforming GPT-4o and Video-LLaVA by 5.7% and 4.1%, respectively.
Researcher Affiliation | Collaboration | 1 Shanghai Jiao Tong University, 2 Lenovo Research, AI Lab, 3 China Academy of Electronics and Information Technology
Pseudocode | No | The paper provides model architecture diagrams and mathematical equations but does not include a pseudocode block or algorithm.
Open Source Code | Yes | https://github.com/tychen-SJTU/MECD-Benchmark
Open Datasets | Yes | The ActivityNet Captions dataset [32] is built on ActivityNet v1.3, which includes 20k 120-second YouTube untrimmed videos. ... We call this new dataset the MECD dataset, where 806 and 299 videos are randomly split for training and testing, respectively. [32] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision, pages 706-715, 2017.
Dataset Splits | No | The paper states that 806 videos are used for training and 299 for testing, but it does not mention a separate validation split or its size. While a validation split is common practice, none is explicitly provided in the text.
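The reported 806/299 random partition can be sketched as follows; the seed, function name, and placeholder video IDs are assumptions for illustration, not details from the paper.

```python
import random

def split_videos(video_ids, n_train=806, n_test=299, seed=0):
    """Randomly partition video IDs into train/test sets (hypothetical seed and API)."""
    assert len(video_ids) == n_train + n_test
    rng = random.Random(seed)
    shuffled = list(video_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder IDs standing in for the 1105 MECD videos.
train_ids, test_ids = split_videos([f"vid_{i:04d}" for i in range(1105)])
print(len(train_ids), len(test_ids))  # 806 299
```

Fixing the seed makes the partition reproducible across runs, which is the property a reproducibility check would want even though the paper does not state one.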
Hardware Specification | Yes | All the experiments are conducted on 1 NVIDIA A40 GPU. ... The inference speed experiments were conducted on 1 NVIDIA A6000 GPU.
Software Dependencies | No | The paper mentions using the BertAdam optimizer, building upon VideoBERT, and using the GPT-4 API for data generation, but it does not specify version numbers for these software components or for other libraries used in the implementation.
Experiment Setup | Yes | We train our model for 20 epochs with a learning rate of 16e-5, taking about 6 hours. Our optimizer is consistent with the BertAdam [50] optimizer, with 3 epochs of warm-up. ... Hyperparameters λC, λR, λV, λS are set to 1.0, 4.0, 0.25, 0.05. Maximum input lengths of the caption, the chain of thoughts, and the existence-only descriptions are set to 50.
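The reported setup can be collected into a single configuration sketch. The field names are assumptions, and the BertAdam warm-up is approximated here by a standard linear warm-up to the base rate; the paper only states "3 epochs of warm-up" without giving the exact schedule.

```python
# Hypothetical config mirroring the hyperparameters reported in the paper.
CONFIG = {
    "epochs": 20,
    "learning_rate": 16e-5,
    "warmup_epochs": 3,       # BertAdam-style warm-up
    "lambda_C": 1.0,
    "lambda_R": 4.0,
    "lambda_V": 0.25,
    "lambda_S": 0.05,
    "max_len_caption": 50,    # caption, chain-of-thought, and
    "max_len_cot": 50,        # existence-only descriptions all
    "max_len_existence": 50,  # capped at 50 tokens
}

def warmup_lr(epoch, base_lr=CONFIG["learning_rate"], warmup=CONFIG["warmup_epochs"]):
    """Linear warm-up over the first `warmup` epochs, then constant (assumed schedule)."""
    if epoch < warmup:
        return base_lr * (epoch + 1) / warmup
    return base_lr
```

With this schedule the learning rate ramps from 16e-5/3 at epoch 0 to the full 16e-5 by epoch 2 and stays flat for the remaining 17 epochs.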