On Learning Multi-Modal Forgery Representation for Diffusion Generated Video Detection
Authors: Xiufeng Song, Xiao Guo, Jiache Zhang, Qirui Li, Lei Bai, Xiaoming Liu, Guangtao Zhai, Xiaohong Liu
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | MM-Det achieves state-of-the-art performance in DVF, demonstrating the effectiveness of our algorithm. Both source code and DVF are available at link. |
| Researcher Affiliation | Collaboration | (1) Shanghai Jiao Tong University, (2) Michigan State University, (3) Shanghai Artificial Intelligence Laboratory. Emails: {akikaze, zjc he, iapple1, zhaiguangtao, xiaohongliu}@sjtu.edu.cn; {guoxia11, liuxm}@cse.msu.edu; baisanshi@gmail.com. Corresponding Author |
| Pseudocode | No | The paper describes the system architecture and components (LMM Branch, ST Branch, Dynamic Fusion) in detail through text and diagrams (e.g., Fig. 4), but it does not present formal pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Both source code and DVF are available at link. |
| Open Datasets | Yes | We construct a large-scale dataset for the video forensic task named Diffusion Video Forensics (DVF), as shown in Fig. 6. DVF contains 8 diffusion generative methods, including Stable Diffusion [42], VideoCrafter1 [5], Zeroscope, Sora, Pika, Open Sora, Stable Video, and Stable Video Diffusion [4]. ... Both source code and DVF are available at link. |
| Dataset Splits | Yes | In training, 1,000 videos from YouTube and 1,973 fake videos generated by Stable Video Diffusion serve as the training set, in which 90% are used for training and the remaining 10% for validation. |
| Hardware Specification | Yes | As for the experimental resources in training and inference, we conduct all experiments using a single NVIDIA RTX 3090 GPU and a maximum of 200G memory. |
| Software Dependencies | No | The paper mentions key software components such as 'LLaVA [28] v1.5', the 'CLIP [39] encoder E of CLIP-ViT-L-patch14-336', the 'large language model D of Vicuna-7b', 'LoRA [21]', and the 'Adam optimizer'. While LLaVA v1.5 and Vicuna-7b have versions, a comprehensive list of all critical software dependencies (e.g., Python, PyTorch, CUDA versions) with specific version numbers is not provided for full reproducibility. |
| Experiment Setup | Yes | We use an Adam optimizer with the learning rate set as 2e-5 for 10 epochs. ... The training set is split into 9:1 for training and validation data. For each video, successive 10 frames are randomly sampled and cropped into 224×224 as the input. ... We use an Adam optimizer with the learning rate set as 1e-4 for training until the model converges. |
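The data-preparation steps quoted above (a 9:1 train/validation split and random sampling of 10 successive frames per video) can be sketched in plain Python. This is a minimal illustration of the described protocol, not the authors' released code; the function names, seeds, and cropping omission are assumptions.

```python
import random

def split_train_val(video_paths, val_ratio=0.1, seed=0):
    """Shuffle and split a list of video paths 9:1 into train/validation sets,
    matching the 90%/10% split described in the paper."""
    paths = list(video_paths)
    random.Random(seed).shuffle(paths)
    n_val = int(len(paths) * val_ratio)
    return paths[n_val:], paths[:n_val]

def sample_clip_indices(num_frames, clip_len=10, seed=None):
    """Randomly choose the start of a run of `clip_len` successive frames,
    as in the paper's per-video sampling (cropping to 224x224 would follow)."""
    rng = random.Random(seed)
    start = rng.randint(0, num_frames - clip_len)
    return list(range(start, start + clip_len))
```

A usage example: `split_train_val([f"video_{i}.mp4" for i in range(2973)])` yields a 2,676/297 split, and `sample_clip_indices(300)` returns 10 consecutive frame indices within a 300-frame video.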