Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Authors: Mingxiang Liao, Hannan Lu, Qixiang Ye, Wangmeng Zuo, Fang Wan, Tianyu Wang, Yuzhong Zhao, Jingdong Wang, Xinyu Zhang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that DEVIL evaluation metrics enjoy up to about 90% consistency with human ratings, demonstrating the potential to advance T2V generation models.
Researcher Affiliation | Collaboration | 1 University of Chinese Academy of Sciences, 2 Harbin Institute of Technology, 3 The University of Adelaide, 4 Baidu Inc.
Pseudocode | No | The paper provides equations and descriptive steps but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper provides a project page (t2veval.github.io/DEVIL/), which is an overview page rather than a direct link to a source-code repository.
Open Datasets | Yes | The text prompts are collected from commonly used datasets [7, 6, 46, 41] and categorized into dynamics grades using GPT-4 [30] with human refinement.
Dataset Splits | No | The paper mentions training on 75% of the data and testing on the remaining 25% but does not explicitly describe a separate validation split.
Hardware Specification | Yes | Our dynamics metrics offer high computational efficiency, achieving around 10 frames per second on a single NVIDIA A100 GPU, and are scalable to multiple GPUs.
Software Dependencies | Yes | We employed the advanced multi-modal large model, Gemini-1.5 Pro [1], equipped with video understanding capabilities, to assess and classify the naturalness of video content.
Experiment Setup | Yes | For each linear regression model, the human evaluation results serve as ground truth; the model is trained on 75% of the randomly selected videos and tested on the remaining 25%. (A minimal sketch of this protocol follows the table.)
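
The Experiment Setup row describes fitting a linear regression model against human evaluation scores using a random 75%/25% train/test split. The snippet below is a minimal sketch of that protocol, not the authors' code: the feature matrix, rating vector, and the use of scikit-learn plus a Spearman rank correlation as the "consistency" measure are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the authors' implementation) of the protocol
# in the Experiment Setup row: linear regression with human ratings as
# ground truth, trained on a random 75% of videos, tested on the rest.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical per-video metric scores and human ratings; in the paper these
# would come from the benchmark videos and the human evaluation study.
num_videos, num_metrics = 400, 5
metric_scores = rng.random((num_videos, num_metrics))
human_ratings = metric_scores @ rng.random(num_metrics) \
    + 0.1 * rng.standard_normal(num_videos)

# 75% / 25% random split, as stated in the paper (no separate validation split).
X_train, X_test, y_train, y_test = train_test_split(
    metric_scores, human_ratings, test_size=0.25, random_state=0
)

# Linear regression with human evaluation results as ground truth.
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# One plausible way to report consistency with human ratings on the held-out
# 25% is a rank correlation; the exact measure used is an assumption here.
rho, _ = spearmanr(predictions, y_test)
print(f"Spearman correlation on held-out videos: {rho:.3f}")
```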