Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
Authors: Mingxiang Liao, Hannan Lu, Qixiang Ye, Wangmeng Zuo, Fang Wan, Tianyu Wang, Yuzhong Zhao, Jingdong Wang, Xinyu Zhang
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that DEVIL evaluation metrics enjoy up to about 90% consistency with human ratings, demonstrating the potential to advance T2V generation models. |
| Researcher Affiliation | Collaboration | 1. University of Chinese Academy of Sciences; 2. Harbin Institute of Technology; 3. The University of Adelaide; 4. Baidu Inc. |
| Pseudocode | No | The paper provides equations and descriptive steps but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | The paper provides a 'Project page: t2veval.github.io/DEVIL/', which is a project overview page, not a direct link to a source-code repository. |
| Open Datasets | Yes | The text prompts are collected from commonly used datasets [7, 6, 46, 41] and categorized into dynamics grades using GPT-4 [30] and human refinement. |
| Dataset Splits | No | The paper mentions training on 75% of data and testing on 25% but does not explicitly describe a separate validation split. |
| Hardware Specification | Yes | Our dynamics metrics offer high computational efficiency, achieving around 10 frames per second on a single NVIDIA A100 GPU, and are scalable to multiple GPUs. |
| Software Dependencies | Yes | We employed the advanced multi-modal large model Gemini-1.5 Pro [1], equipped with video understanding capabilities, to assess and classify the naturalness of video content (a hedged usage sketch follows the table). |
| Experiment Setup | Yes | Each linear regression model takes the human evaluation results as ground truth, is trained on 75% of the randomly selected videos, and is tested on the remaining 25% (a sketch of this protocol follows the table). |
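
The naturalness-classification step in the Software Dependencies row can be illustrated with a short sketch. This is a minimal example assuming the `google-generativeai` Python SDK; the paper does not publish its prompt or pipeline, so the file name, prompt text, and API key placeholder here are hypothetical illustrations only.

```python
# Hedged sketch: classify video naturalness with Gemini-1.5 Pro.
# Assumes the google-generativeai SDK; prompt and file name are hypothetical.
import time

import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # hypothetical placeholder

# Upload a generated video and wait for the File API to finish processing it.
video_file = genai.upload_file(path="generated_clip.mp4")  # hypothetical file
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    video_file,
    # Hypothetical prompt; the paper's actual instructions are not released.
    "Classify the naturalness of this video's content as natural or "
    "unnatural, and briefly justify the classification.",
])
print(response.text)
```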
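
The 75%/25% protocol in the Experiment Setup row can likewise be sketched. This is a minimal Python example assuming scikit-learn and SciPy; the arrays `metric_scores` and `human_ratings` are synthetic stand-ins, and Spearman rank correlation is used here only as one possible consistency measure, since the paper's exact fitting and consistency code is not released.

```python
# Hedged sketch: fit a linear regression from DEVIL-style metric scores to
# human ratings, using the 75%/25% random split described in the paper.
import numpy as np
from scipy.stats import spearmanr
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: one row per generated video.
# `metric_scores` holds metric values (features); `human_ratings` holds the
# human evaluation results used as ground truth.
rng = np.random.default_rng(0)
metric_scores = rng.random((200, 4))  # e.g. 4 dynamics-related metrics
human_ratings = (metric_scores @ np.array([0.5, 0.2, 0.2, 0.1])
                 + 0.05 * rng.standard_normal(200))

# 75% of randomly selected videos for training, remaining 25% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    metric_scores, human_ratings, train_size=0.75, random_state=0
)

model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Consistency with human ratings, measured here by rank correlation; the
# paper reports "up to about 90% consistency", but its exact consistency
# measure may differ from this choice.
rho, _ = spearmanr(pred, y_test)
print(f"Spearman correlation with human ratings: {rho:.3f}")
```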