Evaluation of Text-to-Video Generation Models: A Dynamics Perspective

Authors: Mingxiang Liao, Hannan Lu, Qixiang Ye, Wangmeng Zuo, Fang Wan, Tianyu Wang, Yuzhong Zhao, Jingdong Wang, Xinyu Zhang

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that DEVIL evaluation metrics enjoy up to about 90% consistency with human ratings, demonstrating the potential to advance T2V generation models.
Researcher Affiliation | Collaboration | 1 University of Chinese Academy of Sciences, 2 Harbin Institute of Technology, 3 The University of Adelaide, 4 Baidu Inc.
Pseudocode | No | The paper provides equations and descriptive steps but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block.
Open Source Code | No | The paper provides a project page (t2veval.github.io/DEVIL/), which is an overview page rather than a direct link to a source-code repository.
Open Datasets | Yes | The text prompts are collected from commonly used datasets [7, 6, 46, 41] and categorized into dynamics grades using GPT-4 [30] with human refinement.
Dataset Splits | No | The paper mentions training on 75% of the data and testing on the remaining 25% but does not explicitly describe a separate validation split.
Hardware Specification | Yes | Our dynamics metrics offer high computational efficiency, achieving around 10 frames per second on a single NVIDIA A100 GPU, and are scalable to multiple GPUs.
Software Dependencies | Yes | We employed the advanced multi-modal large model, Gemini-1.5 Pro [1], equipped with video understanding capabilities, to assess and classify the naturalness of video content.
Experiment Setup | Yes | For each linear regression model, the human evaluation results serve as ground truth; the model is trained on 75% of the randomly selected videos and tested on the remaining 25%. (A minimal sketch of this protocol follows the table.)
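
The Experiment Setup row describes fitting a linear regression model against human evaluation scores using a random 75%/25% train/test split. The snippet below is a minimal sketch of that protocol, not the authors' code: the feature matrix, rating vector, and the use of scikit-learn plus a Spearman rank correlation as the "consistency" measure are all assumptions made for illustration.

```python
# Minimal sketch (assumed, not the authors' implementation) of the protocol
# in the Experiment Setup row: linear regression with human ratings as
# ground truth, trained on a random 75% of videos, tested on the rest.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical per-video metric scores and human ratings; in the paper these
# would come from the benchmark videos and the human evaluation study.
num_videos, num_metrics = 400, 5
metric_scores = rng.random((num_videos, num_metrics))
human_ratings = metric_scores @ rng.random(num_metrics) \
    + 0.1 * rng.standard_normal(num_videos)

# 75% / 25% random split, as stated in the paper (no separate validation split).
X_train, X_test, y_train, y_test = train_test_split(
    metric_scores, human_ratings, test_size=0.25, random_state=0
)

# Linear regression with human evaluation results as ground truth.
model = LinearRegression().fit(X_train, y_train)
predictions = model.predict(X_test)

# One plausible way to report consistency with human ratings on the held-out
# 25% is a rank correlation; the exact measure used is an assumption here.
rho, _ = spearmanr(predictions, y_test)
print(f"Spearman correlation on held-out videos: {rho:.3f}")
```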