Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Evaluation of Text-to-Video Generation Models: A Dynamics Perspective
Authors: Mingxiang Liao, Hannan Lu, Qixiang Ye, Wangmeng Zuo, Fang Wan, Tianyu Wang, Yuzhong Zhao, Jingdong Wang, Xinyu Zhang
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that DEVIL evaluation metrics enjoy up to about 90% consistency with human ratings, demonstrating the potential to advance T2V generation models. |
| Researcher Affiliation | Collaboration | (1) University of Chinese Academy of Sciences; (2) Harbin Institute of Technology; (3) The University of Adelaide; (4) Baidu Inc. |
| Pseudocode | No | The paper provides equations and descriptive steps but does not include a clearly labeled 'Pseudocode' or 'Algorithm' block. |
| Open Source Code | No | The paper provides a 'Project page: t2veval.github.io/DEVIL/' which is a project overview page, not a direct link to a source-code repository. |
| Open Datasets | Yes | The text prompts are collected from commonly used datasets [7, 6, 46, 41] and categorized to dynamics grades using GPT-4 [30] and human refinement. |
| Dataset Splits | No | The paper mentions training on 75% of data and testing on 25% but does not explicitly describe a separate validation split. |
| Hardware Specification | Yes | Our dynamics metrics offer high computational efficiency, achieving around 10 frames per second on a single NVIDIA A100 GPU, and are scalable to multiple GPUs. |
| Software Dependencies | Yes | We employed the advanced multi-modal large model, Gemini-1.5 Pro [1], equipped with video understanding capabilities, to assess and classify the naturalness of video content. |
| Experiment Setup | Yes | For each linear regression model, it takes the human evaluation results as ground-truths, trained upon 75% of the randomly selected videos and tests on the remaining 25% videos. |
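
The experiment setup quoted in the table (a linear regression fit to human ratings on a random 75% of the videos and tested on the remaining 25%) can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `fit_and_test`, the synthetic data, the fixed seeds, and the use of Pearson correlation as the test-time agreement measure are all assumptions made for the example.

```python
import numpy as np


def fit_and_test(metric_scores, human_scores, train_frac=0.75, seed=0):
    """Fit human ~ slope * metric + intercept on a random subset, test on the rest.

    Hypothetical helper: the split fraction follows the quoted setup, while
    the seed and the correlation-based test measure are assumptions.
    """
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)

    # Randomly assign videos to the training and test subsets.
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(x))
    n_train = int(round(train_frac * len(x)))
    train_idx, test_idx = order[:n_train], order[n_train:]

    # Ordinary least squares fit of a line, with human ratings as ground truth.
    slope, intercept = np.polyfit(x[train_idx], y[train_idx], deg=1)

    # Agreement between predictions and human ratings on held-out videos.
    predictions = slope * x[test_idx] + intercept
    test_corr = np.corrcoef(predictions, y[test_idx])[0, 1]
    return slope, intercept, test_corr, len(train_idx), len(test_idx)


# Synthetic stand-in data: 200 videos whose human ratings depend linearly
# on an automatic metric score, plus a small amount of noise.
data_rng = np.random.default_rng(1)
metric = data_rng.uniform(0.0, 1.0, size=200)
human = 4.0 * metric + 1.0 + data_rng.normal(0.0, 0.1, size=200)

slope, intercept, test_corr, n_train, n_test = fit_and_test(metric, human)
```

With 200 videos the split yields 150 training and 50 test videos. The table notes that the paper does not describe a separate validation split, so none is modeled here.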