Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Beyond FVD: An Enhanced Evaluation Metrics for Video Generation Distribution Quality
Authors: Ge Ya Luo, Gian M Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average. |
| Researcher Affiliation | Collaboration | Mila Quebec Artificial Intelligence Institute; Alexia Jolicoeur-Martineau, Samsung SAIT AI Lab, Montreal |
| Pseudocode | Yes | Algorithm 1: I3D Autoencoder Configuration |
| Open Source Code | Yes | Project page: https://oooolga.github.io/JEDi.github.io/. |
| Open Datasets | Yes | Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric... Specifically, we extract a maximum of 5000 features from the training set of each dataset, which include: AnimeRun-v2 (Siyao et al., 2022), BAIR (Ebert et al., 2017), BDD100k (Yu et al., 2020), DAVIS (Pont-Tuset et al., 2018), Fashion Modeling (Zablotskaia et al., 2019), HMDB-51 (Kuehne et al., 2011), How2Sign (Duarte et al., 2021), KITTI (Geiger et al., 2013), Something-Something-v2 (Goyal et al., 2017), Sky Scene (Xiong et al., 2018), and UCF-101 (Soomro et al., 2012). |
| Dataset Splits | Yes | Specifically, we extract a maximum of 5000 features from the training set of each dataset... transform 5,256 I3D features (up to 500 samples from each of the eleven datasets' testing sets) into lower-dimensional spaces for visualizations. |
| Hardware Specification | Yes | Feature extraction is performed on a single NVIDIA RTX 4080 GPU with float32 precision. However, VideoMAE-v2 features require a more specialized setup: a single NVIDIA RTX A100 GPU with 80G memory. |
| Software Dependencies | No | The paper mentions 'torchaudio's functional API' and 'PyTorch implementation' but does not specify their version numbers. |
| Experiment Setup | Yes | We use batch sizes of 10 (clips < 64 frames) and 2 (clips ≥ 64 frames) for feature extraction. The testing video dataset is subjected to noise distortions, including low blur (σ ∈ [0.05, 0.75]), medium blur (σ ∈ [0.1, 1.5]), and high blur (σ ∈ [0.01, 3]), where σ represents the per-frame blur intensity, and a larger range indicates greater temporal inconsistency. |
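
The Experiment Setup row can be read as a small configuration. The following is a minimal sketch of that reading, not the authors' code: the names `BLUR_RANGES`, `batch_size_for`, and `sample_frame_sigmas` are hypothetical, and drawing σ uniformly and independently per frame is an assumption, since the quoted text gives only the ranges.

```python
import random

# Blur ranges quoted in the Experiment Setup row (sigma lower/upper bounds).
# The dictionary name and keys are illustrative, not from the paper.
BLUR_RANGES = {
    "low": (0.05, 0.75),
    "medium": (0.1, 1.5),
    "high": (0.01, 3.0),
}


def batch_size_for(num_frames: int) -> int:
    """Return the quoted batch size: 10 for clips under 64 frames, else 2."""
    return 10 if num_frames < 64 else 2


def sample_frame_sigmas(level: str, num_frames: int, seed: int = 0) -> list:
    """Draw one blur sigma per frame from the range for `level`.

    ASSUMPTION: sigmas are sampled uniformly and independently per frame;
    the source states only the ranges, not the sampling distribution.
    """
    low, high = BLUR_RANGES[level]
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(num_frames)]
```

Under this reading, a wider range lets consecutive frames receive very different blur strengths, which matches the row's remark that a larger range indicates greater temporal inconsistency.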