Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Distribution Quality

Authors: Ge Ya Luo, Gian M Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average.
Researcher Affiliation Collaboration Mila – Quebec Artificial Intelligence Institute; Alexia Jolicoeur-Martineau: Samsung SAIT AI Lab, Montreal
Pseudocode Yes Algorithm 1: I3D Autoencoder Configuration
Open Source Code Yes Project page: https://oooolga.github.io/JEDi.github.io/.
Open Datasets Yes Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric... Specifically, we extract a maximum of 5000 features from the training set of each dataset, which include: Anime Run-v2 (Siyao et al., 2022), BAIR (Ebert et al., 2017), BDD100k (Yu et al., 2020), DAVIS (Pont-Tuset et al., 2018), Fashion Modeling (Zablotskaia et al., 2019), HMDB-51 (Kuehne et al., 2011), How2Sign (Duarte et al., 2021), KITTI (Geiger et al., 2013), Something-Something-v2 (Goyal et al., 2017), Sky Scene (Xiong et al., 2018), and UCF-101 (Soomro et al., 2012).
Dataset Splits Yes Specifically, we extract a maximum of 5000 features from the training set of each dataset... transform 5,256 I3D features (up to 500 samples from each of the eleven datasets' testing sets) into lower-dimensional spaces for visualizations.
Hardware Specification Yes Feature extraction is performed on a single NVIDIA RTX 4080 GPU with float32 precision. However, VideoMAE-v2 features require a more specialized setup: a single NVIDIA A100 GPU with 80 GB of memory.
Software Dependencies No The paper mentions "torchaudio's functional API" and "PyTorch implementation" but does not specify their version numbers.
Experiment Setup Yes We use batch sizes of 10 (clips < 64 frames) and 2 (clips ≥ 64 frames) for feature extraction. The testing video dataset is subjected to noise distortions, including low blur (σ ∈ [0.05, 0.75]), medium blur (σ ∈ [0.1, 1.5]), and high blur (σ ∈ [0.01, 3]), where σ represents the per-frame blur intensity, and a larger range indicates greater temporal inconsistency.
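The blur distortion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes grayscale `(T, H, W)` clips and that σ is drawn uniformly at random per frame from the given range (the exact sampling scheme is not stated in the quoted setup). The function names `gaussian_kernel1d` and `blur_video` are hypothetical.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    # Normalized 1-D Gaussian kernel truncated at ~3 sigma.
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / max(sigma, 1e-8)) ** 2)
    return k / k.sum()

def blur_video(video, sigma_range, seed=None):
    """Apply per-frame Gaussian blur, drawing sigma uniformly from
    sigma_range for each frame independently; a wider range yields a
    less temporally consistent distortion. video: (T, H, W) array."""
    rng = np.random.default_rng(seed)
    lo, hi = sigma_range
    out = np.empty(video.shape, dtype=float)
    for t, frame in enumerate(video):
        k = gaussian_kernel1d(rng.uniform(lo, hi))
        # Separable 2-D blur: convolve each row, then each column.
        tmp = np.apply_along_axis(np.convolve, 1, frame.astype(float), k, mode="same")
        out[t] = np.apply_along_axis(np.convolve, 0, tmp, k, mode="same")
    return out

# Example: the paper's "low blur" range applied to a dummy clip.
clip = np.random.default_rng(0).random((4, 32, 32))
low_blur = blur_video(clip, (0.05, 0.75), seed=0)
```

Because each frame samples its own σ, the "high blur" setting (σ ∈ [0.01, 3]) produces frames whose sharpness varies strongly over time, which is what makes it a probe of temporal consistency rather than just spatial quality.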