Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Beyond FVD: Enhanced Evaluation Metrics for Video Generation Distribution Quality

Authors: Ge Ya Luo, Gian M Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average.
Researcher Affiliation Collaboration Mila – Quebec Artificial Intelligence Institute; Alexia Jolicoeur-Martineau: Samsung SAIT AI Lab, Montreal
Pseudocode Yes Algorithm 1: I3D Autoencoder Configuration
Open Source Code Yes Project page: https://oooolga.github.io/JEDi.github.io/.
Open Datasets Yes Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric... Specifically, we extract a maximum of 5000 features from the training set of each dataset, which include: Anime Run-v2 (Siyao et al., 2022), BAIR (Ebert et al., 2017), BDD100k (Yu et al., 2020), DAVIS (Pont-Tuset et al., 2018), Fashion Modeling (Zablotskaia et al., 2019), HMDB-51 (Kuehne et al., 2011), How2Sign (Duarte et al., 2021), KITTI (Geiger et al., 2013), Something-Something-v2 (Goyal et al., 2017), Sky Scene (Xiong et al., 2018), and UCF-101 (Soomro et al., 2012).
Dataset Splits Yes Specifically, we extract a maximum of 5000 features from the training set of each dataset... transform 5,256 I3D features (up to 500 samples from each of the eleven datasets' testing sets) into lower-dimensional spaces for visualizations.
Hardware Specification Yes Feature extraction is performed on a single NVIDIA RTX 4080 GPU with float32 precision. However, VideoMAE-v2 features require a more specialized setup: a single NVIDIA A100 GPU with 80 GB of memory.
Software Dependencies No The paper mentions "torchaudio's functional API" and "PyTorch implementation" but does not specify their version numbers.
Experiment Setup Yes We use batch sizes of 10 (clips < 64 frames) and 2 (clips ≥ 64 frames) for feature extraction. The testing video dataset is subjected to noise distortions, including low blur (σ ∈ [0.05, 0.75]), medium blur (σ ∈ [0.1, 1.5]), and high blur (σ ∈ [0.01, 3]), where σ represents the per-frame blur intensity, and a larger range indicates greater temporal inconsistency.
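The blur distortion described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes grayscale `(T, H, W)` clips and that σ is drawn uniformly at random per frame from the given range (the exact sampling scheme is not stated in the quoted setup). The function names `gaussian_kernel1d` and `blur_video` are hypothetical.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    # Normalized 1-D Gaussian kernel truncated at ~3 sigma.
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / max(sigma, 1e-8)) ** 2)
    return k / k.sum()

def blur_video(video, sigma_range, seed=None):
    """Apply per-frame Gaussian blur, drawing sigma uniformly from
    sigma_range for each frame independently; a wider range yields a
    less temporally consistent distortion. video: (T, H, W) array."""
    rng = np.random.default_rng(seed)
    lo, hi = sigma_range
    out = np.empty(video.shape, dtype=float)
    for t, frame in enumerate(video):
        k = gaussian_kernel1d(rng.uniform(lo, hi))
        # Separable 2-D blur: convolve each row, then each column.
        tmp = np.apply_along_axis(np.convolve, 1, frame.astype(float), k, mode="same")
        out[t] = np.apply_along_axis(np.convolve, 0, tmp, k, mode="same")
    return out

# Example: the paper's "low blur" range applied to a dummy clip.
clip = np.random.default_rng(0).random((4, 32, 32))
low_blur = blur_video(clip, (0.05, 0.75), seed=0)
```

Because each frame samples its own σ, the "high blur" setting (σ ∈ [0.01, 3]) produces frames whose sharpness varies strongly over time, which is what makes it a probe of temporal consistency rather than just spatial quality.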