Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Beyond FVD: Enhanced Evaluation Metrics for Video Generation Distribution Quality
Authors: Ge Ya Luo, Gian M Favero, Zhi Hao Luo, Alexia Jolicoeur-Martineau, Christopher Pal
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric, requiring only 16% of the samples to reach its steady value, while increasing alignment with human evaluation by 34%, on average. |
| Researcher Affiliation | Collaboration | Mila, Quebec Artificial Intelligence Institute; Alexia Jolicoeur-Martineau, Samsung SAIT AI Lab, Montreal |
| Pseudocode | Yes | Algorithm 1: I3D Autoencoder Configuration |
| Open Source Code | Yes | Project page: https://oooolga.github.io/JEDi.github.io/. |
| Open Datasets | Yes | Our experiments on multiple open-source datasets show clear evidence that it is a superior alternative to the widely used FVD metric... Specifically, we extract a maximum of 5000 features from the training set of each dataset, which include: Anime Run-v2 (Siyao et al., 2022), BAIR (Ebert et al., 2017), BDD100k (Yu et al., 2020), DAVIS (Pont-Tuset et al., 2018), Fashion Modeling (Zablotskaia et al., 2019), HMDB-51 (Kuehne et al., 2011), How2Sign (Duarte et al., 2021), KITTI (Geiger et al., 2013), Something-Something-v2 (Goyal et al., 2017), Sky Scene (Xiong et al., 2018), and UCF-101 (Soomro et al., 2012). |
| Dataset Splits | Yes | Specifically, we extract a maximum of 5000 features from the training set of each dataset... transform 5,256 I3D features (up to 500 samples from each of the eleven datasets' test sets) into lower-dimensional spaces for visualizations. |
| Hardware Specification | Yes | Feature extraction is performed on a single NVIDIA RTX 4080 GPU with float32 precision. However, VideoMAE-v2 features require a more specialized setup: a single NVIDIA A100 GPU with 80 GB of memory. |
| Software Dependencies | No | The paper mentions 'torchaudio's functional API' and 'PyTorch implementation' but does not specify their version numbers. |
| Experiment Setup | Yes | We use batch sizes of 10 (clips < 64 frames) and 2 (clips ≥ 64 frames) for feature extraction. The testing video dataset is subjected to noise distortions, including low blur (σ ∈ [0.05, 0.75]), medium blur (σ ∈ [0.1, 1.5]), and high blur (σ ∈ [0.01, 3]), where σ represents the per-frame blur intensity, and a larger range indicates greater temporal inconsistency. |
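The blur-distortion setup quoted above can be sketched in a few lines: σ is drawn independently per frame from the level's range, so a wider range yields frames with more varied blur, i.e. greater temporal inconsistency. This is a minimal illustration, not the authors' code; the function name `distort_video` and the use of `scipy.ndimage.gaussian_filter` are our assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

# Distortion levels as reported in the paper's experiment setup.
LEVELS = {"low": (0.05, 0.75), "medium": (0.1, 1.5), "high": (0.01, 3.0)}

def distort_video(frames, sigma_range, seed=0):
    """Blur each frame with a sigma drawn uniformly from sigma_range.

    frames: array of shape (T, H, W, C). Because sigma is resampled per
    frame, a wider range produces more frame-to-frame variation in blur
    strength (the paper's notion of temporal inconsistency).

    Illustrative sketch only: helper name and blur implementation are
    assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    lo, hi = sigma_range
    out = np.empty_like(frames, dtype=np.float32)
    for t, frame in enumerate(frames):
        sigma = rng.uniform(lo, hi)
        # Blur spatial axes (H, W) only; leave the channel axis untouched.
        out[t] = gaussian_filter(frame.astype(np.float32),
                                 sigma=(sigma, sigma, 0))
    return out

# Example: apply the "high" blur level to a dummy clip (T x H x W x C).
video = np.random.default_rng(1).random((8, 32, 32, 3), dtype=np.float32)
blurred = distort_video(video, LEVELS["high"])
```

Sampling σ per frame (rather than once per clip) is what makes the wide "high" range a temporal-consistency stressor and not just a stronger spatial blur.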