Mutual Information Divergence: A Unified Metric for Multimodal Generative Models
Authors: Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To validate, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model. |
| Researcher Affiliation | Collaboration | Jin-Hwa Kim (NAVER AI Lab, SNU AIIS, Republic of Korea, j1nhwa.kim@navercorp.com); Yunji Kim and Jiyoung Lee (NAVER AI Lab, Republic of Korea, {yunji.kim,lee.j}@navercorp.com); Kang Min Yoo (NAVER AI Lab, NAVER CLOVA, SNU AIIS, Republic of Korea, kangmin.yoo@navercorp.com); Sang-Woo Lee (NAVER CLOVA, NAVER AI Lab, KAIST AI, Republic of Korea, sang.woo.lee@navercorp.com) |
| Pseudocode | No | No pseudocode or algorithm blocks are present in the paper. |
| Open Source Code | Yes | The code is available at https://github.com/naver-ai/mid.metric. |
| Open Datasets | Yes | COCO dataset [2], CUB [28, 29] and Flowers [30], Flickr8K-Expert [48], Flickr8K-CF [48], Pascal-50S [15], FOIL-COCO [40] |
| Dataset Splits | No | The paper evaluates existing models on established benchmarks and human-judgment datasets, but does not explicitly detail training/validation splits for its own experimental setup; it implicitly relies on each benchmark's standard test sets and reference data. |
| Hardware Specification | No | Only the compute platform is named ("The NAVER Smart Machine Learning (NSML) platform [50] has been used in the experiments."); no GPU, CPU, or memory specifications are given. |
| Software Dependencies | Yes | We use the CLIP (ViT-L/14) to extract image and text embedding vectors. |
| Experiment Setup | Yes | Without an explicit mention, we use the CLIP (ViT-L/14) to extract image and text embedding vectors. Note that it is crucial to use double-precision for numerical stability. We found that λ of 5e-4 generally works across all benchmark evaluations, except for the FOIL benchmark where we used λ of 1e-15, which was slightly better. Note that we use an identical prompt 'A photo depicts' for all caption embeddings as employed in RefCLIP-S [19]. |
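
The quoted setup is concrete enough to sketch. Below is a minimal example of the feature-extraction step, assuming the open-source `openai/CLIP` package: the model name `ViT-L/14`, the `'A photo depicts'` prompt, and the cast to double precision come from the paper, while the function names, L2 normalization, and single-item handling are illustrative and may differ from the reference implementation.

```python
# Sketch of CLIP (ViT-L/14) feature extraction as described in the paper.
# Assumes the openai/CLIP package (https://github.com/openai/CLIP).
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def embed_image(path: str) -> torch.Tensor:
    """Return an L2-normalized CLIP image embedding, cast to float64
    (the paper notes double precision is crucial for numerical stability)."""
    image = preprocess(Image.open(path)).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    return feat.double()

def embed_caption(caption: str) -> torch.Tensor:
    """Return an L2-normalized CLIP text embedding in float64, prefixing
    every caption with 'A photo depicts' as employed in RefCLIP-S."""
    tokens = clip.tokenize(f"A photo depicts {caption}").to(device)
    with torch.no_grad():
        feat = model.encode_text(tokens)
    feat = feat / feat.norm(dim=-1, keepdim=True)
    return feat.double()
```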
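The role of λ is not spelled out in the quoted setup. A plausible reading, stated here purely as an assumption, is that λ ridge-regularizes the empirical covariance estimates before a Gaussian mutual information quantity is evaluated over the CLIP embeddings. The sketch below follows that reading; it is not the authors' method, whose reference implementation is at https://github.com/naver-ai/mid.metric.

```python
# Hedged sketch: Gaussian mutual information between paired embeddings,
# with lambda assumed to act as a diagonal regularizer on covariances.
import torch

def gaussian_mutual_information(x: torch.Tensor, y: torch.Tensor,
                                lam: float = 5e-4) -> torch.Tensor:
    """Gaussian MI between paired embeddings x, y of shape (N, D), float64.

    MI(X; Y) = 0.5 * (logdet S_x + logdet S_y - logdet S_xy),
    where each covariance S has lam * I added for numerical stability
    (5e-4 is the value the paper reports working across most benchmarks).
    """
    assert x.dtype == torch.float64 and y.dtype == torch.float64

    def reg_cov(z: torch.Tensor) -> torch.Tensor:
        # Empirical covariance with a lambda-scaled identity added.
        zc = z - z.mean(dim=0, keepdim=True)
        cov = zc.T @ zc / (z.shape[0] - 1)
        return cov + lam * torch.eye(z.shape[1], dtype=z.dtype, device=z.device)

    joint = torch.cat([x, y], dim=1)
    return 0.5 * (torch.logdet(reg_cov(x)) + torch.logdet(reg_cov(y))
                  - torch.logdet(reg_cov(joint)))
```

Double precision matters here because the log-determinants of high-dimensional, near-singular covariance matrices are exactly the kind of computation that underflows or loses accuracy in float32, which is consistent with the paper's stability note.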