Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Authors: Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model.
Researcher Affiliation | Collaboration | Jin-Hwa Kim, NAVER AI Lab, SNU AIIS, Republic of Korea, j1nhwa.kim@navercorp.com; Yunji Kim and Jiyoung Lee, NAVER AI Lab, Republic of Korea, {yunji.kim,lee.j}@navercorp.com; Kang Min Yoo, NAVER AI Lab, CLOVA, SNU AIIS, Republic of Korea, kangmin.yoo@navercorp.com; Sang-Woo Lee, NAVER CLOVA, AI Lab, KAIST AI, Republic of Korea, sang.woo.lee@navercorp.com
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | The code is available at https://github.com/naver-ai/mid.metric.
Open Datasets | Yes | COCO [2], CUB [28, 29], Flowers [30], Flickr8K-Expert [48], Flickr8K-CF [48], Pascal-50S [15], and FOIL-COCO [40].
Dataset Splits | No | The paper evaluates existing models on established benchmarks and human-judgment datasets, but it does not explicitly specify the training/validation splits used when computing the metric; it implicitly relies on the standard test and reference splits of those benchmarks.
Hardware Specification | No | No hardware details are given beyond the computing platform; the NAVER Smart Machine Learning (NSML) platform [50] was used in the experiments.
Software Dependencies | Yes | We use the CLIP (ViT-L/14) to extract image and text embedding vectors.
Experiment Setup | Yes | Without an explicit mention, we use the CLIP (ViT-L/14) to extract image and text embedding vectors. Note that it is crucial to use double precision for numerical stability. We found that λ of 5e-4 generally works across all benchmark evaluations, except for the FOIL benchmark, where we used λ of 1e-15, which was slightly better. Note that we use an identical prompt 'A photo depicts' for all caption embeddings, as employed in RefCLIP-S [19].
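For context, here is a minimal Python sketch of the setup quoted in the last two rows: CLIP (ViT-L/14) feature extraction with the 'A photo depicts' prompt, double-precision arithmetic, and a diagonal regularizer λ on the covariance estimates. It is not the authors' implementation (available at the repository linked above) and only approximates the idea with a generic Gaussian mutual-information estimate; the names embed, gaussian_mi, LAMBDA, and PROMPT, as well as the L2 normalization of the features, are assumptions made for this illustration.

# Minimal sketch (not the authors' implementation): CLIP ViT-L/14 features in
# double precision, the "A photo depicts " caption prompt, and a diagonal
# regularizer lambda, wired into a generic Gaussian mutual-information estimate.
# The exact MID definition is in https://github.com/naver-ai/mid.metric.
import clip                      # OpenAI CLIP package
import numpy as np
import torch
from PIL import Image

LAMBDA = 5e-4                    # value reported to work across benchmarks (1e-15 for FOIL)
PROMPT = "A photo depicts "      # prompt prepended to every caption, as in RefCLIP-S

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def embed(image_paths, captions):
    """Return CLIP image/text features as float64 (double-precision) arrays."""
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    tokens = clip.tokenize([PROMPT + c for c in captions], truncate=True).to(device)
    x = model.encode_image(images)
    y = model.encode_text(tokens)
    # L2 normalization is a common convention for CLIP features (an assumption here).
    x = torch.nn.functional.normalize(x, dim=-1).double().cpu().numpy()
    y = torch.nn.functional.normalize(y, dim=-1).double().cpu().numpy()
    return x, y

def _reg_cov(z, lam):
    """Sample covariance with lam added to the diagonal for numerical stability."""
    return np.cov(z, rowvar=False) + lam * np.eye(z.shape[1])

def gaussian_mi(x, y, lam=LAMBDA):
    """I(X; Y) = 0.5 * (logdet(Sx) + logdet(Sy) - logdet(Sxy)) under a joint
    Gaussian assumption; double precision keeps the log-determinants stable."""
    sx, sy = _reg_cov(x, lam), _reg_cov(y, lam)
    sxy = _reg_cov(np.concatenate([x, y], axis=1), lam)
    logdet = lambda s: np.linalg.slogdet(s)[1]
    return 0.5 * (logdet(sx) + logdet(sy) - logdet(sxy))

# Example (illustrative file names and captions):
# x, y = embed(["img1.jpg", "img2.jpg"], ["a dog", "a cat"]); score = gaussian_mi(x, y)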