Mutual Information Divergence: A Unified Metric for Multimodal Generative Models

Authors: Jin-Hwa Kim, Yunji Kim, Jiyoung Lee, Kang Min Yoo, Sang-Woo Lee

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To validate, we extensively compare it with competing metrics using carefully-generated or human-annotated judgments in text-to-image generation and image captioning tasks. The proposed MID significantly outperforms the competitive methods by having consistency across benchmarks, sample parsimony, and robustness toward the exploited CLIP model.
Researcher Affiliation | Collaboration | Jin-Hwa Kim, NAVER AI Lab, SNU AIIS, Republic of Korea, j1nhwa.kim@navercorp.com; Yunji Kim and Jiyoung Lee, NAVER AI Lab, Republic of Korea, {yunji.kim,lee.j}@navercorp.com; Kang Min Yoo, NAVER AI Lab, CLOVA, SNU AIIS, Republic of Korea, kangmin.yoo@navercorp.com; Sang-Woo Lee, NAVER CLOVA, AI Lab, KAIST AI, Republic of Korea, sang.woo.lee@navercorp.com
Pseudocode | No | No pseudocode or algorithm blocks are present in the paper.
Open Source Code | Yes | The code is available at https://github.com/naver-ai/mid.metric.
Open Datasets | Yes | COCO [2], CUB [28, 29], Flowers [30], Flickr8K-Expert [48], Flickr8K-CF [48], Pascal-50S [15], and FOIL-COCO [40].
Dataset Splits | No | The paper evaluates existing models on established benchmarks and human-judgment datasets, but it does not explicitly specify the training/validation splits used when computing the metric; it implicitly relies on the standard test and reference splits of those benchmarks.
Hardware Specification | No | No hardware details are given beyond the computing platform; the NAVER Smart Machine Learning (NSML) platform [50] was used in the experiments.
Software Dependencies | Yes | We use the CLIP (ViT-L/14) to extract image and text embedding vectors.
Experiment Setup | Yes | Without an explicit mention, we use the CLIP (ViT-L/14) to extract image and text embedding vectors. Note that it is crucial to use double precision for numerical stability. We found that λ of 5e-4 generally works across all benchmark evaluations, except for the FOIL benchmark, where we used λ of 1e-15, which was slightly better. Note that we use an identical prompt 'A photo depicts' for all caption embeddings, as employed in RefCLIP-S [19].
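For context, here is a minimal Python sketch of the setup quoted in the last two rows: CLIP (ViT-L/14) feature extraction with the 'A photo depicts' prompt, double-precision arithmetic, and a diagonal regularizer λ on the covariance estimates. It is not the authors' implementation (available at the repository linked above) and only approximates the idea with a generic Gaussian mutual-information estimate; the names embed, gaussian_mi, LAMBDA, and PROMPT, as well as the L2 normalization of the features, are assumptions made for this illustration.

# Minimal sketch (not the authors' implementation): CLIP ViT-L/14 features in
# double precision, the "A photo depicts " caption prompt, and a diagonal
# regularizer lambda, wired into a generic Gaussian mutual-information estimate.
# The exact MID definition is in https://github.com/naver-ai/mid.metric.
import clip                      # OpenAI CLIP package
import numpy as np
import torch
from PIL import Image

LAMBDA = 5e-4                    # value reported to work across benchmarks (1e-15 for FOIL)
PROMPT = "A photo depicts "      # prompt prepended to every caption, as in RefCLIP-S

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

@torch.no_grad()
def embed(image_paths, captions):
    """Return CLIP image/text features as float64 (double-precision) arrays."""
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    tokens = clip.tokenize([PROMPT + c for c in captions], truncate=True).to(device)
    x = model.encode_image(images)
    y = model.encode_text(tokens)
    # L2 normalization is a common convention for CLIP features (an assumption here).
    x = torch.nn.functional.normalize(x, dim=-1).double().cpu().numpy()
    y = torch.nn.functional.normalize(y, dim=-1).double().cpu().numpy()
    return x, y

def _reg_cov(z, lam):
    """Sample covariance with lam added to the diagonal for numerical stability."""
    return np.cov(z, rowvar=False) + lam * np.eye(z.shape[1])

def gaussian_mi(x, y, lam=LAMBDA):
    """I(X; Y) = 0.5 * (logdet(Sx) + logdet(Sy) - logdet(Sxy)) under a joint
    Gaussian assumption; double precision keeps the log-determinants stable."""
    sx, sy = _reg_cov(x, lam), _reg_cov(y, lam)
    sxy = _reg_cov(np.concatenate([x, y], axis=1), lam)
    logdet = lambda s: np.linalg.slogdet(s)[1]
    return 0.5 * (logdet(sx) + logdet(sy) - logdet(sxy))

# Example (illustrative file names and captions):
# x, y = embed(["img1.jpg", "img2.jpg"], ["a dog", "a cat"]); score = gaussian_mi(x, y)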