On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation

Authors: Jianing Li, Yanyan Lan, Jiafeng Guo, Xueqi Cheng

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental In this paper, we try to reveal such relation in a theoretical approach. We prove that under certain conditions, a linear combination of quality and diversity constitutes a divergence metric between the generated distribution and the real distribution. We also show that the commonly used BLEU/Self-BLEU metric pair fails to match any divergence metric, thus propose CR/NRR as a substitute for quality/diversity metric pair. ... We show that BLEU-NSBLEU is significantly divergence-incompatible, by observing a phenomenon that ground truth text data are clearly outperformed over both BLEU and NSBLEU by some manually constructed model. We also show that CR/NRR are representative for quality/diversity evaluation respectively, while CND is representative for divergence evaluation.
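For context, the BLEU/Self-BLEU pair that the paper argues against can be computed roughly as in the sketch below. This is a minimal NLTK-based illustration only, not the authors' implementation; the function names, smoothing choice, and 4-gram weights are assumptions.

```python
# Illustrative sketch: BLEU as a quality proxy, Self-BLEU as an (inverse)
# diversity proxy. Not the authors' code; computed with NLTK for small samples.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

WEIGHTS = (0.25, 0.25, 0.25, 0.25)          # standard 4-gram BLEU weights
SMOOTH = SmoothingFunction().method1        # avoids zero scores on short texts

def bleu_quality(candidates, references):
    """Average BLEU of each candidate sentence against the reference set."""
    refs = [r.split() for r in references]
    scores = [sentence_bleu(refs, c.split(), weights=WEIGHTS,
                            smoothing_function=SMOOTH) for c in candidates]
    return sum(scores) / len(scores)

def self_bleu(candidates):
    """Average BLEU of each candidate against the remaining candidates;
    a higher Self-BLEU indicates lower diversity."""
    toks = [c.split() for c in candidates]
    scores = []
    for i, hyp in enumerate(toks):
        others = toks[:i] + toks[i + 1:]
        scores.append(sentence_bleu(others, hyp, weights=WEIGHTS,
                                    smoothing_function=SMOOTH))
    return sum(scores) / len(scores)
```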
Researcher Affiliation Academia 1 CAS Key Laboratory of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; 2 University of Chinese Academy of Sciences, Beijing, China.
Pseudocode No The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code No The paper does not provide a concrete statement or link for open-sourcing its code.
Open Datasets Yes We use two public datasets, MSCOCO Image Caption dataset (Chen et al., 2015) and EMNLP2017 WMT News dataset.
Dataset Splits No The paper describes how candidate and reference sets are used for evaluation (e.g., "We use 50,000 sentences as candidate set and another 50,000 as reference set for each dataset"), and how synthetic data or data for the temperature-sweep experiments are generated. However, it does not provide conventional train/validation/test splits for reproducing model training, since its focus is on metric evaluation rather than training new models from scratch with a fixed dataset split.
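A minimal sketch of the candidate/reference split quoted above, assuming the corpus is already available as a list of sentences; the function name, shuffling, and seed are illustrative assumptions rather than the paper's procedure.

```python
import random

def make_eval_sets(sentences, n_candidate=50_000, n_reference=50_000, seed=0):
    """Draw disjoint candidate and reference sets, mirroring the
    50,000 / 50,000 evaluation split described in the paper."""
    rng = random.Random(seed)
    pool = list(sentences)
    rng.shuffle(pool)
    assert len(pool) >= n_candidate + n_reference, "corpus too small"
    candidate_set = pool[:n_candidate]
    reference_set = pool[n_candidate:n_candidate + n_reference]
    return candidate_set, reference_set
```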
Hardware Specification No The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. It mentions training models but not the hardware used.
Software Dependencies No The paper mentions using "SGD under the TensorFlow framework" and the "Adam optimizer" but does not specify any version numbers for these software components, which is required for reproducibility.
Experiment Setup Yes We set λ = 2.0 in our experiments, so that QDisc = U(Q) − U(P), and the denominator in DRate is also calculated through such an optimization-based method. ... we set |V| = 4, L = 3, m = 1, n = 2, and apply SGD under the TensorFlow framework. ... We use 50,000 sentences as the candidate set and another 50,000 as the reference set for each dataset. ... The RNNLM consists of an embedding layer, an LSTM layer, and a fully-connected output layer. The embedding dimension and number of hidden nodes are both set to 128. We train the model using the Adam (Kingma & Ba, 2014) optimizer with learning rate 0.001 for 30 epochs.
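Based only on the setup quoted above, the RNNLM could be reconstructed roughly as follows with TensorFlow/Keras. The vocabulary size and the training data pipeline are placeholders not reported in the excerpt, and this is an assumed reconstruction, not the authors' original code.

```python
# Sketch of the described RNNLM: embedding -> LSTM -> fully-connected output,
# 128-dim embedding and hidden state, Adam with learning rate 0.001, 30 epochs.
import tensorflow as tf

VOCAB_SIZE = 5000  # placeholder; the excerpt does not report the vocabulary size

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(VOCAB_SIZE, 128),
    tf.keras.layers.LSTM(128, return_sequences=True),
    tf.keras.layers.Dense(VOCAB_SIZE),  # logits over the next token
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
)

# model.fit(input_token_ids, next_token_ids, epochs=30)  # data tensors not shown
```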