Evaluating Natural Language Generation via Unbalanced Optimal Transport

Authors: Yimeng Chen, Yanyan Lan, Ruibin Xiong, Liang Pang, Zhiming Ma, Xueqi Cheng

IJCAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | Experimental results on WMT18 and WMT19 show that our proposed metrics have the ability to produce more consistent evaluation results with human judgements, as compared with existing intrinsic metrics. |
| Researcher Affiliation | Academia | Yimeng Chen (1,3), Yanyan Lan (1,2), Ruibin Xiong (1,2), Liang Pang (1,2), Zhiming Ma (1,3), Xueqi Cheng (1,2). Affiliations: (1) University of Chinese Academy of Sciences; (2) CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, CAS; (3) Academy of Mathematics and Systems Science, CAS. |
| Pseudocode | No | The paper contains no structured pseudocode or algorithm blocks (no labeled algorithm sections or code-formatted procedures). |
| Open Source Code | Yes | Code is available at https://github.com/Beastlyprime/lazy-emd |
| Open Datasets | Yes | Our experiments are conducted on WMT18 [Ma et al., 2018] and WMT19 [Ma et al., 2019], two widely used machine translation datasets for evaluating NLG measures. |
| Dataset Splits | No | The paper does not provide explicit training/validation/test splits in the conventional machine-learning sense. Penalty parameters are tuned on specific language pairs (et-en, en-zh, en-cs), but these are not described as a general validation split used across all experiments. |
| Hardware Specification | No | The paper gives no hardware details (GPU/CPU models, processor speeds, memory amounts, or other machine specifications) for running its experiments. |
| Software Dependencies | No | The paper mentions "BERTScore v0.2.2" and the Python package POT, but gives no version number for POT, so complete version information for the key software dependencies is not available. |
| Experiment Setup | Yes | The regularization parameter in the Sinkhorn-scaling algorithm is set to 0.009. The penalty parameters differ across three data categories, based on the target language of the translation: English, Chinese, and others. For English the parameters are (0.23, 0.31), tuned on et-en in WMT18; for Chinese, (0.018, 0.97), tuned on en-zh in WMT19; for other languages, (0.009, 0.95), tuned on en-cs in WMT19. |
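The Sinkhorn-scaling setup reported above can be sketched with a minimal NumPy implementation of KL-penalized unbalanced Sinkhorn iterations. This is an illustration only, not the paper's implementation (the released code uses the POT package): the function name and defaults are assumptions, with the entropic regularization 0.009 and the English penalty pair (0.23, 0.31) taken from the experiment setup.

```python
# Illustrative sketch of KL-penalized unbalanced Sinkhorn scaling, the kind
# of optimization at the core of unbalanced-OT metrics. Names and defaults
# are assumptions; 0.009 and (0.23, 0.31) come from the setup above.
import numpy as np

def unbalanced_sinkhorn(a, b, M, reg=0.009, reg_m=(0.23, 0.31), n_iter=500):
    """Return the transport plan and cost between weight vectors a and b.

    a, b : nonnegative weights of the two token sets (need not sum to 1)
    M    : pairwise cost matrix, e.g. 1 - cosine similarity of embeddings
    reg  : entropic regularization of the Sinkhorn iteration
    reg_m: KL penalties relaxing the two marginal constraints
    """
    K = np.exp(-M / reg)                  # Gibbs kernel
    f1 = reg_m[0] / (reg_m[0] + reg)      # damping exponents induced by
    f2 = reg_m[1] / (reg_m[1] + reg)      # the KL marginal penalties
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** f1
        v = (b / (K.T @ u)) ** f2
    P = u[:, None] * K * v[None, :]       # soft, unbalanced transport plan
    return P, float((P * M).sum())        # plan and total transport cost
```

POT provides an equivalent routine as `ot.unbalanced.sinkhorn_unbalanced(a, b, M, reg, reg_m)`; the sketch above only shows the scaling iterations and omits the log-domain stabilization that a small `reg` such as 0.009 typically requires on larger cost matrices.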