BERTScore: Evaluating Text Generation with BERT

Authors: Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, Yoav Artzi

ICLR 2020

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics." |
| Researcher Affiliation | Collaboration | Department of Computer Science and Cornell Tech, Cornell University ({vk352, fw245, kilian}@cornell.edu; {yoav}@cs.cornell.edu); ASAPP Inc. (tzhang@asapp.com) |
| Pseudocode | No | The paper describes the computation process in prose and via an illustration (Figure 1), but it does not contain a formal "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | The code for BERTSCORE is available at https://github.com/Tiiiger/bert_score. |
| Open Datasets | Yes | "Our main evaluation corpus is the WMT18 metric evaluation dataset (Ma et al., 2018)... We use the human judgments of twelve submission entries from the COCO 2015 Captioning Challenge... We use the Quora Question Pair corpus (QQP; Iyer et al., 2017) and the adversarial paraphrases from the Paraphrase Adversaries from Word Scrambling dataset (PAWS; Zhang et al., 2019)." |
| Dataset Splits | Yes | "We use the WMT16 dataset (Bojar et al., 2016) as a validation set to select the best layer of each model (Appendix B)." |
| Hardware Specification | Yes | "Despite the use of a large pre-trained model, computing BERTSCORE is relatively fast. We are able to process 192.5 candidate-reference pairs/second using a GTX-1080Ti GPU." |
| Software Dependencies | No | The paper mentions using "Hugging Face models that use the GPT-2 tokenizer" and models from https://github.com/huggingface/pytorch-transformers, as well as the fairseq library, but it does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | "We use the 24-layer RoBERTa-large model for English tasks, the 12-layer BERT-Chinese model for Chinese tasks, and the 12-layer cased multilingual BERT-multi model for other languages... We use the WMT16 dataset... as a validation set to select the best layer of each model (Appendix B)." Table 8 also lists the specific best-layer choice for each model. |
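For context on the metric this report evaluates: the paper describes BERTSCORE in prose as greedy cosine-similarity matching between contextual token embeddings, aggregated into precision, recall, and F1. A minimal sketch of that computation is below; it assumes pre-computed embedding matrices in place of an actual BERT forward pass, and the function name and omission of idf-weighting and baseline rescaling are simplifications, not the released implementation.

```python
import numpy as np

def bertscore(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """Sketch of BERTScore from pre-computed token embeddings.

    cand_emb: (num_candidate_tokens, dim) contextual embeddings
    ref_emb:  (num_reference_tokens, dim) contextual embeddings
    Returns (precision, recall, f1).
    """
    # Normalize rows so plain dot products become cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # pairwise cosine-similarity matrix

    # Greedy matching: each token pairs with its most similar counterpart.
    precision = sim.max(axis=1).mean()  # candidate tokens -> best reference match
    recall = sim.max(axis=0).mean()     # reference tokens -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The paper's released package (linked in the table above) wraps this idea with real model inference, optional idf importance weighting, and per-layer embedding selection; the sketch only shows the core greedy-matching aggregation.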