BERTScore: Evaluating Text Generation with BERT

Authors: Tianyi Zhang*, Varsha Kishore*, Felix Wu*, Kilian Q. Weinberger, Yoav Artzi

ICLR 2020

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "We evaluate using the outputs of 363 machine translation and image captioning systems. BERTSCORE correlates better with human judgments and provides stronger model selection performance than existing metrics." |
| Researcher Affiliation | Collaboration | Department of Computer Science and Cornell Tech, Cornell University ({vk352, fw245, kilian}@cornell.edu; {yoav}@cs.cornell.edu); ASAPP Inc. (tzhang@asapp.com) |
| Pseudocode | No | The paper describes the computation process in prose and via an illustration (Figure 1), but it does not contain a formal "Pseudocode" or "Algorithm" block. |
| Open Source Code | Yes | The code for BERTSCORE is available at https://github.com/Tiiiger/bert_score. |
| Open Datasets | Yes | "Our main evaluation corpus is the WMT18 metric evaluation dataset (Ma et al., 2018)... We use the human judgments of twelve submission entries from the COCO 2015 Captioning Challenge... We use the Quora Question Pair corpus (QQP; Iyer et al., 2017) and the adversarial paraphrases from the Paraphrase Adversaries from Word Scrambling dataset (PAWS; Zhang et al., 2019)." |
| Dataset Splits | Yes | "We use the WMT16 dataset (Bojar et al., 2016) as a validation set to select the best layer of each model (Appendix B)." |
| Hardware Specification | Yes | "Despite the use of a large pre-trained model, computing BERTSCORE is relatively fast. We are able to process 192.5 candidate-reference pairs/second using a GTX-1080Ti GPU." |
| Software Dependencies | No | The paper mentions using "Hugging Face models that use the GPT-2 tokenizer" and models from https://github.com/huggingface/pytorch-transformers, as well as the fairseq library, but it does not specify exact version numbers for these software dependencies. |
| Experiment Setup | Yes | "We use the 24-layer RoBERTa-large model for English tasks, the 12-layer BERT-Chinese model for Chinese tasks, and the 12-layer cased multilingual BERT-multi model for other languages... We use the WMT16 dataset... as a validation set to select the best layer of each model (Appendix B)." Table 8 also lists the specific best-layer choice for each model. |
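For context on the metric this report evaluates: the paper describes BERTSCORE in prose as greedy cosine-similarity matching between contextual token embeddings, aggregated into precision, recall, and F1. A minimal sketch of that computation is below; it assumes pre-computed embedding matrices in place of an actual BERT forward pass, and the function name and omission of idf-weighting and baseline rescaling are simplifications, not the released implementation.

```python
import numpy as np

def bertscore(cand_emb: np.ndarray, ref_emb: np.ndarray):
    """Sketch of BERTScore from pre-computed token embeddings.

    cand_emb: (num_candidate_tokens, dim) contextual embeddings
    ref_emb:  (num_reference_tokens, dim) contextual embeddings
    Returns (precision, recall, f1).
    """
    # Normalize rows so plain dot products become cosine similarities.
    cand = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = cand @ ref.T  # pairwise cosine-similarity matrix

    # Greedy matching: each token pairs with its most similar counterpart.
    precision = sim.max(axis=1).mean()  # candidate tokens -> best reference match
    recall = sim.max(axis=0).mean()     # reference tokens -> best candidate match
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

The paper's released package (linked in the table above) wraps this idea with real model inference, optional idf importance weighting, and per-layer embedding selection; the sketch only shows the core greedy-matching aggregation.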