BARTScore: Evaluating Generated Text as Text Generation

Authors: Weizhe Yuan, Graham Neubig, Pengfei Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we evaluate different variants of BARTSCORE from 7 perspectives on 16 datasets. BARTSCORE achieves the best performance in 16 of 22 test settings against existing top-scoring metrics.
Researcher Affiliation | Academia | Weizhe Yuan, Carnegie Mellon University, weizhey@cs.cmu.edu; Graham Neubig, Carnegie Mellon University, gneubig@cs.cmu.edu; Pengfei Liu, Carnegie Mellon University, pliu3@cs.cmu.edu
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code to calculate BARTScore is available at https://github.com/neulab/BARTScore
Open Datasets | Yes | The datasets we use are summarized in Tab. 1. We obtain the source language sentences, machine-translated texts, and reference texts from the WMT19 metrics shared task [44]. (1) REALSumm [4] is a meta-evaluation dataset for text summarization... (2) SummEval [13] is a collection of human judgments... (3) NeR18: the NEWSROOM dataset [18]... For factuality: (1) Rank19 [14] is used to meta-evaluate factuality metrics; (2) QAGS20 [66] collected 235 test outputs... For data-to-text: BAGEL [45] provides information about restaurants; SFHOT [69] provides information about hotels; SFRES [69] provides information about restaurants.
Dataset Splits | Yes | For the machine translation task, due to the more expensive computational cost brought by larger test sets, we first use WMT18 [43] as a development set to search for the single best prompt, obtaining the phrase "Such as,", which is then used for the test language pairs.
Hardware Specification | Yes | We used two 2080Ti GPUs, and the training time is less than one hour.
Software Dependencies | No | The paper mentions models "available off-the-shelf in Huggingface Transformers [70]" but does not specify a version number for the Transformers library or other software dependencies.
Experiment Setup | Yes | We used a random subset of 30,000 data points and fine-tuned for one epoch with a batch size of 20 and a learning rate of 5e-5.
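For context when reproducing the setup above: the paper defines BARTScore as a weighted sum of per-token log-probabilities of the target text y given the source x under a seq2seq model, BARTScore = sum_{t=1}^{m} omega_t * log p(y_t | y_{<t}, x, theta). A minimal self-contained sketch of that scoring rule, using hypothetical toy log-probabilities in place of a real BART model (the function name and inputs are illustrative, not the released repository's API):

```python
import math

def bartscore(token_logprobs, weights=None):
    """Length-normalized weighted sum of per-token log-probabilities.

    token_logprobs: log p(y_t | y_<t, x) for each target token, as a
    real seq2seq model (e.g. BART) would assign them; toy values here.
    weights: optional per-token weights omega_t (uniform by default).
    """
    if weights is None:
        weights = [1.0] * len(token_logprobs)  # uniform omega_t
    total = sum(w * lp for w, lp in zip(weights, token_logprobs))
    return total / len(token_logprobs)

# A fluent, faithful candidate would receive high token probabilities...
good = [math.log(0.9), math.log(0.8), math.log(0.85)]
# ...while a poor candidate would receive low ones.
bad = [math.log(0.2), math.log(0.1), math.log(0.3)]

assert bartscore(good) > bartscore(bad)
```

In the released code, these per-token log-probabilities come from a (possibly fine-tuned) BART checkpoint, and direction of scoring (source-to-hypothesis, hypothesis-to-reference, etc.) selects the evaluation perspective.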