BARTScore: Evaluating Generated Text as Text Generation
Authors: Weizhe Yuan, Graham Neubig, Pengfei Liu
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimentally, we evaluate different variants of BARTScore from 7 perspectives on 16 datasets. BARTScore achieves the best performance in 16 of 22 test settings against existing top-scoring metrics. |
| Researcher Affiliation | Academia | Weizhe Yuan, Carnegie Mellon University, weizhey@cs.cmu.edu; Graham Neubig, Carnegie Mellon University, gneubig@cs.cmu.edu; Pengfei Liu, Carnegie Mellon University, pliu3@cs.cmu.edu |
| Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code to calculate BARTScore is available at https://github.com/neulab/BARTScore (see the usage sketch below the table). |
| Open Datasets | Yes | The datasets we use are summarized in Tab. 1. We obtain the source language sentences, machine-translated texts and reference texts from the WMT19 metrics shared task [44]. (1) REALSumm [4] is a meta-evaluation dataset for text summarization... (2) SummEval [13] is a collection of human judgments... (3) NeR18: The NEWSROOM dataset [18]... (1) Rank19 [14] is used to meta-evaluate factuality metrics. (2) QAGS20 [66] collected 235 test outputs... BAGEL [45] provides information about restaurants. SFHOT [69] provides information about hotels... SFRES [69] provides information about restaurants... |
| Dataset Splits | Yes | For the machine translation task, due to the more expensive computational cost brought by larger text sets, we first use WMT18 [43] as a development set to search for one best prompt and obtain the phrase "Such as", which is then used for the test language pairs. |
| Hardware Specification | Yes | We used two 2080Ti GPUs, and the training time is less than one hour. |
| Software Dependencies | No | The paper mentions "available off-the-shelf in Huggingface Transformers [70]" but does not specify a version number for the Transformers library or other software dependencies. |
| Experiment Setup | Yes | We used a random subset of 30,000 data and fine-tuned for one epoch with a batch size of 20 and a learning rate of 5e-5. (See the configuration sketch below the table.) |
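
As noted in the Open Source Code row, the metric is distributed through the neulab/BARTScore repository. The snippet below follows the interface documented in that repository's README (`BARTScorer` loaded with a `facebook/bart-large-cnn` checkpoint); exact signatures may differ across repository versions, and the example inputs are illustrative.

```python
# Usage sketch following the BARTScore repository's README
# (https://github.com/neulab/BARTScore); signatures may vary by version.
from bart_score import BARTScorer  # provided by the repository

# The paper's CNNDM variant uses BART fine-tuned on CNN/DailyMail.
bart_scorer = BARTScorer(device='cuda:0', checkpoint='facebook/bart-large-cnn')

# score() returns one log-likelihood-based score per (source, hypothesis)
# pair; higher (less negative) is better.
scores = bart_scorer.score(
    ['This is a source document about restaurants.'],  # sources
    ['A generated description of a restaurant.'],      # hypotheses
    batch_size=4,
)
print(scores)
```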
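The paper's framing, evaluating generated text as text generation, amounts to scoring a hypothesis by its log-likelihood under a sequence-to-sequence model. Below is a minimal from-scratch sketch of that idea using Hugging Face Transformers; the uniform token weighting and the `facebook/bart-large-cnn` checkpoint are assumptions here, and the repository above remains the reference implementation.

```python
# Minimal sketch of the BARTScore idea: the average log-probability of target
# tokens given the source under BART. Checkpoint choice and uniform token
# weights are assumptions; see the official repository for the real metric.
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn').eval()

def bart_score(src: str, tgt: str) -> float:
    """Mean log p(tgt | src); higher (less negative) means a better hypothesis."""
    src_ids = tokenizer(src, return_tensors='pt', truncation=True).input_ids
    tgt_ids = tokenizer(tgt, return_tensors='pt', truncation=True).input_ids
    with torch.no_grad():
        # With labels supplied, the model returns token-averaged cross-entropy.
        loss = model(input_ids=src_ids, labels=tgt_ids).loss
    return -loss.item()

print(bart_score('The weather is nice today.', 'Today the weather is nice.'))
```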
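Finally, the Experiment Setup and Hardware Specification rows pin down only a few training details. The sketch below expresses them as Hugging Face `Seq2SeqTrainingArguments`: the batch size (20), learning rate (5e-5), and single epoch come from the paper, while `output_dir` and the even per-device split across the two reported 2080Ti GPUs are assumptions; the 30,000-example training subset itself is not reproduced here.

```python
# Sketch of the reported fine-tuning configuration. Only batch size, learning
# rate, and epoch count come from the paper; the rest are assumptions.
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir='bartscore-finetune',  # assumption: any scratch directory
    per_device_train_batch_size=10,   # 2 GPUs x 10 = total batch size 20 (paper)
    learning_rate=5e-5,               # from the paper
    num_train_epochs=1,               # from the paper
)
```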