BARTScore: Evaluating Generated Text as Text Generation

Authors: Weizhe Yuan, Graham Neubig, Pengfei Liu

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimentally, we evaluate different variants of BARTSCORE from 7 perspectives on 16 datasets. BARTSCORE achieves the best performance in 16 of 22 test settings against existing top-scoring metrics.
Researcher Affiliation | Academia | Weizhe Yuan, Carnegie Mellon University, weizhey@cs.cmu.edu; Graham Neubig, Carnegie Mellon University, gneubig@cs.cmu.edu; Pengfei Liu, Carnegie Mellon University, pliu3@cs.cmu.edu
Pseudocode | No | The paper does not include pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code to calculate BARTScore is available at https://github.com/neulab/BARTScore
Open Datasets | Yes | The datasets we use are summarized in Tab. 1. We obtain the source language sentences, machine-translated texts, and reference texts from the WMT19 metrics shared task [44]. (1) REALSumm [4] is a meta-evaluation dataset for text summarization... (2) SummEval [13] is a collection of human judgments... (3) NeR18: the NEWSROOM dataset [18]... For factuality: (1) Rank19 [14] is used to meta-evaluate factuality metrics; (2) QAGS20 [66] collected 235 test outputs... For data-to-text: BAGEL [45] provides information about restaurants; SFHOT [69] provides information about hotels; SFRES [69] provides information about restaurants.
Dataset Splits | Yes | For the machine translation task, due to the more expensive computational cost brought by larger test sets, we first use WMT18 [43] as a development set to search for the single best prompt, obtaining the phrase "Such as,", which is then used for the test language pairs.
Hardware Specification | Yes | We used two 2080Ti GPUs, and the training time is less than one hour.
Software Dependencies | No | The paper mentions models "available off-the-shelf in Huggingface Transformers [70]" but does not specify a version number for the Transformers library or other software dependencies.
Experiment Setup | Yes | We used a random subset of 30,000 data points and fine-tuned for one epoch with a batch size of 20 and a learning rate of 5e-5.
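For context when reproducing the setup above: the paper defines BARTScore as a weighted sum of per-token log-probabilities of the target text y given the source x under a seq2seq model, BARTScore = sum_{t=1}^{m} omega_t * log p(y_t | y_{<t}, x, theta). A minimal self-contained sketch of that scoring rule, using hypothetical toy log-probabilities in place of a real BART model (the function name and inputs are illustrative, not the released repository's API):

```python
import math

def bartscore(token_logprobs, weights=None):
    """Length-normalized weighted sum of per-token log-probabilities.

    token_logprobs: log p(y_t | y_<t, x) for each target token, as a
    real seq2seq model (e.g. BART) would assign them; toy values here.
    weights: optional per-token weights omega_t (uniform by default).
    """
    if weights is None:
        weights = [1.0] * len(token_logprobs)  # uniform omega_t
    total = sum(w * lp for w, lp in zip(weights, token_logprobs))
    return total / len(token_logprobs)

# A fluent, faithful candidate would receive high token probabilities...
good = [math.log(0.9), math.log(0.8), math.log(0.85)]
# ...while a poor candidate would receive low ones.
bad = [math.log(0.2), math.log(0.1), math.log(0.3)]

assert bartscore(good) > bartscore(bad)
```

In the released code, these per-token log-probabilities come from a (possibly fine-tuned) BART checkpoint, and direction of scoring (source-to-hypothesis, hypothesis-to-reference, etc.) selects the evaluation perspective.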