SMART: Sentences as Basic Units for Text Evaluation
Authors: Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. |
| Researcher Affiliation | Industry | Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan Google Research {reinald, peterjliu, yaozhaoyz, shashinarayan}@google.com |
| Pseudocode | Yes | Figure 2: Python pseudocode of the soft version of Longest Common Subsequence (Soft-LCS) given two sets of summary sentences X and Y. (An illustrative sketch follows the table.) |
| Open Source Code | Yes | github.com/google-research/google-research/tree/master/smart_eval |
| Open Datasets | Yes | We conducted experiments on the SummEval dataset (Fabbri et al., 2021), a document summarization meta-evaluation suite consisting of summaries from the CNN/DM dataset (Hermann et al., 2015). We conducted experiments on the TRUE benchmark (Honovich et al., 2022)... |
| Dataset Splits | No | The paper mentions datasets like SummEval and the TRUE benchmark, and refers to a 'development set' for the TRUE benchmark, but it does not specify concrete training, validation, or test split percentages or counts for its experiments. |
| Hardware Specification | No | The paper mentions the general use of 'accelerators (GPUs/TPUs)' but does not provide specific details such as GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions 'Sentences are split using nltk' and refers to a 'BLEURT-20 checkpoint' and 'T5-11B', but it does not provide specific version numbers for software libraries like NLTK or other dependencies beyond model checkpoints. |
| Experiment Setup | Yes | For BLEURT (Sellam et al., 2020), we used the BLEURT-20 checkpoint suggested by the authors, which also supports non-English languages. For ANLI, we used the same implementation as in Honovich et al. (2022), where T5-11B is fine-tuned with 25K training steps on ANLI (Nie et al., 2020), treating both contradiction and neutral pairs as not entailed. (A BLEURT usage sketch follows the table.) |
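The Pseudocode row points to the paper's Figure 2, which gives Python pseudocode for Soft-LCS; the authors' exact implementation lives in the smart_eval repository linked above. Purely as an illustrative reconstruction, not the authors' code, a soft longest-common-subsequence recurrence over two lists of sentences could look like the sketch below, where `match` is a stand-in for any sentence-level similarity in [0, 1] (string-based such as token overlap, or model-based such as BLEURT):

```python
from typing import Callable, Sequence


def soft_lcs(
    xs: Sequence[str],
    ys: Sequence[str],
    match: Callable[[str, str], float],
) -> float:
    """LCS-style dynamic program where an aligned sentence pair
    contributes its soft match score instead of a hard 0/1."""
    m, n = len(xs), len(ys)
    # dp[i][j] = best soft-LCS score over xs[:i] and ys[:j].
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + match(xs[i - 1], ys[j - 1]),  # align the pair
                dp[i - 1][j],  # skip a sentence of xs
                dp[i][j - 1],  # skip a sentence of ys
            )
    return dp[m][n]


# Toy usage with a trivial Jaccard token-overlap matcher (illustrative only).
def overlap(a: str, b: str) -> float:
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


print(soft_lcs(["The cat sat.", "It purred."],
               ["A cat sat down.", "Then it purred loudly."],
               overlap))
```

The only change from classic LCS is that an aligned pair adds a real-valued match score rather than 1, so the recurrence degenerates to standard LCS when `match` is an exact-equality indicator.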
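Similarly, the Experiment Setup row names the BLEURT-20 checkpoint as the model-based matcher. A minimal usage sketch with the open-source `bleurt` library (github.com/google-research/bleurt) follows; the checkpoint path is an assumption and must point to a locally downloaded, unzipped BLEURT-20 directory:

```python
# pip install git+https://github.com/google-research/bleurt.git
# Download and unzip the BLEURT-20 checkpoint from the BLEURT repository first.
from bleurt import score

# Assumed path: the unzipped BLEURT-20 checkpoint directory.
scorer = score.BleurtScorer("BLEURT-20")

references = ["The cat sat on the mat."]
candidates = ["A cat was sitting on the mat."]
# Returns one score per (reference, candidate) pair.
scores = scorer.score(references=references, candidates=candidates)
print(scores)
```

This only demonstrates pairwise sentence scoring; plugging it into the Soft-LCS sketch above as the `match` function is one plausible reading of the paper's model-based variant, not a confirmed detail of the released implementation.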