SMART: Sentences as Basic Units for Text Evaluation

Authors: Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our results show that system-level correlations of our proposed metric with a model-based matching function outperform all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics.
Researcher Affiliation | Industry | Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan (Google Research) {reinald, peterjliu, yaozhaoyz, shashinarayan}@google.com
Pseudocode | Yes | Figure 2: Python pseudocode of the soft version of Longest Common Subsequence (Soft-LCS) given two sets of summary sentences X and Y. (See the hedged Soft-LCS sketch below this table.)
Open Source Code | Yes | github.com/google-research/google-research/tree/master/smart_eval
Open Datasets | Yes | We conducted experiments on the SummEval dataset (Fabbri et al., 2021), a document summarization meta-evaluation suite consisting of summaries from the CNN/DM dataset (Hermann et al., 2015). We conducted experiments on the TRUE benchmark (Honovich et al., 2022)...
Dataset Splits | No | The paper mentions the SummEval dataset and the TRUE benchmark, and refers to a 'development set' for the TRUE benchmark, but it does not specify concrete training, validation, or test split percentages or counts for its experiments.
Hardware Specification | No | The paper mentions the general use of 'accelerators (GPUs/TPUs)' but does not provide specific details such as GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types.
Software Dependencies | No | The paper mentions 'Sentences are split using nltk' and refers to a 'BLEURT-20 checkpoint' and 'T5-11B', but it does not provide specific version numbers for software libraries like NLTK or other dependencies beyond model checkpoints. (A usage sketch combining nltk and BLEURT follows below.)
Experiment Setup | Yes | For BLEURT (Sellam et al., 2020), we used the BLEURT-20 checkpoint suggested by the authors, which also supports non-English languages. For ANLI, we used the same implementation as in Honovich et al. (2022), where T5-11B is fine-tuned with 25K training steps on ANLI (Nie et al., 2020), treating both contradiction and neutral pairs as not entailed.
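
The Pseudocode row points to the paper's Figure 2, a soft version of Longest Common Subsequence in which the binary sentence-equality test of standard LCS is replaced by a graded match score. The following is a minimal sketch of that recurrence, not a copy of the paper's figure; the names soft_lcs and match_fn are illustrative, and match_fn stands in for whichever string- or model-based matcher is used.

```python
def soft_lcs(x_sents, y_sents, match_fn):
    """Soft Longest Common Subsequence over two lists of sentences.

    match_fn(x, y) returns a graded match score (e.g., in [0, 1]) in place
    of the binary equality test used by standard LCS.
    """
    n, m = len(x_sents), len(y_sents)
    # dp[i][j] is the best cumulative soft-match score for the prefixes
    # x_sents[:i] and y_sents[:j].
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + match_fn(x_sents[i - 1], y_sents[j - 1]),
                dp[i - 1][j],  # skip sentence i of X
                dp[i][j - 1],  # skip sentence j of Y
            )
    return dp[n][m]
```

With a binary matcher such as `lambda a, b: float(a == b)`, this reduces to the ordinary LCS length, which is a quick way to sanity-check an implementation.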
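
The Software Dependencies and Experiment Setup rows mention nltk sentence splitting and the BLEURT-20 checkpoint. Below is a hedged sketch of how these could feed the soft_lcs function above, assuming the google-research/bleurt package is installed and the checkpoint has been downloaded; the path "BLEURT-20", the bleurt_match helper, and the example texts are illustrative.

```python
import nltk
from bleurt import score as bleurt_score  # github.com/google-research/bleurt

nltk.download("punkt", quiet=True)  # tokenizer models for sentence splitting

# Illustrative path to a locally downloaded BLEURT-20 checkpoint.
scorer = bleurt_score.BleurtScorer("BLEURT-20")

def bleurt_match(x_sent, y_sent):
    # BleurtScorer.score takes parallel lists of references and candidates
    # and returns one float per pair.
    return scorer.score(references=[x_sent], candidates=[y_sent])[0]

reference = "The cat sat on the mat. It then fell asleep."
candidate = "A cat was sitting on a mat. Later it slept."

# Sentences are split using nltk, as noted in the Software Dependencies row.
ref_sents = nltk.sent_tokenize(reference)
cand_sents = nltk.sent_tokenize(candidate)

print(soft_lcs(ref_sents, cand_sents, bleurt_match))
```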