Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
SMART: Sentences as Basic Units for Text Evaluation
Authors: Reinald Kim Amplayo, Peter J Liu, Yao Zhao, Shashi Narayan
ICLR 2023 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics. |
| Researcher Affiliation | Industry | Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan Google Research EMAIL |
| Pseudocode | Yes | Figure 2: Python pseudocode of the soft version of Longest Common Subsequence (Soft-LCS) given two sets of summary sentences X and Y. |
| Open Source Code | Yes | github.com/google-research/google-research/tree/master/smart_eval |
| Open Datasets | Yes | We conducted experiments on the SummEval dataset (Fabbri et al., 2021), a document summarization meta-evaluation suite consisting of summaries from the CNN/DM dataset (Hermann et al., 2015). We conducted experiments on the TRUE benchmark (Honovich et al., 2022)... |
| Dataset Splits | No | The paper mentions datasets like SummEval and the TRUE benchmark, and refers to a 'development set' for the TRUE benchmark, but it does not specify concrete training, validation, or test split percentages or counts for its experiments. |
| Hardware Specification | No | The paper mentions the general use of 'accelerators (GPUs/TPUs)' but does not provide specific details such as GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types. |
| Software Dependencies | No | The paper mentions 'Sentences are split using nltk' and refers to a 'BLEURT-20 checkpoint' and 'T5-11B', but it does not provide specific version numbers for software libraries like NLTK or other dependencies beyond model checkpoints. |
| Experiment Setup | Yes | For BLEURT (Sellam et al., 2020), we used the BLEURT-20 checkpoint suggested by the authors which also supports non-English languages. For ANLI, we used the same implementation as in Honovich et al. (2022), where T5-11B is fine-tuned with 25K training steps on ANLI (Nie et al., 2020), treating both contradiction and neutral pairs as not entailed. |
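
The Pseudocode row above refers to a soft version of Longest Common Subsequence computed over two lists of summary sentences. As a rough illustration of the general idea only, the sketch below replaces the exact-match test of classic LCS with a graded matching score in [0, 1]. It is not the paper's Figure 2: the recurrence is a simplified textbook variant, and `token_f1` and `soft_lcs` are hypothetical names introduced here, with token-set F1 standing in for whatever string-based or model-based matching function the authors use.

```python
from typing import Callable, Sequence


def token_f1(a: str, b: str) -> float:
    """Hypothetical matching function: F1 over lowercase token sets, in [0, 1]."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    overlap = len(tokens_a & tokens_b)
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def soft_lcs(
    xs: Sequence[str],
    ys: Sequence[str],
    match: Callable[[str, str], float] = token_f1,
) -> float:
    """Simplified soft LCS (an assumption, not the paper's exact recurrence).

    Classic LCS adds 1 when two items are equal; here the graded score
    match(x, y) is added instead, so partially matching sentences still
    contribute to the aligned total.
    """
    # table[i][j] holds the best soft alignment score of xs[:i] and ys[:j].
    table = [[0.0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            m = match(xs[i - 1], ys[j - 1])
            table[i][j] = max(
                table[i - 1][j - 1] + m,  # align xs[i-1] with ys[j-1]
                table[i - 1][j],          # skip xs[i-1]
                table[i][j - 1],          # skip ys[j-1]
            )
    return table[-1][-1]
```

Because the alignment must preserve sentence order, two summaries with the same sentences in a different order score lower than identical summaries. For the authors' actual implementation, see the repository linked in the Open Source Code row.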