
SMART: Sentences as Basic Units for Text Evaluation

Authors: Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan

ICLR 2023

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics on the SummEval summarization meta-evaluation dataset, while the same metric with a string-based matching function is competitive with current model-based metrics.
Researcher Affiliation	Industry	Reinald Kim Amplayo, Peter J. Liu, Yao Zhao, Shashi Narayan Google Research {reinald, peterjliu, yaozhaoyz, shashinarayan}@google.com
Pseudocode	Yes	Figure 2: Python pseudocode of the soft version of Longest Common Subsequence (Soft-LCS) given two sets of summary sentences X and Y.
Open Source Code	Yes	github.com/google-research/google-research/tree/master/smart_eval
Open Datasets	Yes	We conducted experiments on the SummEval dataset (Fabbri et al., 2021), a document summarization meta-evaluation suite consisting of summaries from the CNN/DM dataset (Hermann et al., 2015). We conducted experiments on the TRUE benchmark (Honovich et al., 2022)...
Dataset Splits	No	The paper mentions datasets like SummEval and the TRUE benchmark, and refers to a 'development set' for the TRUE benchmark, but it does not specify concrete training, validation, or test split percentages or counts for its experiments.
Hardware Specification	No	The paper mentions the general use of 'accelerators (GPUs/TPUs)' but does not provide specific details such as GPU models (e.g., NVIDIA A100), CPU models, or cloud computing instance types.
Software Dependencies	No	The paper mentions 'Sentences are split using nltk' and refers to a 'BLEURT-20 checkpoint' and 'T5-11B', but it does not provide specific version numbers for software libraries like NLTK or other dependencies beyond model checkpoints.
Experiment Setup	Yes	For BLEURT (Sellam et al., 2020), we used the BLEURT-20 checkpoint suggested by the authors which also supports non-English languages. For ANLI, we used the same implementation as in Honovich et al. (2022), where T5-11B is fine-tuned with 25K training steps on ANLI (Nie et al., 2020), treating both contradiction and neutral pairs as not entailed.
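
The Pseudocode row above refers to a soft version of Longest Common Subsequence computed over two lists of summary sentences. The following is a minimal illustrative sketch of that general idea, not a reproduction of the paper's Figure 2: the recurrence is the classic LCS dynamic program with the 0/1 equality test replaced by a score in [0, 1], and `token_f1` is a hypothetical string-based matching function chosen only for this example. The exact recurrence and matching functions used by the authors may differ.

```python
from typing import Callable, List


def token_f1(a: str, b: str) -> float:
    """Hypothetical matcher: F1 over unique lowercase tokens, in [0, 1]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    overlap = len(ta & tb)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(ta), overlap / len(tb)
    return 2 * precision * recall / (precision + recall)


def soft_lcs(
    X: List[str],
    Y: List[str],
    match: Callable[[str, str], float] = token_f1,
) -> float:
    """Soft LCS sketch: classic LCS table with a graded match score.

    Assumption: this is the textbook LCS recurrence with `match` in place
    of exact equality; it is not claimed to equal the paper's pseudocode.
    """
    dp = [[0.0] * (len(Y) + 1) for _ in range(len(X) + 1)]
    for i in range(1, len(X) + 1):
        for j in range(1, len(Y) + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + match(X[i - 1], Y[j - 1]),  # align x_i with y_j
                dp[i - 1][j],  # skip x_i
                dp[i][j - 1],  # skip y_j
            )
    return dp[-1][-1]


if __name__ == "__main__":
    candidate = ["the cat sat", "it rained", "the end"]
    reference = ["the cat slept", "the end"]
    print(soft_lcs(candidate, reference))
```

With a matcher that returns 1.0 for identical sentences and 0.0 otherwise, this sketch reduces to the ordinary LCS length, which is a convenient sanity check before swapping in a model-based matcher.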