SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Authors: Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel Bowman

NeurIPS 2019

Reproducibility variables extracted from the paper, each with the assessed result and the supporting LLM response:
Research Type: Experimental. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. We evaluate BERT-based baselines and find that they still lag behind humans by nearly 20 points.
Researcher Affiliation: Collaboration. Alex Wang (New York University); Yada Pruksachatkun (New York University); Nikita Nangia (New York University); Amanpreet Singh (Facebook AI Research); Julian Michael (University of Washington); Felix Hill (DeepMind); Omer Levy (Facebook AI Research); Samuel R. Bowman (New York University).
Pseudocode: No. No structured pseudocode or algorithm blocks were found.
Open Source Code: Yes. To facilitate using SuperGLUE, we release jiant (Wang et al., 2019b), a modular software toolkit built with PyTorch (Paszke et al., 2017), components from AllenNLP (Gardner et al., 2017), and the transformers package. [Footnote 4: https://github.com/nyu-mll/jiant]
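As a minimal sketch (an assumption, not the authors' jiant toolkit), the BERT baselines described in the paper can be approximated with the transformers package named above; the checkpoint name and example inputs here are illustrative only.

```python
# Minimal sketch (not the authors' jiant code): loading a BERT model with the
# `transformers` package for a SuperGLUE-style sentence-pair classification task.
# The checkpoint name and example inputs are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-large-cased", num_labels=2)

# Encode a (question, passage) pair as BERT expects for yes/no QA such as BoolQ.
enc = tokenizer("is france in europe", "France is a country in Western Europe.",
                return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**enc).logits
print(logits.shape)  # torch.Size([1, 2]); scores are meaningful only after fine-tuning
```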
Open Datasets: Yes. Public data: We require that tasks have existing public training data in order to minimize the risks involved in newly-created datasets. We also prefer tasks for which we have access to (or could create) a test set with private labels. BoolQ (Boolean Questions, Clark et al., 2019a) is a QA task where each example consists of a short passage and a yes/no question about the passage. The questions are provided anonymously and unsolicited by users of the Google search engine, and afterwards paired with a paragraph from a Wikipedia article containing the answer.
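A minimal sketch of pulling the public BoolQ data, assuming the Hugging Face datasets library (which hosts the SuperGLUE tasks under the "super_glue" name); this tooling is chosen for illustration and is not the paper's own data pipeline.

```python
# Minimal sketch (assumed tooling, not the paper's pipeline): inspecting BoolQ
# via the Hugging Face `datasets` library.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")
example = boolq["train"][0]
print(example["question"])       # unsolicited yes/no question from search queries
print(example["passage"][:200])  # paired Wikipedia paragraph containing the answer
print(example["label"])          # 1 = yes, 0 = no (test-set labels are withheld)
```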
Dataset Splits: Yes. Table 1: The tasks included in SuperGLUE. WSD stands for word sense disambiguation, NLI is natural language inference, coref. is coreference resolution, and QA is question answering. For MultiRC, we list the number of total answers for 456/83/166 train/dev/test questions. The table lists Corpus, |Train|, |Dev|, |Test|, Task, Metrics, and Text Sources for each task (e.g., BoolQ: 9427 / 3270 / 3245).
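Continuing with the same assumed datasets library, a quick check of the BoolQ split sizes quoted from Table 1:

```python
# Minimal sketch: confirming the BoolQ train/dev/test sizes reported in Table 1
# (9427 / 3270 / 3245). Uses the same assumed `datasets` library as above.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")
for split in ("train", "validation", "test"):
    print(split, len(boolq[split]))   # expected: 9427, 3270, 3245
```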
Hardware Specification: Yes. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this research, and funding from DeepMind for the hosting of the benchmark platform.
Software Dependencies: No. No specific version numbers for software dependencies are provided; the paper mentions PyTorch (Paszke et al., 2017), components from AllenNLP (Gardner et al., 2017), and the transformers package.
Experiment Setup: Yes. For training, we use the procedure specified in Devlin et al. (2019): We use Adam (Kingma and Ba, 2014) with an initial learning rate of 10^-5 and fine-tune for a maximum of 10 epochs.
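A minimal sketch of that fine-tuning recipe (Adam, initial learning rate 10^-5, at most 10 epochs); the model, dataloader, and loss interface are placeholder assumptions rather than the authors' jiant training loop.

```python
# Minimal sketch of the quoted recipe: Adam with lr = 1e-5, up to 10 epochs.
# Assumes a transformers-style model whose forward pass returns a `.loss`.
import torch

def fine_tune(model, train_loader, device="cuda", max_epochs=10, lr=1e-5):
    model.to(device)
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):                       # fine-tune for at most 10 epochs
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss                # placeholder: model computes its own loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```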