SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

Authors: Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel Bowman

NeurIPS 2019

Reproducibility variables extracted from the paper, each with the assessed result and the supporting LLM response:
Research Type: Experimental. In this paper we present SuperGLUE, a new benchmark styled after GLUE with a new set of more difficult language understanding tasks, a software toolkit, and a public leaderboard. We evaluate BERT-based baselines and find that they still lag behind humans by nearly 20 points.
Researcher Affiliation: Collaboration. Alex Wang (New York University); Yada Pruksachatkun (New York University); Nikita Nangia (New York University); Amanpreet Singh (Facebook AI Research); Julian Michael (University of Washington); Felix Hill (DeepMind); Omer Levy (Facebook AI Research); Samuel R. Bowman (New York University).
Pseudocode: No. No structured pseudocode or algorithm blocks were found.
Open Source Code: Yes. To facilitate using SuperGLUE, we release jiant (Wang et al., 2019b), a modular software toolkit built with PyTorch (Paszke et al., 2017), components from AllenNLP (Gardner et al., 2017), and the transformers package. [Footnote 4: https://github.com/nyu-mll/jiant]
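As a minimal sketch (an assumption, not the authors' jiant toolkit), the BERT baselines described in the paper can be approximated with the transformers package named above; the checkpoint name and example inputs here are illustrative only.

```python
# Minimal sketch (not the authors' jiant code): loading a BERT model with the
# `transformers` package for a SuperGLUE-style sentence-pair classification task.
# The checkpoint name and example inputs are illustrative assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-large-cased")
model = AutoModelForSequenceClassification.from_pretrained("bert-large-cased", num_labels=2)

# Encode a (question, passage) pair as BERT expects for yes/no QA such as BoolQ.
enc = tokenizer("is france in europe", "France is a country in Western Europe.",
                return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**enc).logits
print(logits.shape)  # torch.Size([1, 2]); scores are meaningful only after fine-tuning
```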
Open Datasets: Yes. Public data: We require that tasks have existing public training data in order to minimize the risks involved in newly-created datasets. We also prefer tasks for which we have access to (or could create) a test set with private labels. BoolQ (Boolean Questions, Clark et al., 2019a) is a QA task where each example consists of a short passage and a yes/no question about the passage. The questions are provided anonymously and unsolicited by users of the Google search engine, and afterwards paired with a paragraph from a Wikipedia article containing the answer.
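A minimal sketch of pulling the public BoolQ data, assuming the Hugging Face datasets library (which hosts the SuperGLUE tasks under the "super_glue" name); this tooling is chosen for illustration and is not the paper's own data pipeline.

```python
# Minimal sketch (assumed tooling, not the paper's pipeline): inspecting BoolQ
# via the Hugging Face `datasets` library.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")
example = boolq["train"][0]
print(example["question"])       # unsolicited yes/no question from search queries
print(example["passage"][:200])  # paired Wikipedia paragraph containing the answer
print(example["label"])          # 1 = yes, 0 = no (test-set labels are withheld)
```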
Dataset Splits: Yes. Table 1: The tasks included in SuperGLUE. WSD stands for word sense disambiguation, NLI is natural language inference, coref. is coreference resolution, and QA is question answering. For MultiRC, we list the number of total answers for 456/83/166 train/dev/test questions. The table lists Corpus, |Train|, |Dev|, |Test|, Task, Metrics, and Text Sources for each task (e.g., BoolQ: 9427 / 3270 / 3245).
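Continuing with the same assumed datasets library, a quick check of the BoolQ split sizes quoted from Table 1:

```python
# Minimal sketch: confirming the BoolQ train/dev/test sizes reported in Table 1
# (9427 / 3270 / 3245). Uses the same assumed `datasets` library as above.
from datasets import load_dataset

boolq = load_dataset("super_glue", "boolq")
for split in ("train", "validation", "test"):
    print(split, len(boolq[split]))   # expected: 9427, 3270, 3245
```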
Hardware Specification: Yes. We gratefully acknowledge the support of the NVIDIA Corporation with the donation of a Titan V GPU used at NYU for this research, and funding from DeepMind for the hosting of the benchmark platform.
Software Dependencies: No. No specific version numbers for software dependencies are provided; the paper mentions PyTorch (Paszke et al., 2017), components from AllenNLP (Gardner et al., 2017), and the transformers package.
Experiment Setup: Yes. For training, we use the procedure specified in Devlin et al. (2019): We use Adam (Kingma and Ba, 2014) with an initial learning rate of 10^-5 and fine-tune for a maximum of 10 epochs.
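A minimal sketch of that fine-tuning recipe (Adam, initial learning rate 10^-5, at most 10 epochs); the model, dataloader, and loss interface are placeholder assumptions rather than the authors' jiant training loop.

```python
# Minimal sketch of the quoted recipe: Adam with lr = 1e-5, up to 10 epochs.
# Assumes a transformers-style model whose forward pass returns a `.loss`.
import torch

def fine_tune(model, train_loader, device="cuda", max_epochs=10, lr=1e-5):
    model.to(device)
    model.train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):                       # fine-tune for at most 10 epochs
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss                # placeholder: model computes its own loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```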