reproducibilityindex.ai

Measuring Massive Multitask Language Understanding

Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

ICLR 2021

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We propose a new test to measure a text model's multitask accuracy. ... We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Researcher Affiliation	Academia	Dan Hendrycks UC Berkeley Collin Burns Columbia University Steven Basart UChicago Andy Zou UC Berkeley Mantas Mazeika UIUC Dawn Song UC Berkeley Jacob Steinhardt UC Berkeley
Pseudocode	No	The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code	Yes	The test and code is available at github.com/hendrycks/test.
Open Datasets	Yes	The test and code is available at github.com/hendrycks/test.
Dataset Splits	Yes	We collected 15908 questions in total, which we split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and is made of 1540 questions, and the test set has 14079 questions.
Hardware Specification	No	The paper mentions using the OpenAI API for GPT-3 models and fine-tuning other models, but it does not specify the hardware (e.g., specific GPU models, CPUs) used to perform these experiments.
Software Dependencies	No	The paper mentions various models and frameworks used (e.g., GPT-3, UnifiedQA, T5, RoBERTa, ALBERT, GPT-2) but does not provide specific version numbers for software dependencies or libraries.
Experiment Setup	Yes	We begin each prompt with "The following are multiple choice questions (with answers) about [subject]." For zero-shot evaluation, we append the question to the prompt. For few-shot evaluation, we add up to 5 demonstration examples with answers to the prompt before appending the question. All prompts end with "Answer: ". The model then produces probabilities for the tokens "A," "B," "C," and "D," and we treat the highest probability option as the prediction. For consistent evaluation, we create a dev set with 5 fixed few-shot examples for each subject.
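
The prompt format and prediction rule quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not the authors' code from github.com/hendrycks/test. The function names (`format_question`, `build_prompt`, `predict`), the "A. option" line layout, and the blank lines between examples are assumptions made for illustration. The excerpt itself only fixes the header sentence, the limit of 5 demonstrations, the closing "Answer: ", and the choice of the highest-probability token among A, B, C, and D.

```python
from typing import Mapping, Optional, Sequence, Tuple

CHOICES = ("A", "B", "C", "D")

# A demonstration is (question, four options, answer letter).
Demo = Tuple[str, Sequence[str], str]


def format_question(question: str, options: Sequence[str],
                    answer: Optional[str] = None) -> str:
    """Render one question. The answer letter is filled in only for demos."""
    lines = [question]
    # Assumed layout: one "A. option" line per choice.
    lines += [f"{label}. {text}" for label, text in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer else "Answer: ")
    return "\n".join(lines)


def build_prompt(subject: str, demos: Sequence[Demo],
                 question: str, options: Sequence[str]) -> str:
    """Header, then up to 5 demos (none for zero-shot), then the question."""
    header = ("The following are multiple choice questions "
              f"(with answers) about {subject}.")
    blocks = [header]
    blocks += [format_question(q, opts, ans) for q, opts, ans in demos[:5]]
    blocks.append(format_question(question, options))
    # Assumed separator: a blank line between blocks.
    return "\n\n".join(blocks)


def predict(token_probs: Mapping[str, float]) -> str:
    """Return the answer letter with the highest probability."""
    return max(CHOICES, key=lambda letter: token_probs.get(letter, 0.0))


if __name__ == "__main__":
    demos = [("What is 2 + 2?", ["3", "4", "5", "6"], "B")]
    print(build_prompt("elementary mathematics", demos,
                       "What is 3 + 3?", ["5", "6", "7", "8"]))
    # Hypothetical probabilities a model might assign to the four tokens.
    print(predict({"A": 0.1, "B": 0.6, "C": 0.2, "D": 0.1}))
```

Passing an empty `demos` list gives the zero-shot prompt. Passing the 5 fixed dev-set examples for a subject gives the few-shot prompt described in the excerpt.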