Measuring Massive Multitask Language Understanding

Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt

ICLR 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We propose a new test to measure a text model's multitask accuracy. ... We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Researcher Affiliation | Academia | Dan Hendrycks (UC Berkeley), Collin Burns (Columbia University), Steven Basart (UChicago), Andy Zou (UC Berkeley), Mantas Mazeika (UIUC), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley)
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | The test and code is available at github.com/hendrycks/test.
Open Datasets | Yes | The test and code is available at github.com/hendrycks/test.
Dataset Splits | Yes | We collected 15908 questions in total, which we split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and is made of 1540 questions, and the test set has 14079 questions. (A hedged loading sketch for these splits follows the table.)
Hardware Specification | No | The paper mentions using the OpenAI API for GPT-3 models and fine-tuning other models, but it does not specify the hardware (e.g., specific GPU models or CPUs) used to run these experiments.
Software Dependencies | No | The paper mentions the models and frameworks used (e.g., GPT-3, UnifiedQA, T5, RoBERTa, ALBERT, GPT-2) but does not provide version numbers for software dependencies or libraries.
Experiment Setup | Yes | We begin each prompt with "The following are multiple choice questions (with answers) about [subject]." For zero-shot evaluation, we append the question to the prompt. For few-shot evaluation, we add up to 5 demonstration examples with answers to the prompt before appending the question. All prompts end with "Answer:". The model then produces probabilities for the tokens A, B, C, and D, and we treat the highest-probability option as the prediction. For consistent evaluation, we create a dev set with 5 fixed few-shot examples for each subject. (A hedged sketch of this protocol follows the table.)
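
For the Dataset Splits row, the minimal sketch below shows one way the dev/val/test splits could be loaded, assuming the per-subject CSV layout distributed with the benchmark (one file per subject and split, no header row, with columns question, A, B, C, D, answer). The directory structure and column order are assumptions about the released data, not code from the paper.

```python
import csv
import glob
import os

def load_split(data_dir, split):
    """Return {subject: [(question, choices, answer_letter), ...]} for one split."""
    subjects = {}
    pattern = os.path.join(data_dir, split, f"*_{split}.csv")
    for path in sorted(glob.glob(pattern)):
        # e.g. "abstract_algebra_test.csv" -> subject "abstract_algebra"
        subject = os.path.basename(path)[: -len(f"_{split}.csv")]
        with open(path, newline="", encoding="utf-8") as f:
            rows = [(q, [a, b, c, d], ans) for q, a, b, c, d, ans in csv.reader(f)]
        subjects[subject] = rows
    return subjects

# Split sizes reported in the paper: 5 dev questions per subject,
# 1540 validation questions, and 14079 test questions.
dev = load_split("data", "dev")
val = load_split("data", "val")
test = load_split("data", "test")
```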
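
To make the Experiment Setup row concrete, the sketch below builds a prompt in the described format (subject header, up to 5 few-shot demonstrations with answers, the target question, and a trailing "Answer:") and selects the answer choice whose token the model scores highest. The `choice_logprob` callable is a placeholder for whatever API returns a log-probability for a single continuation token; it and the formatting helpers are assumptions, not the paper's released code.

```python
CHOICES = ["A", "B", "C", "D"]

def format_example(question, options, answer=None):
    """Render one question; include the answer letter only for few-shot demonstrations."""
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, options)]
    lines.append(f"Answer: {answer}" if answer is not None else "Answer:")
    return "\n".join(lines)

def build_prompt(subject, question, options, few_shot=()):
    """Subject header, followed by demonstrations, followed by the unanswered question."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject.replace('_', ' ')}.\n\n")
    demos = "".join(format_example(q, opts, ans) + "\n\n" for q, opts, ans in few_shot)
    return header + demos + format_example(question, options)

def predict(choice_logprob, subject, question, options, few_shot=()):
    """Return the letter whose token the model assigns the highest probability."""
    prompt = build_prompt(subject, question, options, few_shot)
    scores = {letter: choice_logprob(prompt, " " + letter) for letter in CHOICES}
    return max(scores, key=scores.get)
```

Zero-shot evaluation corresponds to `few_shot=()`; few-shot evaluation passes the 5 fixed dev examples for the subject.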