Measuring Massive Multitask Language Understanding
Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a new test to measure a text model's multitask accuracy. ... We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. |
| Researcher Affiliation | Academia | Dan Hendrycks (UC Berkeley), Collin Burns (Columbia University), Steven Basart (UChicago), Andy Zou (UC Berkeley), Mantas Mazeika (UIUC), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The test and code is available at github.com/hendrycks/test. |
| Open Datasets | Yes | The test and code is available at github.com/hendrycks/test. |
| Dataset Splits | Yes | We collected 15908 questions in total, which we split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and is made of 1540 questions, and the test set has 14079 questions. (A data-loading sketch for these splits follows the table.) |
| Hardware Specification | No | The paper mentions using the OpenAI API for GPT-3 models and fine-tuning other models, but it does not specify the hardware (e.g., specific GPU models, CPUs) used to perform these experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks used (e.g., GPT-3, Unified QA, T5, RoBERTa, ALBERT, GPT-2) but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | We begin each prompt with "The following are multiple choice questions (with answers) about [subject]." For zero-shot evaluation, we append the question to the prompt. For few-shot evaluation, we add up to 5 demonstration examples with answers to the prompt before appending the question. All prompts end with "Answer:". The model then produces probabilities for the tokens A, B, C, and D, and we treat the highest probability option as the prediction. For consistent evaluation, we create a dev set with 5 fixed few-shot examples for each subject. (A prompt-construction sketch follows the table.) |
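
The repository referenced above distributes the benchmark as per-subject CSV files. The following is a minimal loading sketch, assuming the column order (question, four options, answer letter) and a `data/{dev,val,test}/{subject}_{split}.csv` layout for that release; the subject name and local path are illustrative, not prescribed by the paper.

```python
"""Minimal sketch of loading one MMLU subject from the CSV release at
github.com/hendrycks/test. The directory layout and the column order
(question, A, B, C, D, answer) are assumptions about that release;
adjust DATA_DIR to wherever the data archive is extracted."""
import csv
from pathlib import Path

DATA_DIR = Path("data")  # assumed root of the extracted data archive


def load_split(subject: str, split: str) -> list[dict]:
    """Read data/{split}/{subject}_{split}.csv into a list of question dicts."""
    path = DATA_DIR / split / f"{subject}_{split}.csv"
    examples = []
    with path.open(newline="", encoding="utf-8") as f:
        for question, a, b, c, d, answer in csv.reader(f):
            examples.append({
                "question": question,
                "choices": {"A": a, "B": b, "C": c, "D": d},
                "answer": answer,  # gold label: one of A/B/C/D
            })
    return examples


if __name__ == "__main__":
    # Splits as described in the paper: 5 dev questions per subject,
    # 1,540 validation questions, and 14,079 test questions overall.
    dev = load_split("college_physics", "dev")    # the 5 few-shot examples
    test = load_split("college_physics", "test")
    print(len(dev), len(test))
```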
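
The prompt format and answer selection described in the Experiment Setup row can be sketched as below. The few-shot demonstrations are the question dicts produced by the loading sketch above; `score_options` is a hypothetical callable standing in for whichever model API returns next-token probabilities, so this illustrates the evaluation recipe rather than the authors' exact implementation.

```python
"""Sketch of the zero-/few-shot prompt format and answer selection
quoted in the table. `score_options` is a hypothetical interface, not
part of any specific library."""

CHOICES = ["A", "B", "C", "D"]


def format_example(example: dict, include_answer: bool = True) -> str:
    """Render one question with its lettered options; append the gold
    answer only for few-shot demonstration examples."""
    lines = [example["question"]]
    lines += [f"{letter}. {example['choices'][letter]}" for letter in CHOICES]
    lines.append("Answer:" + (f" {example['answer']}" if include_answer else ""))
    return "\n".join(lines)


def build_prompt(subject: str, question: dict, few_shot_examples=()) -> str:
    """Header + up to 5 answered demonstrations + the unanswered question."""
    header = (f"The following are multiple choice questions (with answers) "
              f"about {subject}.\n\n")
    shots = "\n\n".join(format_example(ex) for ex in few_shot_examples)
    target = format_example(question, include_answer=False)
    return header + (shots + "\n\n" if shots else "") + target


def predict(score_options, subject: str, question: dict, few_shot_examples=()) -> str:
    """score_options(prompt, choices) -> {choice: probability} is an
    assumed interface; the highest-probability option is the prediction."""
    prompt = build_prompt(subject, question, few_shot_examples)
    probs = score_options(prompt, CHOICES)
    return max(CHOICES, key=probs.get)
```

Zero-shot evaluation corresponds to passing an empty tuple of demonstrations; few-shot evaluation passes the subject's 5 fixed dev examples.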