Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. [K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026].
Measuring Massive Multitask Language Understanding
Authors: Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
ICLR 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We propose a new test to measure a text model's multitask accuracy. ... We find that while most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average. |
| Researcher Affiliation | Academia | Dan Hendrycks (UC Berkeley), Collin Burns (Columbia University), Steven Basart (UChicago), Andy Zou (UC Berkeley), Mantas Mazeika (UIUC), Dawn Song (UC Berkeley), Jacob Steinhardt (UC Berkeley) |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The test and code is available at github.com/hendrycks/test. |
| Open Datasets | Yes | The test and code is available at github.com/hendrycks/test. |
| Dataset Splits | Yes | We collected 15908 questions in total, which we split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set may be used for selecting hyperparameters and is made of 1540 questions, and the test set has 14079 questions. |
| Hardware Specification | No | The paper mentions using the OpenAI API for GPT-3 models and fine-tuning other models, but it does not specify the hardware (e.g., specific GPU models, CPUs) used to perform these experiments. |
| Software Dependencies | No | The paper mentions various models and frameworks used (e.g., GPT-3, UnifiedQA, T5, RoBERTa, ALBERT, GPT-2) but does not provide specific version numbers for software dependencies or libraries. |
| Experiment Setup | Yes | We begin each prompt with "The following are multiple choice questions (with answers) about [subject]." For zero-shot evaluation, we append the question to the prompt. For few-shot evaluation, we add up to 5 demonstration examples with answers to the prompt before appending the question. All prompts end with "Answer: ". The model then produces probabilities for the tokens "A," "B," "C," and "D," and we treat the highest probability option as the prediction. For consistent evaluation, we create a dev set with 5 fixed few-shot examples for each subject. |
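
The prompt construction quoted in the Experiment Setup row can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: the function names (`format_example`, `build_prompt`, `predict`), the dictionary layout of the demonstration examples, the line breaks between blocks, and the exact whitespace around "Answer:" are all assumptions. The answer-token probabilities are passed in as a plain dictionary, since how they are obtained from a model is not specified here.

```python
# Hypothetical sketch of the few-shot prompt format described in the table.
# Names, data layout, and whitespace are assumptions, not the paper's code.

CHOICES = ["A", "B", "C", "D"]
MAX_DEMOS = 5  # "up to 5 demonstration examples"


def format_example(question, options, answer=None):
    """Render one question; include the answer only for demonstrations."""
    lines = [question]
    for label, option in zip(CHOICES, options):
        lines.append(f"{label}. {option}")
    # The final question is left open so the model completes the answer.
    lines.append("Answer:" + (f" {answer}" if answer is not None else ""))
    return "\n".join(lines)


def build_prompt(subject, question, options, demos=()):
    """Build a zero-shot (no demos) or few-shot prompt for one question."""
    header = (
        "The following are multiple choice questions (with answers) "
        f"about {subject}."
    )
    blocks = [header]
    for demo in demos[:MAX_DEMOS]:
        blocks.append(
            format_example(demo["question"], demo["options"], demo["answer"])
        )
    blocks.append(format_example(question, options))
    return "\n\n".join(blocks)


def predict(token_probs):
    """Pick the option whose answer token has the highest probability."""
    return max(CHOICES, key=lambda c: token_probs.get(c, 0.0))


if __name__ == "__main__":
    demo = {
        "question": "Which planet is closest to the Sun?",
        "options": ["Venus", "Mercury", "Earth", "Mars"],
        "answer": "B",
    }
    prompt = build_prompt(
        "astronomy",
        "Which planet has the most mass?",
        ["Saturn", "Neptune", "Jupiter", "Earth"],
        demos=[demo],
    )
    print(prompt)
    # Made-up probabilities standing in for a model's output.
    print(predict({"A": 0.1, "B": 0.2, "C": 0.6, "D": 0.1}))  # prints C
```

Calling `build_prompt` with no `demos` gives the zero-shot variant; passing the same fixed demonstrations for every question in a subject mirrors the fixed 5-example dev set mentioned in the row above.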