Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Language Models Are Better Than Humans at Next-token Prediction

Authors: Buck Shlegeris, Fabien Roger, Lawrence Chan, Euan McLean

TMLR 2024

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To try to answer this question, we performed two distinct experiments to directly compare humans and language models on this front: one measuring top-1 accuracy and the other measuring perplexity on Open Web Text. |
| Researcher Affiliation | Industry | Buck Shlegeris EMAIL Redwood Research; Fabien Roger EMAIL Redwood Research; Lawrence Chan EMAIL METR; Euan McLean EMAIL FAR AI |
| Pseudocode | No | The paper describes the methods and procedures in prose, but does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/FabienRoger/lm-game-analysis-main. |
| Open Datasets | Yes | To answer this question, we performed two experiments that directly compare humans to language models on next-token prediction, using the Open Web Text dataset (Gokaslan & Cohen, 2019).; ...on the first 256 tokens of 1024 randomly selected passages of the Pile (Gao et al., 2020) (train and validation set). |
| Dataset Splits | No | The paper mentions using the "validation set of Open Web Text" and the "train and val set of the pile" but does not specify the explicit split percentages or sample counts for these datasets. |
| Hardware Specification | No | The paper mentions evaluating various language models (GPT-Neo, GPT-J, GPT-3, GPT-2) and training a minimal model, but it does not provide any specific hardware details such as GPU models, CPU types, or memory used for these evaluations or training. |
| Software Dependencies | No | The paper references various language models like GPT-Neo, GPT-J, GPT-3, and GPT-2, and cites papers related to their development, but it does not specify any particular software libraries or tools with their version numbers that were used for their experiments. |
| Experiment Setup | Yes | The human participants were either staff or advisors of our lab, or members of the Bountied Rationality Facebook group. They were paid $30/hour. There were 60 participants overall. ... A total of 18530 guesses were made across participants. ... Our website did not have any way for participants to guess visually empty tokens, such as newlines and half-Unicode characters (displayed as <?> on the website). We excluded cases where the correct guess was impossible from our analysis. ... 54 humans participated. They were again either staff of our lab or members of the Bountied Rationality Facebook group, paid $15 for answering a set of 120 rounds (taking roughly 30 minutes overall per participant). ... On each round, a prompt c from the validation dataset of Open Web Text (Gokaslan & Cohen, 2019) is shown (with a maximum length of 120 tokens). ... In practice, we use GPT-2-small (117M parameters) as our generator language model G. ... As our human participants could only enter one of the 11 ratios in our interface (99%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20%, 10%, or 1%). |
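The perplexity experiment above restricts human guesses to 11 fixed probability values, which caps the log-loss a participant can achieve. The following minimal sketch (not the authors' code; function names and the binary-outcome scoring are illustrative assumptions) shows how snapping probabilities to that grid and scoring with log-loss works:

```python
import math

# The 11 probability values participants could enter in the interface,
# as listed in the paper's experiment setup.
ALLOWED = [0.99, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10, 0.01]

def snap(p):
    """Return the allowed interface value closest to probability p.

    Illustrates the discretization imposed on human guesses; real
    participants chose a value directly rather than snapping.
    """
    return min(ALLOWED, key=lambda a: abs(a - p))

def log_loss(guesses, outcomes):
    """Mean negative log-likelihood of binary outcomes under the guesses.

    guesses:  probabilities assigned to the "correct" option each round.
    outcomes: 1 if that option was correct, else 0.
    """
    total = 0.0
    for p, y in zip(guesses, outcomes):
        total += -math.log(p if y else 1.0 - p)
    return total / len(guesses)

# Example: a guess of 0.95 must be rounded to the nearest allowed value.
print(snap(0.95))            # -> 0.99
print(round(log_loss([0.9, 0.2], [1, 0]), 4))
```

Because 0.99 is the largest enterable value, even a perfectly calibrated participant pays at least -log(0.99) per round on tokens they are nearly certain about, one reason discretization limits measurable human perplexity.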