Let's Verify Step by Step

Authors: Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. (See the best-of-N scoring sketch after this table.)
Researcher Affiliation | Industry | Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever & Karl Cobbe. OpenAI, San Francisco, CA, USA. karl@openai.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release our full process supervision dataset, PRM800K, to promote related research. [...] The full PRM800K dataset is available at https://github.com/openai/prm800k.
Open Datasets | Yes | We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. [...] To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model. [...] The full PRM800K dataset is available at https://github.com/openai/prm800k. (See the loading sketch after this table.)
Dataset Splits | Yes | To minimize overfitting, we include data from 4.5K MATH test problems in the PRM800K training set, and we therefore evaluate our models only on the remaining 500 MATH test problems.
Hardware Specification | No | The paper mentions using GPT-4 models, states that its small-scale models were pretrained with "roughly 200 times less compute", and credits the "supercomputing teams at OpenAI", but it does not provide specific hardware details such as GPU or CPU models or memory specifications.
Software Dependencies | No | The paper mentions using GPT-4 as a base model and the MathMix dataset, but it does not provide version numbers for software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | We only train for a single epoch on each dataset of model samples and reward model labels, without dropout, and without jointly learning a language modeling objective. [...] All of our PRMs are trained for 2 epochs.
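The 78% figure quoted in the "Research Type" row is obtained with best-of-N selection: the process-supervised reward model (PRM) scores each sampled solution as the product of its per-step correctness probabilities, and the highest-scoring sample is kept. The sketch below illustrates only that selection rule; `step_correct_probability` and the candidate solutions are hypothetical stand-ins for the paper's trained PRM and GPT-4 generator, not the authors' implementation.

```python
# Minimal sketch of PRM-based best-of-N selection: a solution's score is the
# product of its per-step correctness probabilities, and the highest-scoring
# sampled solution is returned. The scorer here is a toy stand-in for the PRM.
import math
from typing import Callable, Sequence

def prm_score(steps: Sequence[str],
              step_correct_probability: Callable[[str], float]) -> float:
    """Probability that every step is correct, taken as the product of
    the per-step probabilities (as described in the paper)."""
    return math.prod(step_correct_probability(step) for step in steps)

def best_of_n(candidates: Sequence[Sequence[str]],
              step_correct_probability: Callable[[str], float]) -> Sequence[str]:
    """Return the candidate solution the PRM ranks highest."""
    return max(candidates, key=lambda steps: prm_score(steps, step_correct_probability))

if __name__ == "__main__":
    # Toy scorer standing in for the trained PRM: trusts steps that show numbers.
    fake_prm = lambda step: 0.9 if any(ch.isdigit() for ch in step) else 0.5
    candidates = [
        ["Let x = 3.", "Then x^2 = 9.", "The answer is 9."],  # score 0.9**3
        ["The answer is probably nine."],                      # score 0.5
    ]
    print(best_of_n(candidates, fake_prm))
```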
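The PRM800K release linked in the "Open Datasets" row is distributed as JSON-lines files in the GitHub repository. Below is a minimal sketch of loading those labels and tallying the human step ratings; the filename and the field names (`label`, `steps`, `completions`, `rating`) are assumptions based on the public repository layout and may not match the release exactly.

```python
# Minimal sketch: load one PRM800K JSON-lines file and tally human step ratings.
# The file path and record schema are assumptions, not a documented API.
import json
from collections import Counter

def tally_step_ratings(path: str) -> Counter:
    """Count human step ratings (e.g. -1, 0, +1) in one PRM800K JSONL file."""
    counts: Counter = Counter()
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            # Assumed layout: each labelled solution carries a list of steps,
            # and each step carries one or more rated model completions.
            for step in (sample.get("label") or {}).get("steps", []):
                for completion in step.get("completions") or []:
                    counts[completion.get("rating")] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical path; the repository splits the labels across phase files.
    print(tally_step_ratings("prm800k/data/phase2_train.jsonl"))
```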