Let's Verify Step by Step

Authors: Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. (See the best-of-N scoring sketch after this table.)
Researcher Affiliation | Industry | Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever & Karl Cobbe. OpenAI, San Francisco, CA, USA. karl@openai.com
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | Yes | We release our full process supervision dataset, PRM800K, to promote related research. [...] The full PRM800K dataset is available at https://github.com/openai/prm800k.
Open Datasets | Yes | We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. [...] To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model. [...] The full PRM800K dataset is available at https://github.com/openai/prm800k. (See the loading sketch after this table.)
Dataset Splits | Yes | To minimize overfitting, we include data from 4.5K MATH test problems in the PRM800K training set, and we therefore evaluate our models only on the remaining 500 MATH test problems.
Hardware Specification | No | The paper mentions using GPT-4 models, states that its small-scale models were pretrained with "roughly 200 times less compute", and credits the "supercomputing teams at OpenAI", but it does not provide specific hardware details such as GPU or CPU models or memory specifications.
Software Dependencies | No | The paper mentions using GPT-4 as a base model and the MathMix dataset, but it does not provide version numbers for software dependencies such as programming languages, libraries, or frameworks.
Experiment Setup | Yes | We only train for a single epoch on each dataset of model samples and reward model labels, without dropout, and without jointly learning a language modeling objective. [...] All of our PRMs are trained for 2 epochs.
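The 78% figure quoted in the "Research Type" row is obtained with best-of-N selection: the process-supervised reward model (PRM) scores each sampled solution as the product of its per-step correctness probabilities, and the highest-scoring sample is kept. The sketch below illustrates only that selection rule; `step_correct_probability` and the candidate solutions are hypothetical stand-ins for the paper's trained PRM and GPT-4 generator, not the authors' implementation.

```python
# Minimal sketch of PRM-based best-of-N selection: a solution's score is the
# product of its per-step correctness probabilities, and the highest-scoring
# sampled solution is returned. The scorer here is a toy stand-in for the PRM.
import math
from typing import Callable, Sequence

def prm_score(steps: Sequence[str],
              step_correct_probability: Callable[[str], float]) -> float:
    """Probability that every step is correct, taken as the product of
    the per-step probabilities (as described in the paper)."""
    return math.prod(step_correct_probability(step) for step in steps)

def best_of_n(candidates: Sequence[Sequence[str]],
              step_correct_probability: Callable[[str], float]) -> Sequence[str]:
    """Return the candidate solution the PRM ranks highest."""
    return max(candidates, key=lambda steps: prm_score(steps, step_correct_probability))

if __name__ == "__main__":
    # Toy scorer standing in for the trained PRM: trusts steps that show numbers.
    fake_prm = lambda step: 0.9 if any(ch.isdigit() for ch in step) else 0.5
    candidates = [
        ["Let x = 3.", "Then x^2 = 9.", "The answer is 9."],  # score 0.9**3
        ["The answer is probably nine."],                      # score 0.5
    ]
    print(best_of_n(candidates, fake_prm))
```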
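The PRM800K release linked in the "Open Datasets" row is distributed as JSON-lines files in the GitHub repository. Below is a minimal sketch of loading those labels and tallying the human step ratings; the filename and the field names (`label`, `steps`, `completions`, `rating`) are assumptions based on the public repository layout and may not match the release exactly.

```python
# Minimal sketch: load one PRM800K JSON-lines file and tally human step ratings.
# The file path and record schema are assumptions, not a documented API.
import json
from collections import Counter

def tally_step_ratings(path: str) -> Counter:
    """Count human step ratings (e.g. -1, 0, +1) in one PRM800K JSONL file."""
    counts: Counter = Counter()
    with open(path) as f:
        for line in f:
            sample = json.loads(line)
            # Assumed layout: each labelled solution carries a list of steps,
            # and each step carries one or more rated model completions.
            for step in (sample.get("label") or {}).get("steps", []):
                for completion in step.get("completions") or []:
                    counts[completion.get("rating")] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical path; the repository splits the labels across phase files.
    print(tally_step_ratings("prm800k/data/phase2_train.jsonl"))
```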