Let's Verify Step by Step
Authors: Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, Karl Cobbe
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. Our process-supervised model solves 78% of problems from a representative subset of the MATH test set. |
| Researcher Affiliation | Industry | Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever & Karl Cobbe OpenAI San Francisco, CA, USA karl@openai.com |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | We release our full process supervision dataset, PRM800K, to promote related research. [...] The full PRM800K dataset is available at https://github.com/openai/prm800k. |
| Open Datasets | Yes | We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset. [...] To support related research, we also release PRM800K, the complete dataset of 800,000 step-level human feedback labels used to train our best reward model. [...] The full PRM800K dataset is available at https://github.com/openai/prm800k. |
| Dataset Splits | Yes | To minimize overfitting, we include data from 4.5K MATH test problems in the PRM800K training set, and we therefore evaluate our models only on the remaining 500 MATH test problems. |
| Hardware Specification | No | The paper mentions using GPT-4 models and that small-scale models were pretrained with "roughly 200 times less compute", and refers to "supercomputing teams at OpenAI", but it does not provide specific hardware details such as GPU or CPU models or memory specifications. |
| Software Dependencies | No | The paper mentions using GPT-4 as a base model and the Math Mix dataset, but it does not provide specific version numbers for software dependencies such as programming languages, libraries, or frameworks. |
| Experiment Setup | Yes | We only train for a single epoch on each dataset of model samples and reward model labels, without dropout, and without jointly learning a language modeling objective. [...] All of our PRMs are trained for 2 epochs. |
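
The technique evaluated in the Research Type row is a process reward model (PRM) that assigns a correctness probability to each step of a solution; the paper scores a full solution as the product of its per-step correctness probabilities and picks the best of N sampled solutions by that score. The sketch below illustrates only that scoring rule; `step_correct_prob` is a hypothetical stand-in for the GPT-4-based reward model, which is not released.

```python
import math
from typing import Callable, List


def score_solution(
    steps: List[str],
    step_correct_prob: Callable[[str], float],
) -> float:
    """Score a solution as the probability that every step is correct,
    computed as the product of per-step correctness probabilities.
    `step_correct_prob` is a hypothetical stand-in for the trained PRM."""
    return math.prod(step_correct_prob(step) for step in steps)


def best_of_n(
    candidate_solutions: List[List[str]],
    step_correct_prob: Callable[[str], float],
) -> List[str]:
    """Select the highest-scoring solution among N sampled candidates,
    mirroring the best-of-N evaluation described in the paper."""
    return max(candidate_solutions,
               key=lambda sol: score_solution(sol, step_correct_prob))
```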
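The released artifact behind the Open Source Code and Open Datasets rows is the PRM800K dataset of step-level labels rather than training code. Below is a minimal loading sketch, assuming a JSON Lines layout; the file path and any field names are illustrative and should be checked against https://github.com/openai/prm800k.

```python
import json


# Hypothetical path; the actual file names live in the prm800k repository.
PATH = "prm800k/data/phase2_train.jsonl"


def iter_records(path: str):
    """Yield one JSON object per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


if __name__ == "__main__":
    # Basic sanity check: count labeled records. The schema (e.g. which
    # key holds the step-level rating) is not restated here and should be
    # read from the repository's documentation.
    n = sum(1 for _ in iter_records(PATH))
    print(f"loaded {n} labeled records")
```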
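The Dataset Splits row quotes the evaluation protocol: of the 5,000 MATH test problems, 4,500 are folded into the PRM800K training data and the remaining 500 are held out for evaluation. The paper does not describe how that partition was drawn, so the sketch below only illustrates a deterministic 4,500/500 split; the shuffling and seed are assumptions.

```python
import random


def split_math_test(problems, n_eval=500, seed=0):
    """Partition the 5,000 MATH test problems into 4,500 used for
    training data and 500 held out for evaluation. The shuffle and seed
    are illustrative; the released PRM800K data encodes the actual split."""
    assert len(problems) == 4500 + n_eval
    shuffled = random.Random(seed).sample(problems, len(problems))
    return shuffled[n_eval:], shuffled[:n_eval]  # (train_extra, eval_set)
```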
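The Experiment Setup row quotes the training recipe: reward models are fine-tuned for a single epoch per dataset of model samples and labels, without dropout and without a joint language-modeling objective, while the PRMs are trained for 2 epochs. The configuration sketch below is a hedged restatement of that recipe; the field names and the unreported learning rate are illustrative, since the paper does not release training code or full hyperparameters.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RewardModelTrainConfig:
    """Illustrative configuration mirroring the quoted setup; the field
    names are assumptions, not values released with the paper."""
    epochs: int = 1                    # single epoch per dataset of samples/labels
    dropout: float = 0.0               # trained without dropout
    joint_lm_objective: bool = False   # no joint language-modeling loss
    learning_rate: Optional[float] = None  # not reported in the paper


# PRMs are the stated exception: trained for 2 epochs on PRM800K.
prm_config = RewardModelTrainConfig(epochs=2)
```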