AlphaMath Almost Zero: Process Supervision without Process
Authors: Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results on both in-domain and out-of-domain datasets demonstrate that even without GPT-4 or human-annotated process supervision, our AlphaMath framework achieves comparable or superior results to previous state-of-the-art methods. |
| Researcher Affiliation | Collaboration | Guoxin Chen, Minpeng Liao, Chengxi Li, Kai Fan; Tongyi Lab; chenguoxin22@mails.ucas.ac.cn; {minpeng.lmp,xiji.lcx,k.fan}@alibaba-inc.com |
| Pseudocode | Yes | Algorithm 1: Inference with MCTS; Algorithm 2: Step-level Beam Search (an illustrative sketch of the beam search follows the table) |
| Open Source Code | Yes | Code: https://github.com/MARIO-Math-Reasoning/Super_MARIO |
| Open Datasets | Yes | For the training sets, we exclusively extract question and answer pairs from GSM8K [7] and MATH [15], omitting the human-annotated solution analysis. |
| Dataset Splits | No | The paper specifies training and test sets but does not explicitly mention or provide details for a separate validation dataset split. |
| Hardware Specification | Yes | All experiments were conducted on Ubuntu 22.04 equipped with 8 * NVIDIA A100 GPUs. |
| Software Dependencies | Yes | Our code mainly depends on Python 3.11 and PyTorch 2.1.2. ... We trained all models with DeepSpeed ZeRO Stage 2 [29] and Flash-Attention 2 [9]. |
| Experiment Setup | Yes | For supervised fine-tuning, we set the learning rate to 4e-5, the batch size to 1024, and the weight of the value loss to 0.01 or 0.0005 (for Llama3 [11]), and train the model for 10 epochs. We employ the AdamW optimizer [24] and the cosine learning rate scheduler with the warmup rate set to 0.03. Table 6 provides key hyperparameters of AlphaMath. (An illustrative sketch of this optimizer/scheduler setup follows the table.) |
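
The step-level beam search cited in the pseudocode row (Algorithm 2) amounts to: keep a beam of partial solutions, sample several candidate next steps for each, score the extended partials with the value model, and retain the top-scoring ones. The sketch below only mirrors that control flow; `generate_next_steps`, `value_score`, and `is_terminal` are hypothetical stand-ins for the policy model's step generator, the value head, and the terminal-state check, and this is not the authors' implementation (see the linked repository for that).

```python
from typing import Callable, List, Tuple


def step_level_beam_search(
    question: str,
    generate_next_steps: Callable[[str, int], List[str]],  # hypothetical: partial solution -> candidate next steps
    value_score: Callable[[str], float],                    # hypothetical: partial solution -> value estimate
    is_terminal: Callable[[str], bool],                     # hypothetical: does the partial end in a final answer?
    beam_size: int = 3,
    expand_per_beam: int = 5,
    max_steps: int = 10,
) -> List[Tuple[str, float]]:
    """Return candidate solutions ranked by the value model's estimate."""
    # Each beam entry is (partial_solution_text, value_estimate).
    beams: List[Tuple[str, float]] = [(question, 0.0)]
    finished: List[Tuple[str, float]] = []

    for _ in range(max_steps):
        # Expand every partial solution with several candidate next steps
        # sampled from the policy model, and score each extension.
        candidates = [
            (partial + "\n" + step, value_score(partial + "\n" + step))
            for partial, _ in beams
            for step in generate_next_steps(partial, expand_per_beam)
        ]
        if not candidates:
            break
        # Completed solutions leave the beam; the rest compete for the top-B slots.
        finished.extend(c for c in candidates if is_terminal(c[0]))
        ongoing = [c for c in candidates if not is_terminal(c[0])]
        ongoing.sort(key=lambda c: c[1], reverse=True)
        beams = ongoing[:beam_size]
        if not beams:
            break

    return sorted(finished + beams, key=lambda c: c[1], reverse=True)
```

The paper lists step-level beam search alongside MCTS inference; the sketch captures only the high-level idea of pruning reasoning paths with the value model at each step, not the paper's step delimiters or batching.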
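The reported fine-tuning hyperparameters translate into a small optimizer/scheduler setup. Below is a minimal sketch assuming a standard PyTorch/Hugging Face training loop; the hyperparameter values come from the table above, while `build_optimizer_and_scheduler`, `joint_loss`, `lm_loss`, and `value_loss` are hypothetical names for the combined next-token and value-head objectives implied by the value-loss weight. The distributed pieces (DeepSpeed ZeRO Stage 2, Flash-Attention 2, 8×A100) are omitted.

```python
import torch
from torch.optim import AdamW
from transformers import get_cosine_schedule_with_warmup

# Values reported in the experiment setup; VALUE_LOSS_WEIGHT is 0.0005 for Llama3.
LEARNING_RATE = 4e-5
GLOBAL_BATCH_SIZE = 1024
EPOCHS = 10
WARMUP_RATIO = 0.03
VALUE_LOSS_WEIGHT = 0.01


def build_optimizer_and_scheduler(model: torch.nn.Module, steps_per_epoch: int):
    """AdamW plus a cosine schedule with linear warmup, as described in the setup."""
    total_steps = steps_per_epoch * EPOCHS
    optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(WARMUP_RATIO * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler


def joint_loss(lm_loss: torch.Tensor, value_loss: torch.Tensor) -> torch.Tensor:
    """Next-token (policy) loss plus a down-weighted value-head regression loss."""
    return lm_loss + VALUE_LOSS_WEIGHT * value_loss
```

With a global batch size of 1024 across 8 GPUs, the effective per-device batch would rely on gradient accumulation handled by the DeepSpeed ZeRO Stage 2 wrapper mentioned in the dependencies row; that plumbing is intentionally left out of the sketch.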