MAmmoTH2: Scaling Instructions from the Web
Authors: Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Fine-tuning base LLMs on this dataset, we build MAmmoTH2 models, which significantly boost performance on reasoning benchmarks. Notably, MAmmoTH2-7B's (Mistral) performance increases from 11% to 36.7% on MATH and from 36% to 68.4% on GSM8K without training on any in-domain data. Further training MAmmoTH2 on public instruction tuning datasets yields MAmmoTH2-Plus, achieving state-of-the-art performance on several reasoning and chatbot benchmarks. |
| Researcher Affiliation | Academia | Xiang Yue, Tuney Zheng, Ge Zhang, Wenhu Chen; Carnegie Mellon University, University of Waterloo; xyue2@andrew.cmu.edu, wenhuchen@uwaterloo.ca |
| Pseudocode | No | The paper describes its pipeline with diagrams but does not include any pseudocode or algorithm blocks. |
| Open Source Code | Yes | We have submitted the data and code with the paper. We will also release the data and code to the public upon acceptance. |
| Open Datasets | Yes | Unlike existing instruction-tuning datasets, our dataset WEBINSTRUCT is purely mined from the Web without any human crowdsourcing or GPT-4 distillation. ... We have submitted the data and code with the paper. We will also release the data and code to the public upon acceptance. |
| Dataset Splits | No | The paper uses its collected WEBINSTRUCT dataset for training and evaluates performance on separate held-out reasoning benchmarks. It does not explicitly specify a validation split or internal test split for the WEBINSTRUCT dataset itself. |
| Hardware Specification | Yes | All the models are trained with 32 A100 GPUs. |
| Software Dependencies | No | The paper mentions using the LLaMA-Factory [Zheng et al., 2024d] library but does not provide specific version numbers for it or other software components. |
| Experiment Setup | Yes | We use a learning rate of 5e-6 for Mistral 7B and 1e-5 for Mixtral, Llama-3 8B, and Yi 34B. The global batch size is set to 512 with a maximum sequence length of 4096. We employ a cosine scheduler with a 3% warm-up period for 2 epochs. (See the configuration sketch below the table.) |
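To make the Experiment Setup and Hardware Specification rows concrete, the sketch below shows one way the reported hyperparameters could be expressed as a standard supervised fine-tuning configuration with Hugging Face `TrainingArguments`. This is a minimal illustration, not the authors' setup: the paper states it trained with the LLaMA-Factory library on 32 A100 GPUs, and only the learning rate, global batch size (512), max sequence length (4096), cosine schedule, 3% warm-up, and 2 epochs are reported. The per-device batch split, precision, and output path below are assumptions.

```python
# Hedged sketch of the reported fine-tuning hyperparameters.
# The paper reports: lr 5e-6 (Mistral 7B) / 1e-5 (Mixtral, Llama-3 8B, Yi 34B),
# global batch size 512, max length 4096, cosine schedule, 3% warm-up, 2 epochs,
# trained on 32 A100 GPUs with LLaMA-Factory. Everything else here is assumed.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="webinstruct-sft",       # hypothetical output path, not from the paper
    num_train_epochs=2,                 # "2 epochs"
    learning_rate=5e-6,                 # Mistral 7B setting (1e-5 for the larger models)
    lr_scheduler_type="cosine",         # "cosine scheduler"
    warmup_ratio=0.03,                  # "3% warm-up period"
    per_device_train_batch_size=4,      # assumed split: 4 x 32 GPUs x 4 accumulation = 512 global
    gradient_accumulation_steps=4,
    bf16=True,                          # assumed mixed-precision choice; the paper does not say
    logging_steps=10,
    save_strategy="epoch",
)
# The 4096 maximum sequence length is enforced at tokenization/packing time,
# not via TrainingArguments.
```

In LLaMA-Factory itself these values would live in a training config rather than in Python, but the mapping of the reported numbers is the same.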