Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

ReMA: Learning to Meta-Think for LLMs with Multi-agent Reinforcement Learning

Authors: Ziyu Wan, Yunxiang Li, Xiaoyu Wen, Yan Song, Hanjing Wang, Linyi Yang, Mark Schmidt, Jun Wang, Weinan Zhang, Shuyue Hu, Ying Wen

NeurIPS 2025 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks, including competitive-level mathematical benchmarks and LLM-as-a-Judge benchmarks. Additionally, we further extend ReMA to multi-turn interaction settings, leveraging turn-level ratio and parameter sharing to improve efficiency. Comprehensive ablation studies further illustrate the evolving dynamics of each distinct agent, providing valuable insights into how the meta-thinking reasoning process enhances the reasoning capabilities of LLMs.
Researcher Affiliation | Academia | 1 Shanghai Jiao Tong University; 2 Shanghai Artificial Intelligence Laboratory; 3 University of British Columbia; 4 University College London; 5 Canada CIFAR AI Chair (Amii)
Pseudocode | Yes | The pseudocode is shown in Algorithm 1.
Open Source Code | Yes | Our code can be found in https://github.com/ziyuwan/ReMA-public
Open Datasets | Yes | For mathematical reasoning experiments, we train models on 7.5k training samples in MATH [Hendrycks et al., 2021] and use MATH500 [Lightman et al., 2023] as the in-distribution test dataset. Additionally, we test the optimized models on out-of-distribution datasets: GSM8K [Cobbe et al., 2021], AIME24, AMC23, GaoKao2023En [Zhang et al., 2023], Minerva Math [Lewkowycz et al., 2022], and OlympiadBench [He et al., 2024]. For LLM-as-a-Judge benchmarks, we train models on RewardBench [Lambert et al., 2024].
Dataset Splits | Yes | For mathematical reasoning experiments, we train models on 7.5k training samples in MATH [Hendrycks et al., 2021] and use MATH500 [Lightman et al., 2023] as the in-distribution test dataset. For LLM-as-a-Judge benchmarks, we train models on RewardBench [Lambert et al., 2024]. Specifically, we convert the original data into a pair-ranking format and split it into a training set of 5k items and a test set of 970 items, denoted as RewardBench970. For the ablation results in Fig 6, we use a tiny subset of MATH Level 3-5, training for 300 steps. Specifically, we sample 19 questions for every single type (133 instances in total).
Hardware Specification | Yes | All experiments are conducted in a node of 8 NVIDIA A100 GPUs. We use 32 NVIDIA A800 GPUs, the longest training cost about 40 hours due to large-scale validation per 10 steps. We use 8 NVIDIA A800 GPUs, the training cost about 30 hours.
Software Dependencies | No | The paper mentions OpenRLHF, REINFORCE++, Adam optimizer, VeRL, GRPO, LlamaFactory, and DeepSpeed Zero2 but does not specify their version numbers.
Experiment Setup | Yes | During rollout, we set temperature=1.0, top_p=1.0, top_k=-1, and use vLLM for inference acceleration. We set the max generation length to be 2048 and the rollout batch size to be 1000. The number of samples per prompt is 4. During training, we use Adam optimizer with a learning rate of 5e-7. We set the mini-batch size to be 500, and the clip ratio to be 0.2. Other hyperparameters, such as KL coefficients and the number of training episodes, were carefully tuned based on validation set performance to ensure robust and reliable results. For Llama3-8B-Instruct, we set the learning rate of 2e-7 for stable training. We use εmin = 0.2, εmax = 0.8 for prompt filtering. We use the same #Training Episode=4 for all models, and for #Update Iteration, we use 3 for Llama3-8B-Instruct and Llama3.1-8B-Instruct, 10 for Qwen2.5-7B-Instruct. And we set the KL coefficient to be 1e-2 for all the 3 models.
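
For readers who want the reported settings in one place, the sketch below collects the values quoted in the Experiment Setup row into plain Python dataclasses. It is an illustration only, not the authors' configuration format: the class names, field names, and the `PER_MODEL` table are hypothetical, and because the row concatenates excerpts from the paper, it is not confirmed here that every value belongs to the same run. The `keep_prompt` helper rests on a further assumption, namely that prompt filtering keeps a prompt when the mean reward over its sampled rollouts lies in [εmin, εmax]; the row gives only the two thresholds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutConfig:
    # Values as quoted in the Experiment Setup row; field names are hypothetical.
    temperature: float = 1.0
    top_p: float = 1.0
    top_k: int = -1
    max_generation_length: int = 2048
    rollout_batch_size: int = 1000
    samples_per_prompt: int = 4


@dataclass(frozen=True)
class TrainConfig:
    # Defaults as quoted in the row; per-model differences are set in PER_MODEL.
    learning_rate: float = 5e-7
    mini_batch_size: int = 500
    clip_ratio: float = 0.2
    kl_coefficient: float = 1e-2
    training_episodes: int = 4
    update_iterations: int = 3
    eps_min: float = 0.2
    eps_max: float = 0.8


# Per-model overrides mentioned in the row (hypothetical table layout).
PER_MODEL = {
    "Llama3-8B-Instruct": TrainConfig(learning_rate=2e-7, update_iterations=3),
    "Llama3.1-8B-Instruct": TrainConfig(update_iterations=3),
    "Qwen2.5-7B-Instruct": TrainConfig(update_iterations=10),
}


def keep_prompt(rewards, eps_min=0.2, eps_max=0.8):
    """ASSUMPTION: keep a prompt when the mean reward of its rollouts
    lies in [eps_min, eps_max] (inclusive). The source states only the
    thresholds, not the filtering rule."""
    if not rewards:
        return False
    mean_reward = sum(rewards) / len(rewards)
    return eps_min <= mean_reward <= eps_max


# Dataset Splits row: 19 questions per type and 133 instances in total
# imply 7 question types in the ablation subset.
ABLATION_TYPES = 133 // 19
```

Under the assumed rule and 4 samples per prompt, a prompt solved in 1 of 4 rollouts (mean 0.25) would be kept, while one solved in 0 or 4 of 4 would be dropped.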