Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning
Authors: Yiqun Chen, Lingyong Yan, Weiwei Sun, Xinyu Ma, Yi Zhang, Shuaiqiang Wang, Dawei Yin, Yiming Yang, Jiaxin Mao
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments conducted on various QA benchmarks demonstrate that MMOA-RAG effectively boost the overall performance of the pipeline and outperforms existing baselines. Furthermore, comprehensive ablation studies validate the contributions of individual components and demonstrate MMOA-RAG can be adapted to different RAG pipelines and benchmarks. |
| Researcher Affiliation | Collaboration | (1) Renmin University of China; (2) Baidu Inc.; (3) Carnegie Mellon University. EMAIL, EMAIL |
| Pseudocode | Yes | Section C, "The Pseudocode of Multi-Agent Training Process of MMOA-RAG": Algorithm 1 is the pseudocode for multi-agent optimization based on MAPPO. |
| Open Source Code | Yes | The code of MMOA-RAG is on https://github.com/chenyiqun/MMOA-RAG. |
| Open Datasets | Yes | We conducted experiments using MMOA-RAG alongside various baseline models across three open-domain QA datasets: Hotpot QA [53], 2Wiki Multihop QA [14], and Ambig QA [31]. The candidate documents are all retrieved from Wikipedia passages for three datasets. |
| Dataset Splits | No | The paper lists the datasets used (Hotpot QA [53], 2Wiki Multihop QA [14], and Ambig QA [31]) but does not explicitly provide specific details about the training, validation, or test splits (e.g., percentages, sample counts, or explicit references to standard splits used). While these are standard benchmarks often with predefined splits, the paper itself does not describe them. |
| Hardware Specification | No | The paper states in the NeurIPS Paper Checklist (Question 8) that "We provide the computational resource requirements in our anonymous code." This indicates that the specific hardware details are not present within the main body of the paper. |
| Software Dependencies | No | The paper mentions "Building on the PPO code from LLama-Factory [60], we have developed MMOA-RAG, which optimizes the RAG multi-agent system using Multi-Agent PPO." While LLama-Factory is mentioned, specific version numbers for it or other software dependencies (like Python, PyTorch, etc.) are not provided. |
| Experiment Setup | Yes | Table 6: Key hyperparameters in the training process of MMOA-RAG (name: explanation, value). β_max: maximum β in Equation (15), 0.2; β_min: minimum β in Equation (15), 0.06; γ: key hyperparameter in GAE, 1.0; λ: key hyperparameter in GAE, 0.95; ϵ: clip range in MAPPO, 0.2; α: coefficients in Equation (10), 0.1; lr: maximum learning rate, 2e-5; buffer_size: buffer size in MAPPO, 128; lr_scheduler: learning rate scheduler, cosine; top_p: sampling parameters in training, 0.9. |
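
To make the Table 6 values quoted in the Experiment Setup row more concrete, the following is a minimal sketch of where γ, λ, and ϵ enter a standard PPO-style update: generalized advantage estimation (GAE) and the clipped surrogate term. It is a generic textbook formulation, not the authors' MMOA-RAG implementation. The function names, the single-episode setting, and the zero value assumed after the final step are all illustrative assumptions.

```python
from typing import List

# Values quoted from Table 6 of the paper.
GAMMA = 1.0      # γ, discount used in GAE
LAMBDA = 0.95    # λ, GAE smoothing parameter
CLIP_EPS = 0.2   # ϵ, clip range in MAPPO


def gae_advantages(rewards: List[float], values: List[float],
                   gamma: float = GAMMA, lam: float = LAMBDA) -> List[float]:
    """Generalized Advantage Estimation for one finished episode.

    values[t] is the critic estimate at step t. The value after the last
    step is taken to be 0.0 (an assumption made for this sketch).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        # One-step temporal-difference error.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def clipped_surrogate(ratio: float, advantage: float,
                      eps: float = CLIP_EPS) -> float:
    """PPO clipped objective term for a single sample.

    ratio is new_policy_prob / old_policy_prob for the sampled action.
    """
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # Hypothetical episode: the reward arrives only at the final step.
    adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.5, 0.5])
    print(adv)
    print(clipped_surrogate(ratio=1.5, advantage=1.0))
```

With γ = 1.0 the final reward is not discounted over time, so only λ attenuates how much credit earlier steps receive. The clip range keeps each policy update close to the policy that collected the data.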