Reflective Multi-Agent Collaboration based on Large Language Models

Authors: Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, Ji-Rong Wen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on three datasets to evaluate the performance of our model in multi-hop question answering, mathematics, and chess scenarios. Experimental results show that COPPER possesses stronger reflection capabilities and exhibits excellent generalization performance across different actor models.
Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Huawei Noah's Ark Lab. {xiaohe,zeyuzhang,xueyangfeng,wanglei154,lirui121200,xu.chen,jrwen}@ruc.edu.cn, daiquanyu@huawei.com
Pseudocode | No | The paper describes the processes and framework but does not contain any formally labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Our code can be found at: https://anonymous.4open.science/r/copper-F72A/
Open Datasets | Yes | We choose HotpotQA [36], GSM8K [7], and Checkmate in One Move [28] to evaluate the collaborative abilities of multi-agent systems in multi-hop question answering, mathematics, and chess.
Dataset Splits | Yes | For SFT training, we tune the epoch in {1, 2, 3, 4}, batch size in {64, 128, 256}, and learning rate in {1e-4, 2e-4, 3e-4, 5e-4} through grid search on a validation set with 100 instances, while for counterfactual PPO, we change the search range of learning rate to {1e-5, 2e-5, 3e-5, 5e-5}.
Hardware Specification | Yes | We conduct all experiments on four NVIDIA A800-80G GPUs.
Software Dependencies | No | The paper mentions specific models such as GPT-3.5 (model: gpt-3.5-turbo), LongChat (model: longchat-7b-16k), and GPT-2, and packages such as the trl package from Hugging Face and SimCSE, but it does not provide explicit version numbers for these software packages or libraries.
Experiment Setup | Yes | We set the maximum number of trials to 5, the temperature of GPT-3.5 to 0, and the temperature of LongChat to 0.9. For SFT training, we tune the epoch in {1, 2, 3, 4}, batch size in {64, 128, 256}, and learning rate in {1e-4, 2e-4, 3e-4, 5e-4} through grid search on a validation set with 100 instances, while for counterfactual PPO, we change the search range of learning rate to {1e-5, 2e-5, 3e-5, 5e-5}. As for the reward model, we set the learning rate to 5e-5, training epochs to 3, and batch size to 16. We set the temperature of both GPT-3.5 and LongChat to 0 during the test phase to ensure reproducibility.
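The three benchmarks cited above are publicly available. The sketch below shows one way to pull HotpotQA and GSM8K with the Hugging Face datasets library; the Hub IDs, configuration names, and splits are assumptions rather than details taken from the paper, and Checkmate in One Move is a BIG-bench task that may need its own loader.

```python
# Minimal loading sketch with Hugging Face `datasets`.
# Dataset IDs and config names below are assumptions, not taken from the paper,
# and may vary across `datasets` versions.
from datasets import load_dataset

# HotpotQA (multi-hop QA); "distractor" is one of the standard configurations.
hotpotqa = load_dataset("hotpot_qa", "distractor")

# GSM8K (grade-school math word problems); "main" is the standard configuration.
gsm8k = load_dataset("gsm8k", "main")

# Checkmate in One Move originates from BIG-bench; it may have to be obtained
# from the BIG-bench repository (https://github.com/google/BIG-bench) instead.

print(hotpotqa["validation"][0]["question"])
print(gsm8k["test"][0]["question"])
```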
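The quoted hyperparameter search is a plain grid search scored on a 100-instance validation set. A minimal sketch of that loop follows, using only the reported search spaces; train_and_evaluate is a hypothetical stand-in for one SFT or counterfactual-PPO run and is not part of the released code.

```python
# Sketch of the reported grid search; `train_and_evaluate` is a hypothetical
# helper that runs one training configuration and returns its validation score.
from itertools import product

SFT_GRID = {
    "epochs": [1, 2, 3, 4],
    "batch_size": [64, 128, 256],
    "learning_rate": [1e-4, 2e-4, 3e-4, 5e-4],
}
# For counterfactual PPO only the learning-rate range changes.
PPO_GRID = {**SFT_GRID, "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5]}

# Reward-model hyperparameters are fixed rather than searched (from the quote above).
REWARD_MODEL_CONFIG = {"learning_rate": 5e-5, "epochs": 3, "batch_size": 16}

def grid_search(grid, train_and_evaluate):
    """Exhaustively try every configuration and keep the best-scoring one."""
    best_score, best_cfg = float("-inf"), None
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_evaluate(cfg)  # score on the 100-instance validation set
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```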
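PPO training reportedly relies on the trl package from Hugging Face, with no version stated. The sketch below follows trl's documented PPOConfig/PPOTrainer pattern (roughly the 0.7-era API); class and method signatures differ across releases, and the base checkpoint, batch size, and constant reward here are placeholders rather than the paper's settings.

```python
# Hedged sketch of PPO fine-tuning via trl's classic PPOConfig/PPOTrainer interface.
# Base checkpoint, batch size, and the constant reward are illustrative placeholders.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base_model = "gpt2"  # placeholder checkpoint, not the paper's reflector model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)

# Learning rate drawn from the reported PPO search range {1e-5, 2e-5, 3e-5, 5e-5}.
config = PPOConfig(learning_rate=2e-5, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query_tensor = tokenizer.encode("Reflect on the failed trial:", return_tensors="pt")
response = ppo_trainer.generate(
    [query_tensor[0]], return_prompt=False, max_new_tokens=32, do_sample=True
)
reward = [torch.tensor(1.0)]  # would come from the learned reward model
stats = ppo_trainer.step([query_tensor[0]], [response[0]], reward)
```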
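Temperature 0 at test time amounts to deterministic decoding for both models. The sketch assumes the v1-style OpenAI Python client for gpt-3.5-turbo and a local transformers checkpoint for longchat-7b-16k; neither serving choice is specified in the paper.

```python
# Sketch: deterministic decoding for the test phase (temperature 0 for both models).
# Assumes the v1-style OpenAI client and a local Hugging Face checkpoint for LongChat.
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-3.5 with temperature 0 (greedy; reproducible up to API-side nondeterminism).
client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[{"role": "user", "content": "Question: ..."}],
)
print(reply.choices[0].message.content)

# LongChat: temperature 0 corresponds to greedy decoding (do_sample=False).
tok = AutoTokenizer.from_pretrained("lmsys/longchat-7b-16k")
lm = AutoModelForCausalLM.from_pretrained("lmsys/longchat-7b-16k")
inputs = tok("Question: ...", return_tensors="pt")
output = lm.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tok.decode(output[0], skip_special_tokens=True))
```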