Reflective Multi-Agent Collaboration based on Large Language Models
Authors: Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, Ji-Rong Wen
NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct experiments on three datasets to evaluate the performance of our model in multi-hop question answering, mathematics, and chess scenarios. Experimental results show that COPPER possesses stronger reflection capabilities and exhibits excellent generalization performance across different actor models. |
| Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China 2 Huawei Noah's Ark Lab {xiaohe,zeyuzhang,xueyangfeng,wanglei154,lirui121200,xu.chen,jrwen}@ruc.edu.cn, daiquanyu@huawei.com |
| Pseudocode | No | The paper describes the processes and framework but does not contain any formally labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | Our code can be found at: https://anonymous.4open.science/r/copper-F72A/ |
| Open Datasets | Yes | We choose HotPotQA [36], GSM8K [7], and Checkmate in One Move [28] to evaluate the collaborative abilities of multi-agent systems in multi-hop question answering, mathematics, and chess. |
| Dataset Splits | Yes | For SFT training, we tune the epoch in {1, 2, 3, 4}, batch size in {64, 128, 256}, and learning rate in {1e-4, 2e-4, 3e-4, 5e-4} through grid search on a validation set with 100 instances, while for counterfactual PPO, we change the search range of learning rate to {1e-5, 2e-5, 3e-5, 5e-5}. |
| Hardware Specification | Yes | We conduct all experiments on four NVIDIA A800-80G GPUs. |
| Software Dependencies | No | The paper mentions specific models like GPT-3.5 (model: gpt-3.5-turbo), LongChat (model: longchat-7b-16k), and GPT-2, and packages like the "trl package of Hugging Face" and "SimCSE", but it does not provide explicit version numbers for these software packages or libraries. |
| Experiment Setup | Yes | We set the maximum number of trials to 5, the temperature of GPT-3.5 to 0, and the temperature of LongChat to 0.9. For SFT training, we tune the epoch in {1, 2, 3, 4}, batch size in {64, 128, 256}, and learning rate in {1e-4, 2e-4, 3e-4, 5e-4} through grid search on a validation set with 100 instances, while for counterfactual PPO, we change the search range of learning rate to {1e-5, 2e-5, 3e-5, 5e-5}. As for the reward model, we set learning rate to 5e-5, training epoch to 3 and batch size to 16. We set the temperature of both GPT-3.5 and LongChat to 0 during the test phase to ensure reproducibility. |
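The tuning procedure quoted in the Experiment Setup row amounts to a grid search over the reported hyperparameter ranges, scored on a 100-instance validation set. The sketch below is a minimal illustration of that search, not code from the authors' repository: the `evaluate` callable and all names are assumptions standing in for running SFT or counterfactual PPO training and measuring validation performance.

```python
# Hedged sketch of the reported hyperparameter search (illustrative only;
# names and the evaluate() callable are assumptions, not the authors' code).
from itertools import product

# Ranges quoted in the paper for SFT grid search.
SFT_GRID = {
    "epochs": [1, 2, 3, 4],
    "batch_size": [64, 128, 256],
    "learning_rate": [1e-4, 2e-4, 3e-4, 5e-4],
}
# Counterfactual PPO reuses the grid but with a smaller learning-rate range.
PPO_GRID = {**SFT_GRID, "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5]}
# Fixed settings reported for the reward model.
REWARD_MODEL_CONFIG = {"learning_rate": 5e-5, "epochs": 3, "batch_size": 16}


def grid_search(grid, evaluate):
    """Return the best-scoring configuration from a hyperparameter grid.

    `evaluate` maps a config dict to a score on the validation set
    (here assumed to be the 100-instance set mentioned in the paper).
    """
    best_cfg, best_score = None, float("-inf")
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = evaluate(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg
```

As a usage example, `grid_search(SFT_GRID, evaluate)` would try all 4 × 3 × 4 = 48 SFT configurations, while the PPO search only swaps in the smaller learning-rate range.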