Reflective Multi-Agent Collaboration based on Large Language Models

Authors: Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, Ji-Rong Wen

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct experiments on three datasets to evaluate the performance of our model in multi-hop question answering, mathematics, and chess scenarios. Experimental results show that COPPER possesses stronger reflection capabilities and exhibits excellent generalization performance across different actor models.
Researcher Affiliation | Collaboration | 1 Gaoling School of Artificial Intelligence, Renmin University of China; 2 Huawei Noah's Ark Lab. {xiaohe,zeyuzhang,xueyangfeng,wanglei154,lirui121200,xu.chen,jrwen}@ruc.edu.cn, daiquanyu@huawei.com
Pseudocode | No | The paper describes the processes and framework but does not contain any formally labeled "Pseudocode" or "Algorithm" blocks.
Open Source Code | Yes | Our code can be found at: https://anonymous.4open.science/r/copper-F72A/
Open Datasets | Yes | We choose HotpotQA [36], GSM8K [7], and Checkmate in One Move [28] to evaluate the collaborative abilities of multi-agent systems in multi-hop question answering, mathematics, and chess.
Dataset Splits | Yes | For SFT training, we tune the epoch in {1, 2, 3, 4}, batch size in {64, 128, 256}, and learning rate in {1e-4, 2e-4, 3e-4, 5e-4} through grid search on a validation set with 100 instances, while for counterfactual PPO, we change the search range of learning rate to {1e-5, 2e-5, 3e-5, 5e-5}.
Hardware Specification | Yes | We conduct all experiments on four NVIDIA A800-80G GPUs.
Software Dependencies | No | The paper mentions specific models such as GPT-3.5 (model: gpt-3.5-turbo), LongChat (model: longchat-7b-16k), and GPT-2, and packages such as the trl package from Hugging Face and SimCSE, but it does not provide explicit version numbers for these software packages or libraries.
Experiment Setup | Yes | We set the maximum number of trials to 5, the temperature of GPT-3.5 to 0, and the temperature of LongChat to 0.9. For SFT training, we tune the epoch in {1, 2, 3, 4}, batch size in {64, 128, 256}, and learning rate in {1e-4, 2e-4, 3e-4, 5e-4} through grid search on a validation set with 100 instances, while for counterfactual PPO, we change the search range of learning rate to {1e-5, 2e-5, 3e-5, 5e-5}. As for the reward model, we set the learning rate to 5e-5, training epochs to 3, and batch size to 16. We set the temperature of both GPT-3.5 and LongChat to 0 during the test phase to ensure reproducibility.
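The three benchmarks cited above are publicly available. The sketch below shows one way to pull HotpotQA and GSM8K with the Hugging Face datasets library; the Hub IDs, configuration names, and splits are assumptions rather than details taken from the paper, and Checkmate in One Move is a BIG-bench task that may need its own loader.

```python
# Minimal loading sketch with Hugging Face `datasets`.
# Dataset IDs and config names below are assumptions, not taken from the paper,
# and may vary across `datasets` versions.
from datasets import load_dataset

# HotpotQA (multi-hop QA); "distractor" is one of the standard configurations.
hotpotqa = load_dataset("hotpot_qa", "distractor")

# GSM8K (grade-school math word problems); "main" is the standard configuration.
gsm8k = load_dataset("gsm8k", "main")

# Checkmate in One Move originates from BIG-bench; it may have to be obtained
# from the BIG-bench repository (https://github.com/google/BIG-bench) instead.

print(hotpotqa["validation"][0]["question"])
print(gsm8k["test"][0]["question"])
```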
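The quoted hyperparameter search is a plain grid search scored on a 100-instance validation set. A minimal sketch of that loop follows, using only the reported search spaces; train_and_evaluate is a hypothetical stand-in for one SFT or counterfactual-PPO run and is not part of the released code.

```python
# Sketch of the reported grid search; `train_and_evaluate` is a hypothetical
# helper that runs one training configuration and returns its validation score.
from itertools import product

SFT_GRID = {
    "epochs": [1, 2, 3, 4],
    "batch_size": [64, 128, 256],
    "learning_rate": [1e-4, 2e-4, 3e-4, 5e-4],
}
# For counterfactual PPO only the learning-rate range changes.
PPO_GRID = {**SFT_GRID, "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5]}

# Reward-model hyperparameters are fixed rather than searched (from the quote above).
REWARD_MODEL_CONFIG = {"learning_rate": 5e-5, "epochs": 3, "batch_size": 16}

def grid_search(grid, train_and_evaluate):
    """Exhaustively try every configuration and keep the best-scoring one."""
    best_score, best_cfg = float("-inf"), None
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        score = train_and_evaluate(cfg)  # score on the 100-instance validation set
        if score > best_score:
            best_score, best_cfg = score, cfg
    return best_cfg, best_score
```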
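PPO training reportedly relies on the trl package from Hugging Face, with no version stated. The sketch below follows trl's documented PPOConfig/PPOTrainer pattern (roughly the 0.7-era API); class and method signatures differ across releases, and the base checkpoint, batch size, and constant reward here are placeholders rather than the paper's settings.

```python
# Hedged sketch of PPO fine-tuning via trl's classic PPOConfig/PPOTrainer interface.
# Base checkpoint, batch size, and the constant reward are illustrative placeholders.
import torch
from transformers import AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

base_model = "gpt2"  # placeholder checkpoint, not the paper's reflector model
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained(base_model)

# Learning rate drawn from the reported PPO search range {1e-5, 2e-5, 3e-5, 5e-5}.
config = PPOConfig(learning_rate=2e-5, batch_size=1, mini_batch_size=1)
ppo_trainer = PPOTrainer(config, model, ref_model, tokenizer)

query_tensor = tokenizer.encode("Reflect on the failed trial:", return_tensors="pt")
response = ppo_trainer.generate(
    [query_tensor[0]], return_prompt=False, max_new_tokens=32, do_sample=True
)
reward = [torch.tensor(1.0)]  # would come from the learned reward model
stats = ppo_trainer.step([query_tensor[0]], [response[0]], reward)
```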
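Temperature 0 at test time amounts to deterministic decoding for both models. The sketch assumes the v1-style OpenAI Python client for gpt-3.5-turbo and a local transformers checkpoint for longchat-7b-16k; neither serving choice is specified in the paper.

```python
# Sketch: deterministic decoding for the test phase (temperature 0 for both models).
# Assumes the v1-style OpenAI client and a local Hugging Face checkpoint for LongChat.
from openai import OpenAI
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-3.5 with temperature 0 (greedy; reproducible up to API-side nondeterminism).
client = OpenAI()
reply = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[{"role": "user", "content": "Question: ..."}],
)
print(reply.choices[0].message.content)

# LongChat: temperature 0 corresponds to greedy decoding (do_sample=False).
tok = AutoTokenizer.from_pretrained("lmsys/longchat-7b-16k")
lm = AutoModelForCausalLM.from_pretrained("lmsys/longchat-7b-16k")
inputs = tok("Question: ...", return_tensors="pt")
output = lm.generate(**inputs, do_sample=False, max_new_tokens=128)
print(tok.decode(output[0], skip_special_tokens=True))
```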