Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

More Agents Is All You Need

Authors: Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye

TMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct comprehensive experiments on a wide range of LLM benchmarks to verify the presence of our finding, and to study the properties that can facilitate its occurrence.
Researcher Affiliation Industry Junyou Li EMAIL Tencent Qin Zhang EMAIL Tencent Yangbin Yu EMAIL Tencent Qiang Fu EMAIL Tencent Deheng Ye EMAIL Tencent
Pseudocode Yes Algorithm 1 Agent Forest
Require: Query x, number of samples N, LLM M (or LLM integrated with other methods, f_M(x))
1: Initialize an empty set for samples: S ← ∅
2: for i = 1 to N do
3:   Generate sample s_i ← M(x) (or s_i ← f_M(x))
4:   Add the sample to the set: S ← S ∪ {s_i}
5: end for
6: for each sample s_i in S do
7:   Initialize similarity score V(s_i) ← 0
8:   for each sample s_j in S do
9:     if i ≠ j then
10:      V(s_i) ← V(s_i) + sim(s_i, s_j)
11:    end if
12:  end for
13: end for
14: A ← argmax_{s_i ∈ S} V(s_i)
15: return A
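The sampling-and-voting procedure above can be sketched in Python. This is an illustrative sketch, not the authors' implementation: `generate` stands in for the LLM call M(x), and the similarity function `sim` is passed as a parameter (the paper uses BLEU for it).

```python
from typing import Callable, List


def agent_forest(generate: Callable[[str], str],
                 sim: Callable[[str, str], float],
                 query: str, n: int) -> str:
    """Sketch of Algorithm 1 ("Agent Forest"): sample N answers, then vote."""
    # Sampling phase: draw N candidate answers from the model.
    samples: List[str] = [generate(query) for _ in range(n)]
    # Voting phase: score each sample by its cumulative similarity
    # to every other sample.
    scores = [sum(sim(si, sj) for j, sj in enumerate(samples) if i != j)
              for i, si in enumerate(samples)]
    # Return the sample with the highest cumulative similarity (argmax).
    return samples[scores.index(max(scores))]
```

With an exact-match similarity (1.0 if two answers are identical, else 0.0), the voting phase reduces to simple majority voting over the N samples.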
Open Source Code Yes Our code is publicly available at: https://github.com/MoreAgentsIsAllYouNeed/AgentForest.
Open Datasets Yes Arithmetic Reasoning. Similar to Wang et al. (2023b); Fu et al. (2023); Du et al. (2023), we select the GSM8K Cobbe et al. (2021a) as one of the test sets. Additionally, we select the more challenging MATH dataset Hendrycks et al. (2021b), which is used by Wu et al. (2023). General Reasoning. Similar to Du et al. (2023); Jiang et al. (2023), we select the MMLU Hendrycks et al. (2021a). Additionally, we select the dataset from the chess state tracking task (Chess), which is used by Du et al. (2023); Zhang et al. (2023). Code Generation. Similar to Liu et al. (2023), we select the HumanEval Chen et al. (2021).
Dataset Splits No Arithmetic Reasoning. Similar to Wang et al. (2023b); Fu et al. (2023); Du et al. (2023), we select the GSM8K Cobbe et al. (2021a) as one of the test sets. Additionally, we select the more challenging MATH dataset Hendrycks et al. (2021b), which is used by Wu et al. (2023). (This only mentions "test sets" but not how the datasets are split for training, validation, or testing, or if they use predefined splits for all mentioned datasets.)
Hardware Specification No Language models adopted We evaluate our method using language models of different scales from the Llama2 Touvron et al. (2023) and GPT series OpenAI (2022). Specifically, we evaluate two versions of Llama2-Chat, optimized for conversational use cases through alignment techniques, with model sizes of 13B and 70B parameters. Additionally, we include GPT-3.5-Turbo and GPT-4 in our evaluation. (This describes the models used, not the hardware they ran on.)
Software Dependencies No "To implement our method, we compute the BLEU score Papineni et al. (2002) among all pairs of generated candidate answers." and "In the voting phase, we compute the BLEU score using sacreBLEU Post (2018) to evaluate the similarity between each of the generated samples." (While these mention software, they do not include specific version numbers for Python, sacreBLEU, or any other critical dependencies.)
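The pairwise BLEU-based voting described above can be illustrated with a toy stand-in. The paper computes sentence-level BLEU with sacreBLEU; the sketch below substitutes clipped unigram precision (a simplified ingredient of BLEU) purely for illustration, so that the voting logic is self-contained.

```python
from collections import Counter
from typing import List


def unigram_precision(candidate: str, reference: str) -> float:
    # Clipped unigram precision: a simplified stand-in for the
    # sentence-level BLEU score the paper computes with sacreBLEU.
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    clipped = sum((cand & ref).values())  # overlap, clipped per token type
    total = sum(cand.values())
    return clipped / total if total else 0.0


def most_similar(samples: List[str]) -> str:
    # Voting phase: return the sample with the highest cumulative
    # similarity to all other generated samples.
    scores = [sum(unigram_precision(si, sj)
                  for j, sj in enumerate(samples) if i != j)
              for i, si in enumerate(samples)]
    return samples[scores.index(max(scores))]
```

In the actual method, `unigram_precision` would be replaced by a call to sacreBLEU's sentence-level BLEU between each pair of candidate answers.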
Experiment Setup Yes "Detailed experimental settings are provided in Appendix A. In all experiments involving GPT-3.5-Turbo presented in Section 4, we utilize the model version gpt-3.5-turbo-0613. In Table 2, the notation GPT-4 corresponds to the model version gpt-4-0613. For the experiments conducted with GPT-3.5-Turbo in Section 6, we employ the model version gpt-3.5-turbo-1106 with JSON mode enabled. Similarly, GPT-4 in this context refers to gpt-4-1106-preview operating in JSON mode." and "The effectiveness of our method is evaluated by averaging the results across 10 independent runs. During each run, we scale up the ensemble size to 40 to ensure maximum gains. However, when integrating our method with Debate Du et al. (2023), the ensemble size is limited to 10 due to the significant computational overhead introduced by the communication architecture."