Let Models Speak Ciphers: Multiagent Debate through Embeddings
Authors: Chau Pham, Boyi Liu, Yingxiang Yang, Zhengyu Chen, Tianyi Liu, Jianbo Yuan, Bryan A. Plummer, Zhaoran Wang, Hongxia Yang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Remarkably, by deviating from natural language, CIPHER offers an advantage of encoding a broader spectrum of information without any modification to the model weights, outperforming the state-of-the-art LLM debate methods using natural language by 0.5–5.0% across five reasoning tasks and multiple open-source LLMs of varying sizes. |
| Researcher Affiliation | Collaboration | Chau Pham¹, Boyi Liu², Yingxiang Yang³, Zhengyu Chen³, Tianyi Liu³, Jianbo Yuan³, Bryan A. Plummer¹, Zhaoran Wang², Hongxia Yang³; ¹Boston University, ²Northwestern University, ³ByteDance Inc. |
| Pseudocode | Yes | Algorithm 1 CIPHER Debate; Algorithm 2 Multiagent Natural Language Debate (a hedged sketch of the embedding-level exchange these algorithms describe appears after this table) |
| Open Source Code | No | The paper discusses the use of open-source LLMs but does not state that the code for their proposed method (CIPHER) is open-source or provide a link to its implementation. |
| Open Datasets | Yes | We evaluate CIPHER Debate on five reasoning datasets that span across four different domains. (i) GSM8K (Cobbe et al., 2021) consists of a variety of grade school math problems created by human problem writers. (ii) MMLU (Hendrycks et al., 2020): we pick three datasets from three different categories, Formal Logic dataset from the Humanities category, High School Math dataset from the STEM category, and Professional Psychology dataset from the Social Science category. (iii) Arithmetic: following Du et al. (2023), we evaluate mathematical expressions comprising six unique two-digit numbers that include addition, multiplication, and subtraction operations. |
| Dataset Splits | Yes | For large datasets (GSM8K, Professional Psychology, and Arithmetic), we tune the temperature on a validation set of 200 sampled questions and evaluate on another 200 questions in a separate test set. |
| Hardware Specification | Yes | For LLaMA family debates, we use 4 NVIDIA A100 SXM 80GB GPUs as the major computation resource. |
| Software Dependencies | No | The paper mentions using Bayesian optimization (Nogueira, 2014) and various LLM models (e.g., LLaMA2-70B), but does not provide specific version numbers for software dependencies or libraries used for implementation. |
| Experiment Setup | Yes | To ensure fairness of our empirical evaluation, we utilize Bayesian optimization (Nogueira, 2014) to select the best performing temperatures for each method in our experiments in Section 4. Moreover, we conduct sensitivity analysis on the temperatures in Section 5.2. ... We combine few-shot examples with chain-of-thought prompting (Wei et al., 2022) and zero-shot instruction ("Let's think step by step") (Kojima et al., 2022) to encourage agents to generate both the final answer and the reasoning steps. ... See Appendix E for detailed prompts. ... In Appendix D, we provide the temperatures of the debaters in each of our experiments. |
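
The "Research Type" and "Pseudocode" rows summarize CIPHER's core idea: debaters exchange probability-weighted averages of token embeddings rather than sampled tokens, so the full output distribution is passed between agents. Since the paper does not release code (see the "Open Source Code" row), the snippet below is only a minimal sketch of that embedding-averaging step under assumed HuggingFace-style interfaces; the function names, the `inputs_embeds` calling convention, and the generation loop are illustrative assumptions, not the authors' implementation.

```python
import torch

def cipher_step(model, embed_matrix, input_embeds, temperature=1.0):
    # One hypothetical CIPHER generation step: instead of sampling a single
    # token, return the probability-weighted average of all token embeddings.
    logits = model(inputs_embeds=input_embeds).logits[:, -1, :]   # (batch, vocab)
    probs = torch.softmax(logits / temperature, dim=-1)
    cipher_embed = probs @ embed_matrix                           # (batch, hidden)
    return cipher_embed.unsqueeze(1)                              # (batch, 1, hidden)

def cipher_generate(model, embed_matrix, prompt_embeds, max_new_steps=64, temperature=1.0):
    # Autoregressively extend the prompt with cipher embeddings; another debater
    # that shares the same vocabulary and embedding table can consume the result
    # directly as its own `inputs_embeds`.
    # Typically: embed_matrix = model.get_input_embeddings().weight
    #            prompt_embeds = model.get_input_embeddings()(input_ids)
    seq = prompt_embeds
    for _ in range(max_new_steps):
        seq = torch.cat([seq, cipher_step(model, embed_matrix, seq, temperature)], dim=1)
    return seq
```

Because the exchanged vector is an expectation over the whole vocabulary rather than a single token, it retains the model's uncertainty, which is what the paper credits for the broader information bandwidth between agents.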
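The "Experiment Setup" and "Dataset Splits" rows state that debate temperatures were tuned with Bayesian optimization (Nogueira, 2014) on a 200-question validation split. Below is a minimal sketch of such a tuning loop using the `bayesian-optimization` package that citation refers to; the objective function, the toy stand-in body, and the search bounds and budget are assumptions for illustration, not values reported in the paper.

```python
from bayes_opt import BayesianOptimization  # Nogueira (2014) bayesian-optimization package

def validation_accuracy(temperature: float) -> float:
    # Hypothetical objective: run the debate at `temperature` on the 200-question
    # validation split and return accuracy. The toy quadratic below is a stand-in
    # so the sketch runs end to end; replace it with the real debate evaluation.
    return -(temperature - 0.8) ** 2

optimizer = BayesianOptimization(
    f=validation_accuracy,
    pbounds={"temperature": (0.0, 2.0)},  # assumed search range
    random_state=0,
)
optimizer.maximize(init_points=5, n_iter=20)  # assumed evaluation budget
best_temperature = optimizer.max["params"]["temperature"]
print(best_temperature)
```

The selected temperature would then be fixed per method and evaluated on the separate 200-question test set, as described in the "Dataset Splits" row.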