Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Consensus Game: Language Model Generation via Equilibrium Search
Authors: Athul Paul Jacob, Yikang Shen, Gabriele Farina, Jacob Andreas
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applied to a large number of tasks (including reading comprehension, commonsense reasoning, mathematical problem-solving, and dialog), EQUILIBRIUM-RANKING consistently, and sometimes substantially, improves performance over existing LM decoding procedures on multiple benchmarks; for example, applying EQUILIBRIUM-RANKING to LLaMA-7B outperforms the much larger LLaMA-65B and PaLM-540B models. These results highlight the promise of game-theoretic tools for addressing fundamental challenges of truthfulness and consistency in LMs. |
| Researcher Affiliation | Collaboration | Athul Paul Jacob (MIT), Yikang Shen (MIT-IBM AI Lab), Gabriele Farina (MIT), Jacob Andreas (MIT) |
| Pseudocode | No | The paper includes mathematical equations (e.g., (1) and (2)) for policy updates but does not present them within a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement offering access to its source code, nor does it provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | MMLU (Hendrycks et al., 2020), ARC (Clark et al., 2018), RACE (Lai et al., 2017), HHH (Askell et al., 2021), and TruthfulQA (Lin et al., 2022). |
| Dataset Splits | No | The paper mentions using established benchmarks and evaluating on their test sets, but it does not explicitly detail the training, validation, and test dataset splits (e.g., specific percentages or sample counts) within the main text. |
| Hardware Specification | No | The paper states, "We use the 7B and 13B parameter models from the LLaMA family (Touvron et al., 2023) and perform 16-bit inference for all our experiments," but it does not specify any particular hardware details such as GPU models, CPU types, or memory used for these experiments. |
| Software Dependencies | No | The paper mentions the use of LLaMA models and 16-bit inference but does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or CUDA versions) required for reproducibility. |
| Experiment Setup | Yes | EQUILIBRIUM-RANKING has four parameters: ηD, λD, ηG, and λG. Although tuning these parameters would lead to better performance, in all our experiments we set ηD = λD = ηG = λG = 0.1. We run EQUILIBRIUM-RANKING for 5000 iterations. |
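The setup row above names the reported hyperparameters (ηD = λD = ηG = λG = 0.1, 5000 iterations of equilibrium search). As an illustration of what regularized no-regret dynamics of this kind can look like, here is a minimal piKL-style sketch in NumPy. It is not the paper's exact update rule: the mutual-endorsement payoff, the function names (`equilibrium_ranking`, `_pikl`), and the single shared (η, λ) pair are assumptions made for illustration only.

```python
import numpy as np

def _pikl(log_p0, cum_q, t, eta, lam):
    # KL-regularized multiplicative-weights step (piKL-style):
    # trades off following cumulative rewards against staying close
    # to the anchor policy whose log-probabilities are log_p0.
    logits = (eta * cum_q + lam * log_p0) / (t * eta + lam)
    logits -= logits.max()  # numerical stability before exponentiation
    p = np.exp(logits)
    return p / p.sum()

def equilibrium_ranking(log_pg0, log_pd0, iters=5000, eta=0.1, lam=0.1):
    """Toy consensus-game dynamics over n candidate answers (a sketch,
    not the paper's exact procedure).

    log_pg0: generator's (LM's) log-scores for each candidate answer
    log_pd0: discriminator's log-scores that each candidate is correct
    Returns the final generator distribution, used to rank candidates.
    """
    log_pg0 = np.asarray(log_pg0, dtype=float)
    log_pd0 = np.asarray(log_pd0, dtype=float)
    n = len(log_pg0)
    cum_g, cum_d = np.zeros(n), np.zeros(n)
    # Normalize the anchor log-scores into initial distributions.
    pg = np.exp(log_pg0 - np.logaddexp.reduce(log_pg0))
    pd = np.exp(log_pd0 - np.logaddexp.reduce(log_pd0))
    for t in range(1, iters + 1):
        # Assumed payoff: each player is rewarded for candidates the
        # other player currently endorses (mutual consistency).
        cum_g += pd
        cum_d += pg
        pg = _pikl(log_pg0, cum_g, t, eta, lam)
        pd = _pikl(log_pd0, cum_d, t, eta, lam)
    return pg
```

With both anchor policies favoring the same candidate, the dynamics concentrate mass on it while the λ terms keep both players regularized toward their initial LM scores, which matches the motivation the paper gives for its equilibrium search.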