Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
The Consensus Game: Language Model Generation via Equilibrium Search
Authors: Athul Paul Jacob, Yikang Shen, Gabriele Farina, Jacob Andreas
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Applied to a large number of tasks (including reading comprehension, commonsense reasoning, mathematical problem-solving, and dialog), EQUILIBRIUM-RANKING consistently, and sometimes substantially, improves performance over existing LM decoding procedures on multiple benchmarks; for example, applying EQUILIBRIUM-RANKING to LLaMA-7B outperforms the much larger LLaMA-65B and PaLM-540B models. These results highlight the promise of game-theoretic tools for addressing fundamental challenges of truthfulness and consistency in LMs. |
| Researcher Affiliation | Collaboration | Athul Paul Jacob (MIT), Yikang Shen (MIT-IBM AI Lab), Gabriele Farina (MIT), Jacob Andreas (MIT) |
| Pseudocode | No | The paper includes mathematical equations (e.g., (1) and (2)) for policy updates but does not present them within a structured pseudocode or algorithm block. |
| Open Source Code | No | The paper does not contain an explicit statement offering access to its source code, nor does it provide a link to a code repository for the described methodology. |
| Open Datasets | Yes | MMLU (Hendrycks et al., 2020), ARC (Clark et al., 2018), RACE (Lai et al., 2017), HHH (Askell et al., 2021), and TruthfulQA (Lin et al., 2022). |
| Dataset Splits | No | The paper mentions using established benchmarks and evaluating on their test sets, but it does not explicitly detail the training, validation, and test dataset splits (e.g., specific percentages or sample counts) within the main text. |
| Hardware Specification | No | The paper states, "We use the 7B and 13B parameter models from the LLaMA family (Touvron et al., 2023) and perform 16-bit inference for all our experiments," but it does not specify any particular hardware details such as GPU models, CPU types, or memory used for these experiments. |
| Software Dependencies | No | The paper mentions the use of LLaMA models and 16-bit inference but does not specify any particular software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow, or CUDA versions) required for reproducibility. |
| Experiment Setup | Yes | EQUILIBRIUM-RANKING has four parameters: ηD, λD, ηG, and λG. Although tuning these parameters would lead to better performance, in all our experiments we set ηD = λD = ηG = λG = 0.1. We run EQUILIBRIUM-RANKING for 5000 iterations. |
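The setup row above names the reported hyperparameters (ηD = λD = ηG = λG = 0.1, 5000 iterations of equilibrium search). As an illustration of what regularized no-regret dynamics of this kind can look like, here is a minimal piKL-style sketch in NumPy. It is not the paper's exact update rule: the mutual-endorsement payoff, the function names (`equilibrium_ranking`, `_pikl`), and the single shared (η, λ) pair are assumptions made for illustration only.

```python
import numpy as np

def _pikl(log_p0, cum_q, t, eta, lam):
    # KL-regularized multiplicative-weights step (piKL-style):
    # trades off following cumulative rewards against staying close
    # to the anchor policy whose log-probabilities are log_p0.
    logits = (eta * cum_q + lam * log_p0) / (t * eta + lam)
    logits -= logits.max()  # numerical stability before exponentiation
    p = np.exp(logits)
    return p / p.sum()

def equilibrium_ranking(log_pg0, log_pd0, iters=5000, eta=0.1, lam=0.1):
    """Toy consensus-game dynamics over n candidate answers (a sketch,
    not the paper's exact procedure).

    log_pg0: generator's (LM's) log-scores for each candidate answer
    log_pd0: discriminator's log-scores that each candidate is correct
    Returns the final generator distribution, used to rank candidates.
    """
    log_pg0 = np.asarray(log_pg0, dtype=float)
    log_pd0 = np.asarray(log_pd0, dtype=float)
    n = len(log_pg0)
    cum_g, cum_d = np.zeros(n), np.zeros(n)
    # Normalize the anchor log-scores into initial distributions.
    pg = np.exp(log_pg0 - np.logaddexp.reduce(log_pg0))
    pd = np.exp(log_pd0 - np.logaddexp.reduce(log_pd0))
    for t in range(1, iters + 1):
        # Assumed payoff: each player is rewarded for candidates the
        # other player currently endorses (mutual consistency).
        cum_g += pd
        cum_d += pg
        pg = _pikl(log_pg0, cum_g, t, eta, lam)
        pd = _pikl(log_pd0, cum_d, t, eta, lam)
    return pg
```

With both anchor policies favoring the same candidate, the dynamics concentrate mass on it while the λ terms keep both players regularized toward their initial LM scores, which matches the motivation the paper gives for its equilibrium search.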