Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models

Authors: Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James R. Glass, Pengcheng He

ICLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments on TruthfulQA (Lin et al., 2022) and FACTOR (Muhlgay et al., 2023) demonstrate that DoLa is able to increase the truthfulness of models of the LLaMA family (Touvron et al., 2023).
Researcher Affiliation | Collaboration | Massachusetts Institute of Technology; Microsoft
Pseudocode | No | The paper describes its method using text and mathematical equations, but does not include a clearly labeled pseudocode or algorithm block.
Open Source Code | Yes | The source code is available at https://github.com/voidism/DoLa.
Open Datasets | Yes | For multiple choices, we use TruthfulQA (Lin et al., 2022) and FACTOR (News/Wiki) (Muhlgay et al., 2023) to assess LMs' factuality in short-answer/long-paragraph settings, respectively.
Dataset Splits | Yes | We use either two-fold validation (TruthfulQA-MC, FACTOR) or a validation set (GSM8K, StrategyQA) to select the best bucket.
Hardware Specification | Yes | We run all the experiments with NVIDIA V100 GPUs on machines equipped with 40-core Intel(R) Xeon(R) Platinum 8168 CPUs @ 2.70GHz.
Software Dependencies | No | The paper mentions using the Huggingface Transformers and Huggingface Accelerate packages, but does not specify their version numbers.
Experiment Setup | Yes | We set the adaptive plausibility constraint (α) to 0.1 and the repetition penalty (θ) to 1.2, following prior studies (Li et al., 2022; Keskar et al., 2019).
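To make the experiment-setup row concrete, here is a minimal, hedged sketch of how the adaptive plausibility constraint (α = 0.1) interacts with DoLa-style layer contrasting. The helper `dola_scores` is hypothetical (not the authors' implementation): it masks next-token candidates whose mature-layer probability falls below α times the top probability, then scores the survivors by the log-probability difference between a mature (final) layer and a premature (early) layer, as described in the paper.

```python
import math

def dola_scores(logits_mature, logits_premature, alpha=0.1):
    """Hypothetical sketch of DoLa contrastive scoring.

    logits_mature / logits_premature: next-token logits from the final
    and an early transformer layer. alpha is the adaptive plausibility
    constraint from the paper's setup (0.1).
    """
    def log_softmax(logits):
        lse = math.log(sum(math.exp(x) for x in logits))
        return [x - lse for x in logits]

    lp_mature = log_softmax(logits_mature)
    lp_premature = log_softmax(logits_premature)
    # Adaptive plausibility: keep tokens with p_mature >= alpha * max p_mature.
    threshold = math.log(alpha) + max(lp_mature)
    # Score survivors by the mature-vs-premature log-prob contrast.
    return [m - p if m >= threshold else -math.inf
            for m, p in zip(lp_mature, lp_premature)]

# Toy 4-token vocabulary: the mature layer is confident about token 0,
# while the premature layer also favors token 1.
mature = [4.0, 1.0, 0.5, -2.0]
premature = [2.0, 2.5, 0.5, -2.0]
scores = dola_scores(mature, premature)
print(scores.index(max(scores)))  # token 0 survives the mask and wins the contrast
```

Here only token 0 passes the α-mask, so it is selected; without the constraint, low-probability tokens could win purely because the premature layer dislikes them, which is the failure mode the constraint guards against.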