Benchmarking and Improving Generator-Validator Consistency of Language Models
Authors: Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto, Percy Liang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated across 6 tasks, including math questions, knowledge-intensive QA, and instruction following, our method improves generator quality by an average of 16% and validator accuracy by an average of 6.3% across all tasks. |
| Researcher Affiliation | Academia | Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto, Percy Liang; Stanford University, Columbia University; {xlisali, vaish1, thashim}@stanford.edu, siyan.li@columbia.edu, pliang@cs.stanford.edu |
| Pseudocode | No | The paper describes the steps for consistency fine-tuning using text and diagrams (Figure 2), but it does not provide formal pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/XiangLi1999/GV-consistency |
| Open Datasets | Yes | We evaluate GV-consistency score on 6 aforementioned tasks (§3): arithmetic (Lin et al., 2022), plan arithmetic (Bubeck et al., 2023), question answering (Joshi et al., 2017), harmful questions (Perez et al., 2022), prompt prioritization, and style transfer (Reif et al., 2022; Li et al., 2018a). |
| Dataset Splits | No | The paper uses existing benchmarks and does not specify explicit training/validation/test data splits in terms of percentages or counts. While it discusses 'validation' in the context of its validator component, it does not describe a held-out validation split of the data. |
| Hardware Specification | Yes | Each fine-tuning experiment was run on 8 A100 machines. Our fine-tuning is conducted on 8 A100 GPUs of 80GB memory, and we use Deepspeed Stage 3 to ensure the 30B model fits on GPU. |
| Software Dependencies | No | The paper states 'Our implementation is based on Hugging Face Transformer (Wolf et al., 2020), and the PEFT (Mangrulkar et al., 2022) library.' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use a LoRA low-rank dimension of 32, a learning rate of 2e-4, and a batch size of 64 (see more details in Appendix F). We finetune the Alpaca models using the AdamW optimizer and a cosine learning rate schedule. We use a warmup ratio of 0.03, learning rate of 2e-4, batch size of 64 (with gradient accumulation steps of 8 and 8 GPU machines). We use epoch size of 3 for arithmetic because it has an abundance of training data, and we use epoch size of 6 for all other tasks. As noted in §5, we finetune the 30B model using parameter-efficient approaches (Li & Liang, 2021; Hu et al., 2022; Houlsby et al., 2019) like LoRA with low-rank dimension of 32 and α of 32. (See the configuration sketch below this table.) |
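
The hyperparameters reported above map directly onto the Hugging Face Transformers and PEFT libraries that the paper says it builds on. Below is a minimal, hypothetical sketch of such a configuration, not the authors' released code: the model path, output directory, and DeepSpeed config filename are placeholders, and the per-device batch size is inferred from the stated totals rather than quoted from the paper.

```python
# Hypothetical sketch of the reported fine-tuning configuration using
# Hugging Face Transformers + PEFT (not the authors' released code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "path/to/alpaca-30b"  # placeholder: the paper fine-tunes Alpaca models

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA with low-rank dimension 32 and alpha 32, as reported in the paper.
peft_config = LoraConfig(r=32, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# AdamW with a cosine schedule, warmup ratio 0.03, learning rate 2e-4, and an
# effective batch size of 64 (8 GPUs x per-device batch 1 x 8 gradient-accumulation
# steps, matching the stated "gradient accumulation steps of 8 and 8 GPU machines").
training_args = TrainingArguments(
    output_dir="gv-consistency-ft",     # placeholder
    per_device_train_batch_size=1,      # inferred, not quoted from the paper
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch",
    num_train_epochs=3,                 # 3 for arithmetic, 6 for the other tasks
    bf16=True,
    deepspeed="ds_zero3_config.json",   # placeholder: ZeRO Stage 3 so the 30B model fits on 8x A100 80GB
)
# These arguments would then be passed to a transformers.Trainer together with
# the consistency-filtered training data described in the paper.
```

The sketch stops before constructing a `Trainer` because the consistency-filtered training data comes from the paper's own generator/validator filtering pipeline, which is not reproduced here.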