Benchmarking and Improving Generator-Validator Consistency of Language Models
Authors: Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto, Percy Liang
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Evaluated across 6 tasks, including math questions, knowledge-intensive QA, and instruction following, our method improves generator quality by an average of 16% and validator accuracy by an average of 6.3% across all tasks. |
| Researcher Affiliation | Academia | Xiang Lisa Li, Vaishnavi Shrivastava, Siyan Li, Tatsunori Hashimoto, Percy Liang; Stanford University, Columbia University; {xlisali, vaish1, thashim}@stanford.edu, siyan.li@columbia.edu, pliang@cs.stanford.edu |
| Pseudocode | No | The paper describes the steps for consistency fine-tuning using text and diagrams (Figure 2), but it does not provide formal pseudocode or a clearly labeled algorithm block. |
| Open Source Code | Yes | Code is available at https://github.com/XiangLi1999/GV-consistency |
| Open Datasets | Yes | We evaluate GV-consistency score on 6 aforementioned tasks (§3): arithmetic (Lin et al., 2022), plan arithmetic (Bubeck et al., 2023), question answering (Joshi et al., 2017), harmful questions (Perez et al., 2022), prompt prioritization, and style transfer (Reif et al., 2022; Li et al., 2018a). |
| Dataset Splits | No | The paper uses existing benchmarks and does not specify explicit training/validation/test data splits in terms of percentages or counts. While it discusses 'validation' in the context of its validator component, it does not describe a held-out validation split of the data. |
| Hardware Specification | Yes | Each fine-tuning experiment was run on 8 A100 machines. Our fine-tuning is conducted on 8 A100 GPUs of 80GB memory, and we use Deepspeed Stage 3 to ensure the 30B model fits on GPU. |
| Software Dependencies | No | The paper states 'Our implementation is based on Hugging Face Transformer (Wolf et al., 2020), and the PEFT (Mangrulkar et al., 2022) library.' but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use a LoRA low-rank dimension of 32, a learning rate of 2e-4, and a batch size of 64 (see more details in Appendix F). We finetune the Alpaca models using the AdamW optimizer and a cosine learning rate schedule. We use a warmup ratio of 0.03, learning rate of 2e-4, batch size of 64 (with gradient accumulation steps of 8 and 8 GPU machines). We use epoch size of 3 for arithmetic because it has an abundance of training data, and we use epoch size of 6 for all other tasks. As noted in §5, we finetune the 30B model using parameter-efficient approaches (Li & Liang, 2021; Hu et al., 2022; Houlsby et al., 2019) like LoRA with low-rank dimension of 32 and α of 32. (See the configuration sketch below this table.) |
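
The hyperparameters reported above map directly onto the Hugging Face Transformers and PEFT libraries that the paper says it builds on. Below is a minimal, hypothetical sketch of such a configuration, not the authors' released code: the model path, output directory, and DeepSpeed config filename are placeholders, and the per-device batch size is inferred from the stated totals rather than quoted from the paper.

```python
# Hypothetical sketch of the reported fine-tuning configuration using
# Hugging Face Transformers + PEFT (not the authors' released code).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

model_name = "path/to/alpaca-30b"  # placeholder: the paper fine-tunes Alpaca models

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

# LoRA with low-rank dimension 32 and alpha 32, as reported in the paper.
peft_config = LoraConfig(r=32, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()

# AdamW with a cosine schedule, warmup ratio 0.03, learning rate 2e-4, and an
# effective batch size of 64 (8 GPUs x per-device batch 1 x 8 gradient-accumulation
# steps, matching the stated "gradient accumulation steps of 8 and 8 GPU machines").
training_args = TrainingArguments(
    output_dir="gv-consistency-ft",     # placeholder
    per_device_train_batch_size=1,      # inferred, not quoted from the paper
    gradient_accumulation_steps=8,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    optim="adamw_torch",
    num_train_epochs=3,                 # 3 for arithmetic, 6 for the other tasks
    bf16=True,
    deepspeed="ds_zero3_config.json",   # placeholder: ZeRO Stage 3 so the 30B model fits on 8x A100 80GB
)
# These arguments would then be passed to a transformers.Trainer together with
# the consistency-filtered training data described in the paper.
```

The sketch stops before constructing a `Trainer` because the consistency-filtered training data comes from the paper's own generator/validator filtering pipeline, which is not reproduced here.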