Language Model Decoding as Direct Metrics Optimization
Authors: Haozhe Ji, Pei Ke, Hongning Wang, Minlie Huang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various domains and model scales demonstrate the superiority of our method in metrics alignment with human texts and human evaluation over strong baselines. |
| Researcher Affiliation | Academia | Haozhe Ji, Pei Ke, Hongning Wang, Minlie Huang; The CoAI Group, DCST, BNRist, Tsinghua University, Beijing 100084, China; jihaozhe@gmail.com, aihuang@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: µ_opt estimation with WIS (weighted importance sampling) ... Algorithm 2: Conditional Sampling with SIR (sampling-importance-resampling; a hedged sketch of this step follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing open-source code for DAEMON, nor does it include a direct link to a code repository for its methodology. It mentions using official implementations for baselines in footnote 4, but not for their own method. |
| Open Datasets | Yes | We evaluate our method on the Wikipedia and News domain for open-ended text generation. For the Wikipedia domain, the data comes from documents in the Wikitext-103 corpus (Merity et al., 2017). For the News domain, the data comes from news articles in Wikinews (footnote 1: http://www.wikinews.org). |
| Dataset Splits | Yes | We follow the data pre-processing procedure suggested by Li et al. (2022), and randomly select 512 samples as the development set for hyper-parameter tuning for all decoding methods. The data statistics of each domain and detailed data pre-processing steps are provided in Appendix J. ... Table 7: Statistics of the data used in the experiments. ... For each example, the first 32 words are used as the prefix. ... During evaluation, the subword length of human references is also truncated to 256 for a reasonable comparison. (An illustrative preprocessing sketch follows the table.) |
| Hardware Specification | Yes | To evaluate the real runtime performance, we follow the setting in Section 3.6 which uses GPT2-XL to generate completions with maximum length of 256 on the test set of Wikitext. The experiment was done on a Tesla V100. ... The inference of DAEMON with the base model of either GPT-2 XL or OPT-6.7B can be done on a Tesla V100 with a batch size of 1. |
| Software Dependencies | No | The paper mentions using specific pre-trained models like 'GPT-2 XL (1.5B)' and 'OPT-6.7B' as base models and 'SimCSE' based on 'RoBERTa' for coherence calculation. However, it does not specify version numbers for these or any other underlying software, libraries, or programming languages (e.g., Python, PyTorch, TensorFlow versions) that would be needed for reproducibility. |
| Experiment Setup | Yes | To demonstrate the effectiveness of our method across different language model families and scales, we consider GPT-2 XL (1.5B) (Radford et al., 2019) and OPT-6.7B (Zhang et al., 2022) as the base models for all decoding methods. For baselines, we follow the hyper-parameter settings in the original papers, which are shown to work well in general. For DAEMON in the main results, we use the nine metrics (described in Section 3.2) in the constraints. During sampling, we set the size of the candidate set from the proposal model to M = 25, as it balances efficiency and performance. We set τ = 0.97 for the Wikipedia domain and τ = 0.99 for the News domain. We leave more implementation details of the baselines and DAEMON in Appendix J.2. |
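
The conditional sampling step referenced in the Pseudocode row (Algorithm 2) follows the standard sampling-importance-resampling pattern: draw M candidate continuations from a proposal language model, reweight them, and resample one. The sketch below is illustrative only, not the authors' released code: `proposal_model`, `tokenizer`, and `energy_fn` are placeholder names, and it assumes the target decoding distribution is the proposal distribution reweighted by exp(-energy), so that the proposal density cancels and the self-normalized importance weights reduce to a softmax over the negative candidate energies.

```python
import torch


def sir_decode(proposal_model, tokenizer, prefix, energy_fn, M=25, max_new_tokens=256):
    """Sampling-importance-resampling (SIR) style conditional decoding (sketch).

    `energy_fn` is a placeholder scoring function; the paper's exact energy
    parameterization is not reproduced here.
    """
    inputs = tokenizer(prefix, return_tensors="pt")

    # 1. Propose M candidate continuations from the base language model.
    candidates = proposal_model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=M,
        max_new_tokens=max_new_tokens,
    )
    texts = tokenizer.batch_decode(candidates, skip_special_tokens=True)

    # 2. Self-normalized importance weights. Assuming the target distribution is
    #    proportional to proposal(x) * exp(-E(x)), the proposal density cancels
    #    and the weights are proportional to exp(-E(x)).
    energies = torch.tensor([energy_fn(t) for t in texts])
    weights = torch.softmax(-energies, dim=0)

    # 3. Resample one continuation in proportion to its weight.
    idx = torch.multinomial(weights, num_samples=1).item()
    return texts[idx]
```

With M = 25 candidates per prefix, as in the Experiment Setup row, each output costs roughly 25 sampled continuations plus one scoring pass per candidate, which is the efficiency/quality trade-off the paper alludes to when choosing M.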
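
The preprocessing quoted in the Dataset Splits row (a 512-example development set, 32-word prefixes, references truncated to 256 subwords) can be written as a short script. This is a sketch under those stated numbers, not the authors' pipeline; in particular, treating the remainder of each document as the human reference is an assumption made here for illustration.

```python
import random

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


def build_splits(documents, prefix_words=32, max_ref_subwords=256, dev_size=512, seed=0):
    """Illustrative preprocessing: 512-example dev split, 32-word prefixes,
    and references truncated to 256 subword tokens for evaluation."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    dev_docs, test_docs = docs[:dev_size], docs[dev_size:]

    def to_example(doc):
        words = doc.split()
        prefix = " ".join(words[:prefix_words])
        # Assumption: the human reference is the rest of the document.
        reference = " ".join(words[prefix_words:])
        # Truncate the reference to max_ref_subwords so human text and model
        # completions are compared at the same maximum length.
        ref_ids = tokenizer(reference)["input_ids"][:max_ref_subwords]
        return {"prefix": prefix, "reference": tokenizer.decode(ref_ids)}

    return [to_example(d) for d in dev_docs], [to_example(d) for d in test_docs]
```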