Language Model Decoding as Direct Metrics Optimization
Authors: Haozhe Ji, Pei Ke, Hongning Wang, Minlie Huang
ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on various domains and model scales demonstrate the superiority of our method in metrics alignment with human texts and human evaluation over strong baselines. |
| Researcher Affiliation | Academia | Haozhe Ji, Pei Ke, Hongning Wang, Minlie Huang; The CoAI Group, DCST, BNRist, Tsinghua University, Beijing 100084, China; jihaozhe@gmail.com, aihuang@tsinghua.edu.cn |
| Pseudocode | Yes | Algorithm 1: µ_opt estimation with WIS (weighted importance sampling) ... Algorithm 2: Conditional Sampling with SIR (sampling-importance-resampling; a hedged sketch of this step follows the table) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing open-source code for DAEMON, nor does it include a direct link to a code repository for its methodology. It mentions using official implementations for baselines in footnote 4, but not for their own method. |
| Open Datasets | Yes | We evaluate our method on the Wikipedia and News domain for open-ended text generation. For the Wikipedia domain, the data comes from documents in the Wikitext-103 corpus (Merity et al., 2017). For the News domain, the data comes from news articles in Wikinews (footnote 1: http://www.wikinews.org). |
| Dataset Splits | Yes | We follow the data pre-processing procedure suggested by Li et al. (2022), and randomly select 512 samples as the development set for hyper-parameter tuning for all decoding methods. The data statistics of each domain and detailed data pre-processing steps are provided in Appendix J. ... Table 7: Statistics of the data used in the experiments. ... For each example, the first 32 words are used as the prefix. ... During evaluation, the subword length of human references is also truncated to 256 for a reasonable comparison. (An illustrative preprocessing sketch follows the table.) |
| Hardware Specification | Yes | To evaluate the real runtime performance, we follow the setting in Section 3.6 which uses GPT2-XL to generate completions with maximum length of 256 on the test set of Wikitext. The experiment was done on a Tesla V100. ... The inference of DAEMON with the base model of either GPT-2 XL or OPT-6.7B can be done on a Tesla V100 with a batch size of 1. |
| Software Dependencies | No | The paper mentions using specific pre-trained models like 'GPT-2 XL (1.5B)' and 'OPT-6.7B' as base models and 'SimCSE' based on 'RoBERTa' for coherence calculation. However, it does not specify version numbers for these or any other underlying software, libraries, or programming languages (e.g., Python, PyTorch, TensorFlow versions) that would be needed for reproducibility. |
| Experiment Setup | Yes | To demonstrate the effectiveness of our method across different language model families and scales, we consider GPT-2 XL (1.5B) (Radford et al., 2019) and OPT-6.7B (Zhang et al., 2022) as the base models for all decoding methods. For baselines, we follow the hyper-parameter settings in the original papers, which are shown to work well in general. For DAEMON in the main results, we use the nine metrics (described in Section 3.2) in the constraints. During sampling, we set the size of the candidate set from the proposal model to M = 25, as it balances efficiency and performance. We set τ = 0.97 for the Wikipedia domain and τ = 0.99 for the News domain. We leave more implementation details of the baselines and DAEMON in Appendix J.2. |
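
The conditional sampling step referenced in the Pseudocode row (Algorithm 2) follows the standard sampling-importance-resampling pattern: draw M candidate continuations from a proposal language model, reweight them, and resample one. The sketch below is illustrative only, not the authors' released code: `proposal_model`, `tokenizer`, and `energy_fn` are placeholder names, and it assumes the target decoding distribution is the proposal distribution reweighted by exp(-energy), so that the proposal density cancels and the self-normalized importance weights reduce to a softmax over the negative candidate energies.

```python
import torch


def sir_decode(proposal_model, tokenizer, prefix, energy_fn, M=25, max_new_tokens=256):
    """Sampling-importance-resampling (SIR) style conditional decoding (sketch).

    `energy_fn` is a placeholder scoring function; the paper's exact energy
    parameterization is not reproduced here.
    """
    inputs = tokenizer(prefix, return_tensors="pt")

    # 1. Propose M candidate continuations from the base language model.
    candidates = proposal_model.generate(
        **inputs,
        do_sample=True,
        num_return_sequences=M,
        max_new_tokens=max_new_tokens,
    )
    texts = tokenizer.batch_decode(candidates, skip_special_tokens=True)

    # 2. Self-normalized importance weights. Assuming the target distribution is
    #    proportional to proposal(x) * exp(-E(x)), the proposal density cancels
    #    and the weights are proportional to exp(-E(x)).
    energies = torch.tensor([energy_fn(t) for t in texts])
    weights = torch.softmax(-energies, dim=0)

    # 3. Resample one continuation in proportion to its weight.
    idx = torch.multinomial(weights, num_samples=1).item()
    return texts[idx]
```

With M = 25 candidates per prefix, as in the Experiment Setup row, each output costs roughly 25 sampled continuations plus one scoring pass per candidate, which is the efficiency/quality trade-off the paper alludes to when choosing M.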
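
The preprocessing quoted in the Dataset Splits row (a 512-example development set, 32-word prefixes, references truncated to 256 subwords) can be written as a short script. This is a sketch under those stated numbers, not the authors' pipeline; in particular, treating the remainder of each document as the human reference is an assumption made here for illustration.

```python
import random

from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")


def build_splits(documents, prefix_words=32, max_ref_subwords=256, dev_size=512, seed=0):
    """Illustrative preprocessing: 512-example dev split, 32-word prefixes,
    and references truncated to 256 subword tokens for evaluation."""
    rng = random.Random(seed)
    docs = list(documents)
    rng.shuffle(docs)
    dev_docs, test_docs = docs[:dev_size], docs[dev_size:]

    def to_example(doc):
        words = doc.split()
        prefix = " ".join(words[:prefix_words])
        # Assumption: the human reference is the rest of the document.
        reference = " ".join(words[prefix_words:])
        # Truncate the reference to max_ref_subwords so human text and model
        # completions are compared at the same maximum length.
        ref_ids = tokenizer(reference)["input_ids"][:max_ref_subwords]
        return {"prefix": prefix, "reference": tokenizer.decode(ref_ids)}

    return [to_example(d) for d in dev_docs], [to_example(d) for d in test_docs]
```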