Controlled Decoding from Language Models

Authors: Sidharth Mudgal, Jong Lee, Harish Ganapathy, Yaguang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, Jilin Chen, Alex Beutel, Ahmad Beirami

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We empirically demonstrate that CD is effective as a control mechanism on popular benchmarks. We also show that prefix scorers for multiple rewards may be combined at inference time, effectively solving a multi-objective RL problem with no additional training. (A sketch of this inference-time combination appears after the table.)
Researcher Affiliation | Industry | 1Google DeepMind, 2OpenAI (work done at Google). Correspondence to: Sidharth Mudgal <sidharthms@google.com>, Jong Lee <leejong@google.com>, Ahmad Beirami <beirami@google.com>.
Pseudocode | No | No explicitly labeled pseudocode or algorithm blocks were found in the paper.
Open Source Code | No | The paper provides neither an explicit statement about nor a link to open-source code for the described methodology.
Open Datasets | Yes | DSTC8 Reddit conversations corpus (Microsoft, 2019) is a dataset containing millions of multi-turn conversations from Reddit threads. Anthropic HH (Bai et al., 2022) is a helpfulness and harmlessness benchmark... TL;DR (Stiennon et al., 2020) is a dataset of Reddit posts...
Dataset Splits | No | The paper refers to 'train' and 'test' sets and to 'eval accuracy', but it does not give split percentages, sample counts, or details of how the splits (including any validation set) were constructed, all of which would be needed for reproduction.
Hardware Specification | No | The paper does not explicitly describe the hardware (e.g., GPU models, CPU types, or TPU versions) used to run its experiments.
Software Dependencies | No | The paper mentions models such as PaLM 2-XXS but does not give version numbers for the programming languages, libraries, or other software dependencies needed to reproduce the experiments.
Experiment Setup | Yes | The paper states: "All methods were trained for half an epoch and evaluated on the number of tokens in the generation using the eval set of conversations corpus."; "Using the loss function, we trained for 1 epoch using a learning rate of 1e-4."; "We performed the training for 1 epoch with a learning rate of 1e-5." (A minimal training-loop sketch consistent with these settings follows the table.)
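
The Research Type row quotes the paper's claim that prefix scorers for multiple rewards can be combined at inference time with no additional training. Below is a minimal, hypothetical sketch of what such token-wise combination could look like: a frozen base LM's next-token logits are shifted by a weighted sum of per-reward prefix-scorer values and re-normalized. The function name, tensor shapes, and dummy inputs are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def cd_tokenwise_step(base_logits, prefix_scores, weights):
    """One token-wise controlled-decoding step (illustrative sketch only).

    base_logits   -- [vocab] next-token logits from the frozen base LM.
    prefix_scores -- list of [vocab] tensors; prefix_scores[i][v] is the
                     i-th prefix scorer's reward estimate for the current
                     prefix extended by token v.
    weights       -- per-reward mixing weights, chosen at inference time.
    """
    adjusted = base_logits.clone()
    for w, scores in zip(weights, prefix_scores):
        adjusted += w * scores            # linearly combine value estimates
    probs = F.softmax(adjusted, dim=-1)   # re-normalize over the vocabulary
    return torch.multinomial(probs, num_samples=1).item()

# Toy usage with random stand-ins for two reward models' prefix scorers.
vocab = 8
base_logits = torch.randn(vocab)
scorers = [torch.randn(vocab), torch.randn(vocab)]
next_token = cd_tokenwise_step(base_logits, scorers, weights=[0.5, 0.5])
```

Because the combination happens only in the sampling distribution, changing the weights trades off the rewards without retraining any model, which is the multi-objective claim the row summarizes.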
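
The Experiment Setup row quotes learning rates (1e-4, 1e-5) and epoch counts, but since no code is released, the loop below is only a guessed reconstruction: a toy value model regresses sequence-level rewards from token prefixes with a squared-error loss. PrefixScorer, the dummy data, and the loss choice are assumptions for illustration, not the paper's implementation.

```python
import torch
from torch import nn

class PrefixScorer(nn.Module):
    """Toy value model: embeds a token prefix and predicts expected reward.
    A hypothetical stand-in for the small LM backbone (e.g., PaLM 2-XXS)."""
    def __init__(self, vocab_size=32_000, dim=64):
        super().__init__()
        self.embed = nn.EmbeddingBag(vocab_size, dim)  # mean-pools the prefix
        self.value_head = nn.Linear(dim, 1)

    def forward(self, token_ids):                      # token_ids: [batch, len]
        return self.value_head(self.embed(token_ids)).squeeze(-1)

model = PrefixScorer()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)    # one quoted setting

# Dummy (prefix, sequence-level reward) pairs standing in for real data.
prefixes = torch.randint(0, 32_000, (8, 16))
rewards = torch.rand(8)

for _ in range(1):                                     # the quoted single epoch
    opt.zero_grad()
    loss = torch.mean((model(prefixes) - rewards) ** 2)  # squared value loss
    loss.backward()
    opt.step()
```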