Adaptable Logical Control for Large Language Models
Authors: Honghua Zhang, Po-Nien Kung, Masahiro Yoshida, Guy Van den Broeck, Nanyun Peng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Ctrl-G on the task of text editing: in the domain of story writing, we evaluate models' ability to generate suggestions for text insertions/continuations under combinations of logical constraints (e.g., keyphrase inclusion and length control; see Fig. 2). Human evaluation shows that Ctrl-G, where a TULU2-7B model [13] is combined with a 2B-parameter HMM, outperforms prominent LLMs including GPT3.5 and GPT4 [27] by over 30% in overall satisfaction rate (i.e., the percentage of generated text that is not only fluent but also satisfies the constraints; a sketch of this metric follows the table). We evaluate Ctrl-G on the Commonsense Generation (CommonGen) benchmark [18]. We also evaluate Ctrl-G on a text infilling benchmark [7]. |
| Researcher Affiliation | Collaboration | Honghua Zhang (UCLA, hzhang19@cs.ucla.edu); Po-Nien Kung (UCLA, ponienkung@cs.ucla.edu); Masahiro Yoshida (UCLA & Sony Group Corporation, masahiroyoshida@ucla.edu); Guy Van den Broeck (UCLA, guyvdb@cs.ucla.edu); Nanyun Peng (UCLA, violetpeng@cs.ucla.edu) |
| Pseudocode | Yes | Algorithm 1: Ctrl-G: sampling n tokens. Algorithm 1 shows the pseudo-code for sampling from p_Ctrl-G(x_{1:n} \| α) autoregressively, using the recurrence relations above (see the sampling sketch after this table). |
| Open Source Code | Yes | Code available at https://github.com/joshuacnf/Ctrl-G. |
| Open Datasets | Yes | We first evaluate Ctrl-G on the Commonsense Generation (CommonGen) benchmark [18]. We also evaluate Ctrl-G on a text infilling benchmark [7] constructed from the ROCStories corpus [26]. |
| Dataset Splits | Yes | We construct an evaluation dataset consisting of 800 test examples, each based on one story passage extracted from the CoAuthor dataset [16]. We randomly select 100 examples with 5 concepts from the dev split of CommonGen, and then augment them with additional keywords sampled from their reference sentences. |
| Hardware Specification | Yes | The runtime measurements are conducted on an NVIDIA-A100 GPU with 80GB memory. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We adopt the TULU2-7B [13] model, which is an instruction-tuned variant of the Llama2 [39] model with 7 billion parameters, as the base model for Ctrl-G. We further finetune the base model on 3000 examples extracted from the WritingPrompts dataset [8] for the task of text continuation. After finetuning, we use the same prompt to sample 5 million examples from the base model and train an HMM with 32768 hidden states (approx. 2 billion parameters). For generation, we sample 128 examples from p_Ctrl-G with temperature 0.7 and pick the one with the highest likelihood given by the base model as the final output (a best-of-n sketch follows the table). We use the GPT2-large checkpoint (only finetuned for domain adaptation) released by [47] as our base model, and we follow the same pipeline to distill an HMM with 32768 hidden states: we sample 4M examples from the base model and train the HMM for 40 EM steps, each consisting of 100K examples. |
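
The overall satisfaction rate cited in the Research Type row is a simple aggregate over human judgments. Below is a minimal sketch, assuming each annotated output is a dict with boolean `fluent` and `satisfies_constraints` fields; this schema is a hypothetical illustration, not the paper's annotation format.

```python
def overall_satisfaction_rate(annotations):
    """Fraction of outputs judged BOTH fluent and constraint-satisfying.

    `annotations` is assumed to be a list of dicts with boolean keys
    'fluent' and 'satisfies_constraints' (hypothetical schema).
    """
    ok = sum(1 for a in annotations if a["fluent"] and a["satisfies_constraints"])
    return ok / len(annotations)
```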
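
Algorithm 1 itself is not reproduced here, but the general shape of the constrained autoregressive sampling it describes can be sketched as follows. This is a minimal sketch: it assumes a `base_lm` callable returning next-token logits and a `guide` object whose `success_prob` method returns, for every candidate next token, the HMM-estimated probability that the constraint α can still be satisfied. Both interfaces are illustrative assumptions, not the released code's API, and the paper's HMM recurrence relations are abstracted away.

```python
import torch

def sample_guided(base_lm, guide, prompt_ids, n_tokens, temperature=1.0):
    """Sketch of guided autoregressive sampling (assumed interfaces).

    base_lm(ids)          -> next-token logits, shape [vocab]   (assumption)
    guide.success_prob(ids) -> per-token probability that the constraint
                               remains satisfiable, shape [vocab] (assumption)
    """
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        logits = base_lm(torch.tensor([ids]))             # base-model logits for the next token
        p_lm = torch.softmax(logits / temperature, dim=-1)  # base next-token distribution
        p_sat = guide.success_prob(torch.tensor([ids]))   # constraint-satisfaction probability per token
        p = p_lm * p_sat                                  # reweight the base distribution by the guide
        next_id = torch.multinomial(p / p.sum(), 1).item()  # renormalize and sample one token
        ids.append(next_id)
    return ids
```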
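
The generation recipe in the Experiment Setup row (sample 128 candidates at temperature 0.7, keep the one the base model scores highest) amounts to standard best-of-n selection. In the sketch below, `sample_fn` stands in for the constrained sampler and `score_lm` for a base model returning per-position logits; both names are assumptions for illustration, not the paper's actual interfaces.

```python
import torch

def best_of_n(sample_fn, score_lm, prompt_ids, n=128, temperature=0.7):
    """Sample n candidates and keep the one with the highest base-model likelihood.

    sample_fn(prompt_ids, temperature) -> token id list  (placeholder for constrained sampling)
    score_lm(ids)                      -> logits, shape [len(ids), vocab]  (placeholder base model)
    """
    candidates = [sample_fn(prompt_ids, temperature) for _ in range(n)]

    def base_loglik(ids):
        logits = score_lm(torch.tensor(ids))            # per-position next-token logits
        logp = torch.log_softmax(logits[:-1], dim=-1)   # position t predicts token t+1
        targets = torch.tensor(ids[1:]).unsqueeze(-1)   # shifted targets
        return logp.gather(-1, targets).sum().item()    # total log-probability

    # Return the candidate the base model finds most likely.
    return max(candidates, key=base_loglik)
```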