Adaptable Logical Control for Large Language Models

Authors: Honghua Zhang, Po-Nien Kung, Masahiro Yoshida, Guy Van den Broeck, Nanyun Peng

NeurIPS 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate Ctrl-G on the task of text editing: in the domain of story writing, we evaluate models' ability to generate suggestions for text insertions/continuations under combinations of logical constraints (e.g., keyphrase inclusion and length control; see Fig. 2). Human evaluation shows that Ctrl-G, where a TULU2-7B model [13] is combined with a 2B-parameter HMM, outperforms prominent LLMs including GPT3.5 and GPT4 [27] by over 30% in overall satisfaction rate (i.e., the percentage of generated text that is not only fluent but also satisfies the constraints). We evaluate Ctrl-G on the Commonsense Generation (CommonGen) benchmark [18]. We also evaluate Ctrl-G on a text infilling benchmark [7].
Researcher Affiliation | Collaboration | Honghua Zhang (UCLA, hzhang19@cs.ucla.edu); Po-Nien Kung (UCLA, ponienkung@cs.ucla.edu); Masahiro Yoshida (UCLA & Sony Group Corporation, masahiroyoshida@ucla.edu); Guy Van den Broeck (UCLA, guyvdb@cs.ucla.edu); Nanyun Peng (UCLA, violetpeng@cs.ucla.edu)
Pseudocode | Yes | Algorithm 1 (Ctrl-G: sampling n tokens) gives the pseudo-code for sampling from p_ctrl-g(x_{1:n} | α) autoregressively, using the recurrence relations above. (A toy sketch of this sampling loop is included after the table.)
Open Source Code | Yes | Code available at https://github.com/joshuacnf/Ctrl-G.
Open Datasets | Yes | We first evaluate Ctrl-G on the Commonsense Generation (CommonGen) benchmark [18]. We also evaluate Ctrl-G on a text infilling benchmark [7] constructed from the ROCStories corpus [26].
Dataset Splits | Yes | We construct an evaluation dataset consisting of 800 test examples, each based on one story passage extracted from the CoAuthor dataset [16]. We randomly select 100 examples with 5 concepts from the dev split of CommonGen, and then augment them with additional keywords sampled from their reference sentences.
Hardware Specification | Yes | The runtime measurements are conducted on an NVIDIA A100 GPU with 80GB of memory.
Software Dependencies | No | The paper does not explicitly list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions).
Experiment Setup | Yes | We adopt the TULU2-7B [13] model, an instruction-tuned variant of the Llama2 [39] model with 7 billion parameters, as the base model for Ctrl-G. We further finetune the base model on 3000 examples extracted from the WritingPrompts dataset [8] for the task of text continuation. After finetuning, we use the same prompt to sample 5 million examples from the base model and train an HMM with 32768 hidden states (approx. 2 billion parameters). For generation, we sample 128 examples from p_ctrl-g with temperature 0.7 and pick the one with the highest likelihood given by the base model as the final output. We use the GPT2-large checkpoint (only finetuned for domain adaptation) released by [47] as our base model and follow the same pipeline to distill an HMM with 32768 hidden states: we sample 4M examples from the base model and train the HMM for 40 EM steps, each consisting of 100K examples.
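
The pseudocode row above refers to Algorithm 1, which samples from p_ctrl-g one token at a time. The following toy sketch (not the authors' implementation) illustrates that style of HMM-guided sampling: at each step, the base LM's next-token distribution is reweighted by an estimate of the probability that the logical constraint α can still be satisfied. `lm_next_token_probs` and `hmm_constraint_prob` are hypothetical stand-ins for the paper's base LM and distilled HMM, and the constraint scores here are faked with random weights so the example runs end to end.

```python
# Toy sketch of HMM-guided autoregressive sampling in the spirit of Algorithm 1.
# NOT the authors' implementation: the LM and HMM are replaced by random toys.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 50  # toy vocabulary size

# Fake per-token scores standing in for the HMM's estimate that the logical
# constraint alpha can still be satisfied after emitting each token.
CONSTRAINT_WEIGHT = rng.uniform(0.01, 1.0, size=VOCAB)


def lm_next_token_probs(prefix):
    """Stand-in for the base LM's next-token distribution p(x_t | x_<t)."""
    logits = rng.normal(size=VOCAB)
    probs = np.exp(logits - logits.max())
    return probs / probs.sum()


def hmm_constraint_prob(prefix, token):
    """Stand-in for the HMM's estimate of P(alpha holds | prefix + token)."""
    return CONSTRAINT_WEIGHT[token]


def sample_n_tokens(n):
    """Sample n tokens, reweighting the LM distribution by constraint scores."""
    prefix = []
    for _ in range(n):
        p_lm = lm_next_token_probs(prefix)
        p_guided = np.array(
            [p_lm[v] * hmm_constraint_prob(prefix, v) for v in range(VOCAB)]
        )
        p_guided /= p_guided.sum()  # renormalize the guided distribution
        prefix.append(int(rng.choice(VOCAB, p=p_guided)))
    return prefix


print(sample_n_tokens(8))
```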
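
The experiment-setup row also mentions a best-of-n selection step: sample many candidates at temperature 0.7 and keep the one the base model scores as most likely. The sketch below illustrates only that reranking step under loose assumptions: it samples directly from an off-the-shelf gpt2 checkpoint rather than from the guided distribution p_ctrl-g, uses a placeholder prompt, and draws 8 candidates instead of 128 to keep the example cheap.

```python
# Hedged sketch of best-of-n selection: sample candidates at temperature 0.7,
# then keep the one with the highest likelihood under the (stand-in) base model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder for the finetuned TULU2-7B base model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

prompt = "Continue the story:"  # placeholder prompt
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    candidates = model.generate(
        **inputs,
        do_sample=True,
        temperature=0.7,
        max_new_tokens=40,
        num_return_sequences=8,  # the paper samples 128 candidates
        pad_token_id=tok.eos_token_id,
    )


def base_model_loglik(ids):
    """Approximate total log-likelihood of a full sequence under the model.
    A real pipeline would score only the continuation, not the prompt."""
    with torch.no_grad():
        out = model(ids.unsqueeze(0), labels=ids.unsqueeze(0))
    # `loss` is the mean per-token NLL; convert it to a total log-likelihood.
    return -out.loss.item() * (ids.shape[0] - 1)


best = max(candidates, key=base_model_loglik)
print(tok.decode(best, skip_special_tokens=True))
```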