Adaptable Logical Control for Large Language Models
Authors: Honghua Zhang, Po-Nien Kung, Masahiro Yoshida, Guy Van den Broeck, Nanyun Peng
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate Ctrl-G on the task of text editing: in the domain of story writing, we evaluate models' ability to generate suggestions for text insertions/continuations under combinations of logical constraints (e.g., keyphrase inclusion and length control; see Fig. 2). Human evaluation shows that Ctrl-G, where a TULU2-7B model [13] is combined with a 2B-parameter HMM, outperforms prominent LLMs including GPT3.5 and GPT4 [27] by over 30% in overall satisfaction rate (i.e., the percentage of generated text that is not only fluent but also satisfies the constraints; a sketch of this metric follows the table). We evaluate Ctrl-G on the Commonsense Generation (CommonGen) benchmark [18]. We also evaluate Ctrl-G on a text infilling benchmark [7]. |
| Researcher Affiliation | Collaboration | Honghua Zhang (UCLA, hzhang19@cs.ucla.edu); Po-Nien Kung (UCLA, ponienkung@cs.ucla.edu); Masahiro Yoshida (UCLA & Sony Group Corporation, masahiroyoshida@ucla.edu); Guy Van den Broeck (UCLA, guyvdb@cs.ucla.edu); Nanyun Peng (UCLA, violetpeng@cs.ucla.edu) |
| Pseudocode | Yes | Algorithm 1: Ctrl-G: sampling n tokens. Algorithm 1 shows the pseudo-code for sampling from p_Ctrl-G(x_{1:n} \| α) autoregressively, using the recurrence relations above (see the sampling sketch after this table). |
| Open Source Code | Yes | Code available at https://github.com/joshuacnf/Ctrl-G. |
| Open Datasets | Yes | We first evaluate Ctrl-G on the Commonsense Generation (CommonGen) benchmark [18]. We also evaluate Ctrl-G on a text infilling benchmark [7] constructed from the ROCStories corpus [26]. |
| Dataset Splits | Yes | We construct an evaluation dataset consisting of 800 test examples, each based on one story passage extracted from the CoAuthor dataset [16]. We randomly select 100 examples with 5 concepts from the dev split of CommonGen, and then augment them with additional keywords sampled from their reference sentences. |
| Hardware Specification | Yes | The runtime measurements are conducted on an NVIDIA-A100 GPU with 80GB memory. |
| Software Dependencies | No | The paper does not explicitly list specific software dependencies with their version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | We adopt the TULU2-7B [13] model, which is an instruction-tuned variant of the Llama2 [39] model with 7 billion parameters, as the base model for Ctrl-G. We further finetune the base model on 3000 examples extracted from the WritingPrompts dataset [8] for the task of text continuation. After finetuning, we use the same prompt to sample 5 million examples from the base model and train an HMM with 32768 hidden states (approx. 2 billion parameters). For generation, we sample 128 examples from p_Ctrl-G with temperature 0.7 and pick the one with the highest likelihood given by the base model as the final output (a best-of-n sketch follows the table). We use the GPT2-large checkpoint (only finetuned for domain adaptation) released by [47] as our base model, and we follow the same pipeline to distill an HMM with 32768 hidden states: we sample 4M examples from the base model and train the HMM for 40 EM steps, each consisting of 100K examples. |
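
The overall satisfaction rate cited in the Research Type row is a simple aggregate over human judgments. Below is a minimal sketch, assuming each annotated output is a dict with boolean `fluent` and `satisfies_constraints` fields; this schema is a hypothetical illustration, not the paper's annotation format.

```python
def overall_satisfaction_rate(annotations):
    """Fraction of outputs judged BOTH fluent and constraint-satisfying.

    `annotations` is assumed to be a list of dicts with boolean keys
    'fluent' and 'satisfies_constraints' (hypothetical schema).
    """
    ok = sum(1 for a in annotations if a["fluent"] and a["satisfies_constraints"])
    return ok / len(annotations)
```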
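
Algorithm 1 itself is not reproduced here, but the general shape of the constrained autoregressive sampling it describes can be sketched as follows. This is a minimal sketch: it assumes a `base_lm` callable returning next-token logits and a `guide` object whose `success_prob` method returns, for every candidate next token, the HMM-estimated probability that the constraint α can still be satisfied. Both interfaces are illustrative assumptions, not the released code's API, and the paper's HMM recurrence relations are abstracted away.

```python
import torch

def sample_guided(base_lm, guide, prompt_ids, n_tokens, temperature=1.0):
    """Sketch of guided autoregressive sampling (assumed interfaces).

    base_lm(ids)          -> next-token logits, shape [vocab]   (assumption)
    guide.success_prob(ids) -> per-token probability that the constraint
                               remains satisfiable, shape [vocab] (assumption)
    """
    ids = list(prompt_ids)
    for _ in range(n_tokens):
        logits = base_lm(torch.tensor([ids]))             # base-model logits for the next token
        p_lm = torch.softmax(logits / temperature, dim=-1)  # base next-token distribution
        p_sat = guide.success_prob(torch.tensor([ids]))   # constraint-satisfaction probability per token
        p = p_lm * p_sat                                  # reweight the base distribution by the guide
        next_id = torch.multinomial(p / p.sum(), 1).item()  # renormalize and sample one token
        ids.append(next_id)
    return ids
```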
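
The generation recipe in the Experiment Setup row (sample 128 candidates at temperature 0.7, keep the one the base model scores highest) amounts to standard best-of-n selection. In the sketch below, `sample_fn` stands in for the constrained sampler and `score_lm` for a base model returning per-position logits; both names are assumptions for illustration, not the paper's actual interfaces.

```python
import torch

def best_of_n(sample_fn, score_lm, prompt_ids, n=128, temperature=0.7):
    """Sample n candidates and keep the one with the highest base-model likelihood.

    sample_fn(prompt_ids, temperature) -> token id list  (placeholder for constrained sampling)
    score_lm(ids)                      -> logits, shape [len(ids), vocab]  (placeholder base model)
    """
    candidates = [sample_fn(prompt_ids, temperature) for _ in range(n)]

    def base_loglik(ids):
        logits = score_lm(torch.tensor(ids))            # per-position next-token logits
        logp = torch.log_softmax(logits[:-1], dim=-1)   # position t predicts token t+1
        targets = torch.tensor(ids[1:]).unsqueeze(-1)   # shifted targets
        return logp.gather(-1, targets).sum().item()    # total log-probability

    # Return the candidate the base model finds most likely.
    return max(candidates, key=base_loglik)
```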