Controlled Text Generation with Natural Language Instructions

Authors: Wangchunshu Zhou, Yuchen Eleanor Jiang, Ethan Wilcox, Ryan Cotterell, Mrinmaya Sachan

ICML 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of INSTRUCTCTG, we conduct a wide range of experiments with different constraint types including keywords, syntax, semantic, style, and length, by fine-tuning T0 (Sanh et al., 2022), an instruction-tuned text-to-text pre-trained model. Our results show that INSTRUCTCTG achieves a high level of constraint satisfaction that is comparable to or better than existing decoding-based methods.
Researcher Affiliation | Academia | Wangchunshu Zhou, Yuchen Eleanor Jiang, Ethan Wilcox, Ryan Cotterell, Mrinmaya Sachan (ETH Zürich).
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks; procedures are described in narrative text.
Open Source Code | Yes | Our code is available at https://github.com/MichaelZhouwang/InstructCTG.
Open Datasets | Yes | We randomly sample sentences from the C4 dataset (Raffel et al., 2020) to synthesize our keyword-constraint text pairs. The sentences labeled POSITIVE and NEGATIVE are taken from the Amazon Review dataset (He and McAuley, 2016), and the sentences labeled NEUTRAL are drawn randomly from the C4 dataset. Additionally, we use the Political Slant dataset (Voigt et al., 2018) for the political slant constraint and the M2D2 dataset (Reid et al., 2022), which contains diverse topics in Wikipedia and arXiv categories, for topic control. We use Grammarly's Yahoo Answers Formality Corpus (GYAFC) dataset (Rao and Tetreault, 2018) for formality constraints, the politeness transfer dataset collected by Madaan et al. (2020) for politeness control, the FlickrStyle stylized caption dataset (Gan et al., 2017; Li et al., 2018) for style control, the Wiki Neutrality Corpus (Pryzant et al., 2020) for biasedness control, and the PWKP dataset (Zhu et al., 2010) for text simplification. Our experiments consider paraphrase generation and question generation tasks and use the Quora Question Paraphrase dataset and the SQuAD question generation dataset (Rajpurkar et al., 2016; Du and Cardie, 2018), respectively.
Dataset Splits | Yes | Using the process described above, we synthesize 1 million training constraint-text pairs for each category of constraint (lexical, syntactic, semantic, style, and length) and 50,000 pairs each for the development and test sets.
Hardware Specification | No | No specific hardware details (e.g., GPU models, CPU types, or cloud instance specifications) were found. The paper mentions using T0-11B as a base model but does not specify the hardware used for fine-tuning or experiments.
Software Dependencies | No | The paper mentions specific models such as a RoBERTa-based classifier and tools such as spaCy (linking to the en_core_web_sm model at https://spacy.io/models/en#en_core_web_sm) and the Moses tokenizer, but it does not specify version numbers for these dependencies or for the general programming environment (e.g., Python or PyTorch).
Experiment Setup | Yes | We fine-tune our model with the Adam optimizer (Kingma and Ba, 2015) for 100,000 steps with a learning rate of 1e-4, a batch size of 1024 text pairs, a dropout rate of 0.1, and a learning-rate warmup of 8,000 steps. Following Sanh et al. (2022), we perform checkpoint selection by choosing the checkpoint with the highest constraint satisfaction rate (see the next paragraph) on the validation splits of the training-constraint datasets.
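The reported fine-tuning schedule (peak learning rate 1e-4, 8,000 warmup steps, 100,000 total steps) can be sketched as a simple schedule function. This is a minimal illustration, not the authors' code: the paper states the warmup length but not the post-warmup decay, so a constant rate after warmup is an assumption here.

```python
PEAK_LR = 1e-4         # learning rate reported in the paper
WARMUP_STEPS = 8_000   # warmup length reported in the paper
TOTAL_STEPS = 100_000  # total fine-tuning steps reported in the paper

def learning_rate(step: int) -> float:
    """Linear warmup to PEAK_LR over the first 8,000 steps, then constant.
    The constant tail is an assumption; the paper does not state a decay."""
    return PEAK_LR * min(1.0, (step + 1) / WARMUP_STEPS)
```

In a real fine-tuning run, a function of this shape would typically be passed to a scheduler such as `torch.optim.lr_scheduler.LambdaLR` wrapped around an Adam optimizer over the model's parameters.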