A Causal Lens for Controllable Text Generation
Authors: Zhiting Hu, Li Erran Li
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show significant superiority of the causal approach over previous conditional models for improved control accuracy and reduced bias. |
| Researcher Affiliation | Collaboration | Zhiting Hu (UC San Diego; AWS AI, Amazon), Li Erran Li (AWS AI, Amazon) |
| Pseudocode | No | The paper does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not contain any explicit statement about releasing open-source code for the described methodology or a link to a code repository. |
| Open Datasets | Yes | Our first dataset is derived from the YELP challenge that contains customer reviews of different categories. Sentiment (1:positive vs. 0:negative) is the attribute we aim to control, and the category of review object (1:restaurant vs. 0:others) is the confounding factor. Specifically, we extract a subset of data where 90% of restaurant reviews are of positive sentiment, while 90% of reviews of other entities (e.g., shopping) are of negative sentiment (thus a 90% correlation strength). We keep the category labels for less than 2% of training data. The resulting data has 510K/6K training/validation examples, wherein 10K training examples have observable confounding category labels. For evaluation, we further create a balanced test set of 13K examples with correlation strength 50% (i.e., no correlation). Following the previous controllable generation [22, 64], we focus on generating short text, by truncating the output text in the data to 20 tokens at maximum. The second dataset is from the BIOS corpus [10] that contains online biographies with gender and occupation labels. |
| Dataset Splits | Yes | The resulting data has 510K/6K training/validation examples... The second dataset is from the BIOS corpus [10]... We randomly split the dataset into 43K training and 2K validation examples, and keep the binary occupation labels for only 3K randomly selected training examples (among which only 5% × 3K = 150 examples have opposite gender and occupation labels). As above, we further create a balanced test set of 2K examples for evaluation, and truncate the output text to no more than 20 tokens. |
| Hardware Specification | Yes | All experiments were conducted on 8 Tesla V100 GPUs. |
| Software Dependencies | No | The paper mentions using the GPT-2 architecture and the AdamW optimizer but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks, or libraries. |
| Experiment Setup | Yes | The model is trained with the AdamW optimizer [42] using an initial learning rate of 1e-6. ... In the objective, λ_a, λ_c, and λ_KL > 0 are balancing hyperparameters. We set λ_c to 0 when proxy c is not available, and otherwise select from {0.01, 0.1, 1} based on validation, same as λ_a. We use the cyclic schedule from [36] to anneal λ_KL from 0 to 1 to avoid excessive regularization of the KL term. ... In practice, we found the model is not sensitive to the choices of those hyperparameters. We set each of them to either 0.5 or 1.0 based on validation. |
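The cyclic λ_KL annealing mentioned in the Experiment Setup row (attributed to [36]) can be sketched as below. This is a minimal illustrative implementation, not the authors' code: the number of cycles (`n_cycles`) and the fraction of each cycle spent ramping (`ramp_ratio`) are assumed hyperparameters, since the paper only states that λ_KL is annealed from 0 to 1 cyclically.

```python
def cyclic_kl_weight(step: int, total_steps: int,
                     n_cycles: int = 4, ramp_ratio: float = 0.5) -> float:
    """Cyclic annealing of the KL weight λ_KL.

    Within each cycle, the weight ramps linearly from 0 to 1 over the
    first `ramp_ratio` fraction of the cycle, then stays at 1 for the
    remainder. `n_cycles` and `ramp_ratio` are assumed values for
    illustration, not taken from the paper.
    """
    period = total_steps / n_cycles
    pos = (step % period) / period      # position within the current cycle, in [0, 1)
    return min(pos / ramp_ratio, 1.0)   # linear ramp, then clamp at 1


# Example: 100 training steps, 4 cycles of 25 steps each.
# λ_KL restarts at 0 at the beginning of every cycle and
# reaches 1 halfway through it.
schedule = [cyclic_kl_weight(s, total_steps=100) for s in range(100)]
```

At each training step, the returned weight would multiply the KL term of the VAE-style objective before backpropagation, which is how the quoted setup avoids over-regularizing the KL term early in training.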