A Causal Lens for Controllable Text Generation

Authors: Zhiting Hu, Li Erran Li

NeurIPS 2021

Reproducibility Variable Result LLM Response
Research Type Experimental Experiments show significant superiority of the causal approach over previous conditional models for improved control accuracy and reduced bias.
Researcher Affiliation Collaboration Zhiting Hu (UC San Diego; AWS AI, Amazon), Li Erran Li (AWS AI, Amazon)
Pseudocode No The paper does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code No The paper does not contain any explicit statement about releasing open-source code for the described methodology or a link to a code repository.
Open Datasets Yes Our first dataset is derived from the YELP challenge that contains customer reviews of different categories. Sentiment (1:positive vs. 0:negative) is the attribute we aim to control, and the category of review object (1:restaurant vs. 0:others) is the confounding factor. Specifically, we extract a subset of data where 90% of restaurant reviews are of positive sentiment, while 90% of reviews of other entities (e.g., shopping) are of negative sentiment (thus a 90% correlation strength). We keep the category labels for less than 2% of training data. The resulting data has 510K/6K training/validation examples, wherein 10K training examples have observable confounding category labels. For evaluation, we further create a balanced test set of 13K examples with correlation strength 50% (i.e., no correlation). Following the previous controllable generation [22, 64], we focus on generating short text, by truncating the output text in the data to 20 tokens at maximum. The second dataset is from the BIOS corpus [10] that contains online biographies with gender and occupation labels.
Dataset Splits Yes The resulting data has 510K/6K training/validation examples... The second dataset is from the BIOS corpus [10]... We randomly split the dataset into 43K training and 2K validation examples, and keep the binary occupation labels for only 3K randomly selected training examples (among which only 5% × 3K = 150 examples have opposite gender and occupation labels). As above, we further create a balanced test set of 2K examples for evaluation, and truncate the output text to no more than 20 tokens.
Hardware Specification Yes All experiments were conducted on 8 Tesla V100 GPUs.
Software Dependencies No The paper mentions using the GPT-2 architecture and the AdamW optimizer but does not provide specific version numbers for software dependencies such as programming languages, deep learning frameworks, or libraries.
Experiment Setup Yes The model is trained with the AdamW optimizer [42] using an initial learning rate of 1e-6. ... In the objective, λa, λc, and λkl > 0 are balancing hyperparameters. We set λc to 0 when proxy c is not available, and otherwise select from {0.01, 0.1, 1} based on validation, same as λa. We use the cyclic schedule from [36] to anneal λkl from 0 to 1 to avoid excessive regularization of the KL term. ... In practice, we found the model is not sensitive to the choices of those hyperparameters. We set each of them to either 0.5 or 1.0 based on validation.
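The cyclic KL annealing mentioned in the setup (reference [36] is the cyclical annealing schedule of Fu et al., 2019) can be sketched as below. The function name, the number of cycles, and the ramp ratio are assumptions; the paper does not state its exact schedule parameters.

```python
def cyclic_kl_weight(step, total_steps, n_cycles=4, ratio=0.5):
    """Cyclical annealing schedule for the KL weight (Fu et al., 2019).

    Within each cycle the weight ramps linearly from 0 to 1 over the
    first `ratio` fraction of the cycle, then stays at 1 for the rest.
    Sketch only: n_cycles and ratio are illustrative defaults.
    """
    cycle_len = total_steps / n_cycles
    pos = (step % cycle_len) / cycle_len  # position in current cycle, in [0, 1)
    return min(pos / ratio, 1.0)
```

Restarting the weight at 0 each cycle lets the encoder repeatedly learn informative latents before the KL term pulls the posterior back toward the prior, which is the "excessive regularization" the quoted setup is avoiding.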