A Distributional Approach to Controlled Text Generation
Authors: Muhammad Khalifa, Hady Elsahar, Marc Dymetman
ICLR 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct a first set of experiments over pointwise constraints showing the advantages of our approach over a set of baselines, in terms of obtaining a controlled LM balancing constraint satisfaction with divergence from the initial LM. We then perform experiments over distributional constraints, a unique feature of our approach, demonstrating its potential as a remedy to the problem of Bias in Language Models. Through an ablation study, we show the effectiveness of our adaptive technique for obtaining faster convergence. |
| Researcher Affiliation | Collaboration | Muhammad Khalifa (Cairo University); Hady Elsahar (Naver Labs Europe); Marc Dymetman (Naver Labs Europe); {hady.elsahar,marc.dymetman}@naverlabs.com; m.khalifa@grad.fci-cu.edu.eg |
| Pseudocode | Yes | Algorithm 1 Computing λ |
| Open Source Code | Yes | Code available on https://github.com/naver/gdc |
| Open Datasets | Yes | For distributional and hybrid experiments, we fine-tune GPT-2 small (117M params) to produce biographies on a dataset of 700K Wikipedia biographies (Lebret et al., 2016) which we refer to as GPT-2bio. |
| Dataset Splits | Yes | We end up with a total of 4600 samples out of which we use 500 for validation and the rest for fine-tuning. |
| Hardware Specification | Yes | Each training required 2 Nvidia V100 GPUs, the longest model took 72 hours to train. |
| Software Dependencies | No | The paper mentions software like PyTorch and Hugging Face library, but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | A list of the hyperparameters used for GDC and baselines is given in Table 5. K refers to the number of gradient steps per iteration in Algorithm 2, N refers to the number of samples required, µ_tolerance is the minimum tolerated error ‖µ̄ − µ̂(λ)‖₂² while optimizing λ, and the λ learning rate is the SGD step size for updating λ in Algorithm 1. (A toy sketch of this λ update is given below the table.) |
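As a reading aid for the Algorithm 1 and Table 5 entries above, here is a minimal, self-contained sketch of the λ update they describe: estimate the EBM's feature moments µ̂(λ) by self-normalized importance sampling over samples from the base LM, and run SGD on ‖µ̄ − µ̂(λ)‖₂² until it drops below the tolerance. This is not the authors' implementation (that lives at https://github.com/naver/gdc); the feature matrix `phi`, the target moments `mu_bar`, and all hyperparameter values below are illustrative placeholders.

```python
# Illustrative sketch of Algorithm 1 ("Computing lambda"), NOT the official
# implementation (see https://github.com/naver/gdc). Samples from the base LM
# a(x) are reweighted by exp(lambda . phi(x)), and lambda is tuned so that the
# self-normalized feature moments match the target moments mu_bar.
import torch

torch.manual_seed(0)

N = 10_000           # stand-in for "N, the number of samples required"
# phi(x_i): binary features of N samples drawn from the base LM (placeholder
# data; base rates 0.07 and 0.30 are arbitrary choices for this toy example).
phi = (torch.rand(N, 2) < torch.tensor([0.07, 0.30])).float()
mu_bar = torch.tensor([0.50, 0.50])   # target distributional moments (assumed)

lam = torch.zeros(2, requires_grad=True)
opt = torch.optim.SGD([lam], lr=0.5)  # "lambda learning rate" (placeholder value)
tolerance = 1e-4                      # "mu tolerance" (placeholder value)

for step in range(10_000):
    opt.zero_grad()
    # Importance weights w_i proportional to P(x_i)/a(x_i) = exp(lambda . phi(x_i)),
    # self-normalized over the sample batch.
    w = torch.softmax(phi @ lam, dim=0)
    mu_hat = w @ phi                              # estimated moments under the EBM
    err = torch.sum((mu_bar - mu_hat) ** 2)       # ||mu_bar - mu_hat(lambda)||_2^2
    if err.item() < tolerance:
        break
    err.backward()
    opt.step()

print(f"stopped at step {step}: lambda={lam.data.tolist()}, mu_hat={mu_hat.data.tolist()}")
```

In the paper, the λ found this way defines the target EBM P(x) = a(x)·exp(λ·φ(x)), which the KL-adaptive DPG training loop (Algorithm 2, with K gradient steps per iteration) then approximates with an autoregressive policy.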