Neural Text Generation With Unlikelihood Training
Authors: Sean Welleck, Ilia Kulikov, Stephen Roller, Emily Dinan, Kyunghyun Cho, Jason Weston
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We follow a standard language modeling setup from Baevski and Auli (2019) and evaluate our method on the task of sequence completion, detailed below. We show that both token and sequence level unlikelihood training give less repetitive, less dull text while maintaining perplexity, giving superior generations using standard greedy or beam search. According to human evaluations, our approach with standard beam search also outperforms the currently popular decoding methods of nucleus sampling or beam blocking, thus providing a strong alternative to existing techniques. |
| Researcher Affiliation | Collaboration | New York University, Facebook AI Research, CIFAR Azrieli Global Scholar |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and trained models are available at https://github.com/facebookresearch/unlikelihood_training; implemented with Fairseq (Ott et al., 2019). |
| Open Datasets | Yes | We use the Wikitext-103 dataset (Merity et al., 2016), a large-scale collection of Wikipedia articles containing over 100 million words and 260 thousand unique tokens. |
| Dataset Splits | No | The paper states it uses the Wikitext-103 dataset and evaluates on its validation set, but it does not provide specific percentages or sample counts for the training, validation, and test splits, nor does it cite a source that defines these specific splits for reproducibility. |
| Hardware Specification | No | The paper mentions training on "8 GPUs" and later "a single GPU" due to "GPU memory constraints" but does not specify the model or type of GPUs used (e.g., NVIDIA A100, Tesla V100), or any other specific hardware details like CPU or memory. |
| Software Dependencies | No | The paper states "implemented with Fairseq (Ott et al., 2019)" but does not specify a version number for Fairseq or any other software dependency. |
| Experiment Setup | Yes | We train on fixed-length contiguous sequences, in our case of length 1,536... For the token-level losses (L_MLE, L_UL-token), we train each model on 8 GPUs for a maximum of 150k updates, evaluating on the validation set and saving the model state every 10k updates. Models are fine-tuned for 1,500 total updates. With probability 0.5 an update uses L_ULS... The experiments use a prefix length k = 50 and continuation length N = 100 for fine-tuning. For deterministic decoding we use greedy search and beam search with beam size 10, and for stochastic decoding we use top-k sampling with k ∈ {3, 50} and nucleus sampling with p ∈ {0.3, 0.9}. (Hedged sketches of the token-level objective and the fine-tuning schedule follow this table.) |
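The setup quoted above centers on the token-level unlikelihood objective (L_UL-token). The following is a minimal, single-sequence PyTorch sketch of that objective; the function name, tensor shapes, and candidate construction are our own simplification, and the official Fairseq implementation in the linked repository handles batching and padding differently.

```python
import torch
import torch.nn.functional as F

def token_unlikelihood_loss(logits: torch.Tensor,
                            targets: torch.Tensor,
                            alpha: float = 1.0) -> torch.Tensor:
    """Single-sequence sketch of the token-level unlikelihood objective.

    logits:  (T, V) per-step scores from a language model
    targets: (T,)   gold next tokens

    At step t, the negative candidates are the previously seen target
    tokens (excluding the gold token for that step); their probability is
    pushed down via -log(1 - p(c | x_<t)), added to the usual MLE term.
    """
    log_probs = F.log_softmax(logits, dim=-1)              # (T, V)
    mle = F.nll_loss(log_probs, targets, reduction="sum")  # standard LM loss

    T, V = log_probs.shape
    prev = targets.unsqueeze(0).expand(T, T)               # prev[t, i] = targets[i]
    keep = torch.ones(T, T, device=targets.device).tril(-1).bool()  # only i < t
    keep &= prev != targets.unsqueeze(1)                   # drop the gold token at t
    rows = keep.nonzero(as_tuple=True)[0]
    cand = torch.zeros(T, V, device=logits.device)
    cand[rows, prev[keep]] = 1.0                           # mark negative candidates

    one_minus_p = torch.clamp(1.0 - log_probs.exp(), min=1e-5)
    ul = -(torch.log(one_minus_p) * cand).sum()            # unlikelihood term
    return mle + alpha * ul
```

The sequence-level variant follows the same idea, except that the negative candidates are tokens belonging to repeated n-grams in a decoded continuation rather than previously seen context tokens.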
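The fine-tuning schedule in the Experiment Setup row (1,500 total updates; with probability 0.5 an update uses the sequence-level loss; prefix length k = 50 and continuation length N = 100) can be outlined as follows. The function signature and the loss callables are our own schematic placeholders, not the paper's or Fairseq's API.

```python
import random

def finetune(model, optimizer, batches, token_loss_fn, seq_loss_fn,
             total_updates=1500, mix_prob=0.5,
             prefix_len=50, continuation_len=100):
    """Schematic outline of the quoted fine-tuning schedule: 1,500 updates,
    each using the sequence-level loss with probability 0.5 and the
    token-level loss otherwise."""
    for _, (prefix, target) in zip(range(total_updates), batches):
        if random.random() < mix_prob:
            # Sequence-level update: decode a length-N continuation from the
            # length-k prefix and penalize its repeated n-grams.
            loss = seq_loss_fn(model, prefix[:prefix_len], continuation_len)
        else:
            # Token-level update (MLE or token-level unlikelihood,
            # depending on the variant being fine-tuned).
            loss = token_loss_fn(model, prefix, target)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```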