Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Residual Energy-Based Models for Text

Authors: Anton Bakhtin, Yuntian Deng, Sam Gross, Myle Ott, Marc'Aurelio Ranzato, Arthur Szlam

JMLR 2021

Reproducibility Variable Result LLM Response
Research Type: Experimental
  "We find experimentally that the answer is affirmative when we have access to the training data for the model, and guardedly affirmative even if we do not. This suggests that the auto-regressive models can be improved by incorporating the (globally normalized) discriminators into the generative process. We give a formalism for this using the Energy-Based Model framework, and show that it indeed improves the results of the generative models, measured both in terms of perplexity and in terms of human evaluation."
Researcher Affiliation: Collaboration
  Anton Bakhtin (EMAIL), Facebook AI Research, 770 Broadway, New York, NY 10003, U.S.A.; Yuntian Deng (EMAIL), Harvard University, 33 Oxford St., Cambridge, MA 02138, U.S.A.; Sam Gross (EMAIL); Myle Ott (EMAIL); Marc'Aurelio Ranzato (EMAIL); Arthur Szlam (EMAIL)
Pseudocode: Yes
  "Algorithm 1: Top-k Joint Sampling"
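The paper's Algorithm 1 draws several candidate continuations from the base language model using top-k sampling, then resamples one of them with probability proportional to exp(-E(x)) under the residual energy function. A minimal Python sketch of that resampling step follows; `lm_sample_top_k` and `energy` are hypothetical stand-ins for the base LM's top-k sampler and the trained energy network, not the authors' implementation.

```python
import math
import random

def top_k_joint_sample(lm_sample_top_k, energy, n=10):
    """Sketch of top-k joint sampling from a residual EBM.

    lm_sample_top_k: callable returning one candidate continuation drawn
        from the base LM with top-k sampling (hypothetical stand-in).
    energy: callable scoring a continuation; lower energy = more likely.
    n: number of candidate continuations to draw before resampling.
    """
    candidates = [lm_sample_top_k() for _ in range(n)]
    # Resample one candidate with probability proportional to exp(-E(x)).
    weights = [math.exp(-energy(x)) for x in candidates]
    total = sum(weights)
    r = random.random() * total
    acc = 0.0
    for x, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return x
    return candidates[-1]  # numerical-edge fallback
```

In effect the base LM acts as a proposal distribution and the energy network reweights its samples, which is why the procedure needs only forward passes through both models.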
Open Source Code: No
  The paper does not explicitly provide a link to the authors' own source code for the methodology described. It mentions using models from the Hugging Face repository and the OpenAI GPT-2 repository, and links to NVIDIA/apex for mixed precision training, but these are third-party resources or tools used, not the authors' implementation of the residual EBMs.
Open Datasets: Yes
  "Books: The Toronto books corpus described in Zhu et al. (2015); Kiros et al. (2015), which consists of fiction books in 16 different genres, totaling about half a billion words. CCNews: We collect a de-duplicated subset of the English portion of the Common Crawl news dataset (Nagel, 2016), which totals around 16 billion words. Wikitext: The wikitext103 dataset from Merity et al. (2016), which consists of 103 million words from English Wikipedia articles." The paper also notes a release of GPT-2 language model generations (Radford and Wu, 2019) for the purpose of training discriminators capable of detecting machine-generated text.
Dataset Splits: Yes
  "On Wikitext and Books, we extract positive sequences from windows of text that are 160 tokens long with a stride of 40. On the larger CCNews we do the same except that we stride by 160 tokens. This protocol to mine positives is used both at training and test time, although at test time we limit the evaluation to 60,000 randomly chosen positive samples. Note that each corpus has distinct training and test parts. As a result, even when C_train = C_test, the discriminator is tested using positives and negatives derived from the test part of C_test."
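The windowing protocol quoted above can be sketched as follows. This is a reconstruction, not the authors' code; `extract_windows` is a hypothetical helper operating on an already-tokenized corpus.

```python
def extract_windows(tokens, window=160, stride=40):
    """Mine fixed-length positive sequences from a token stream:
    overlapping windows of `window` tokens, one every `stride` tokens.
    With the defaults this mirrors the Wikitext/Books protocol
    (window=160, stride=40); CCNews would use stride=160."""
    return [tokens[i:i + window]
            for i in range(0, len(tokens) - window + 1, stride)]
```

With stride 40 each token appears in up to four windows, whereas stride 160 yields non-overlapping windows, which keeps the much larger CCNews corpus tractable.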
Hardware Specification: Yes
  "We use data-parallel synchronous multi-GPU training with up to 24 nodes, each with 8 Nvidia V100 GPUs."
Software Dependencies: No
  "All models are implemented using the PyTorch framework (Paszke et al., 2017) and are optimized using Adam (Kingma and Ba, 2015). To improve training speed, we use mixed precision training. Following common practice we clip the norm of the gradient vector (Pascanu et al., 2013). More details about hyper-parameter setting can be found in Appendix Table 11, while Table 3 reports the number of parameters of each classifier." The paper mentions PyTorch, Adam, and mixed precision training (linking to NVIDIA/apex), but does not provide specific version numbers for PyTorch or other key software libraries used in the implementation.
Experiment Setup: Yes
  "We use Adam (Kingma and Ba, 2015) optimizer with warmup. We use data-parallel synchronous multi-GPU training with up to 24 nodes, each with 8 Nvidia V100 GPUs. To improve training speed, we use mixed precision training. Following common practice we clip the norm of the gradient vector (Pascanu et al., 2013). More details about hyper-parameter setting can be found in Appendix Table 11, while Table 3 reports the number of parameters of each classifier. All models are implemented using the PyTorch framework (Paszke et al., 2017) and are optimized using Adam (Kingma and Ba, 2015). To train our biggest models (UniT and BiT) we used several machines each with 8 GPUs in synchronous mode using data parallelism. The resulting large batch size speeds up training when combined with float16 reduced precision and cosine scheduling of the learning rate without any restarts (Loshchilov and Hutter, 2016), i.e. we decay the learning rate to zero over the course of max steps updates and then stop training. Using these methods, we reduced training time by five times compared to a single node training. For simpler models we used a single node with up to 8 GPUs and inverse square root decay." Tables 11 and 12 provide specific hyper-parameter values for 'max lr', 'bsz (/GPU)', 'GPUs', 'fp16', 'warmup steps', 'max steps', and 'max grad norm'.
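The learning-rate schedule described for the largest models, linear warmup followed by a single cosine decay to zero at max steps with no restarts, can be sketched as below. This is my reconstruction, not the authors' code; `max_lr`, `warmup_steps`, and `max_steps` are assumed to correspond to the hyper-parameters listed in the appendix tables.

```python
import math

def cosine_lr(step, max_lr, max_steps, warmup_steps):
    """Linear warmup to max_lr over warmup_steps, then one cosine
    decay to zero at max_steps (no restarts), after which training
    stops. A reconstruction of the schedule described in the paper."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return 0.5 * max_lr * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```

Decaying all the way to zero by `max_steps` and then stopping is what distinguishes this from warm-restart variants of cosine annealing, and it pairs naturally with the fixed `max steps` budget reported in the hyper-parameter tables.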