XLNet: Generalized Autoregressive Pretraining for Language Understanding

Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, Quoc V. Le

NeurIPS 2019

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirically, under comparable experiment setting, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking." (Section 3: Experiments) |
| Researcher Affiliation | Collaboration | Zhilin Yang (1), Zihang Dai (1,2), Yiming Yang (1), Jaime Carbonell (1), Ruslan Salakhutdinov (1), Quoc V. Le (2); (1) Carnegie Mellon University, (2) Google AI Brain Team; {zhiliny,dzihang,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com |
| Pseudocode | No | The paper describes the method using equations and diagrams (Figure 1), but does not provide any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | "Pretrained models and code are available at https://github.com/zihangdai/xlnet" |
| Open Datasets | Yes | "we use the BooksCorpus [40] and English Wikipedia as part of our pretraining data ... In addition, we include Giga5 (16GB text) [26], ClueWeb 2012-B (extended from [5]), and Common Crawl [6] for pretraining." |
| Dataset Splits | Yes | "We use the provided training/validation/test splits." |
| Hardware Specification | Yes | "Specifically, we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days." (A sketch of such a linear-decay schedule appears below the table.) |
| Software Dependencies | No | The paper mentions SentencePiece [17] for tokenization, but does not provide specific version numbers for any software dependencies or libraries (e.g., Python, TensorFlow/PyTorch versions). |
| Experiment Setup | Yes | "During pretraining, we always use a full sequence length of 512. ... we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192 ... we set the partial prediction constant K as 6 (see Section 2.3). Our finetuning procedure follows BERT [10] except otherwise specified. We employ an idea of span-based prediction, where we first sample a length L ∈ [1, ..., 5], and then randomly select a consecutive span of L tokens as prediction targets within a context of (K·L) tokens. Hyperparameters for pretraining and finetuning are in Appendix A.4." (A minimal span-sampling sketch follows the table.) |
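To make the Experiment Setup row concrete, here is a minimal sketch of the span-based prediction sampling it quotes: sample a length L in [1, 5], then pick a consecutive span of L tokens as prediction targets within a context of K·L tokens (K = 6). The function name, the random placement of the context window inside the 512-token sequence, and the use of Python's `random` module are illustrative assumptions, not details taken from the released XLNet code.

```python
import random

def sample_prediction_span(seq_len=512, K=6, max_span_len=5, rng=random):
    """Sketch of span-based prediction sampling as described in the paper.

    Sample a span length L in [1, max_span_len], then choose a consecutive
    span of L tokens as prediction targets within a context of K * L tokens.
    """
    L = rng.randint(1, max_span_len)               # span length L in [1, ..., 5]
    ctx_len = K * L                                # context of (K * L) tokens
    ctx_start = rng.randint(0, seq_len - ctx_len)  # place the context window (assumption)
    span_start = ctx_start + rng.randint(0, ctx_len - L)
    target_positions = list(range(span_start, span_start + L))
    return ctx_start, ctx_len, target_positions

# Example: target positions to predict inside one 512-token pretraining sequence.
print(sample_prediction_span())
```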
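The Hardware Specification row mentions an Adam weight decay optimizer with linear learning rate decay over 500K steps. Below is a hedged PyTorch sketch of such a schedule, assuming AdamW with a linear decay to zero; the learning rate and weight decay values are placeholders rather than the paper's hyperparameters (those are listed in Appendix A.4), and the stand-in `Linear` module is not the actual XLNet model.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(1024, 1024)  # stand-in module, not the XLNet transformer
total_steps = 500_000                # 500K pretraining steps, as reported

# AdamW = Adam with decoupled weight decay; lr and weight_decay are placeholders,
# not the paper's values (see Appendix A.4).
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Linear learning-rate decay from the initial lr down to 0 over total_steps.
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))

# Skeleton of the update loop; in practice it runs for all total_steps.
for step in range(3):
    # forward / backward pass on a batch of 8192 sequences would go here
    optimizer.step()       # apply the Adam-with-weight-decay update
    scheduler.step()       # linearly decay the learning rate
    optimizer.zero_grad()
```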