XLNet: Generalized Autoregressive Pretraining for Language Understanding
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, Quoc V. Le
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, under comparable experiment setting, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking. (Abstract; Section 3: Experiments) |
| Researcher Affiliation | Collaboration | Zhilin Yang¹, Zihang Dai¹·², Yiming Yang¹, Jaime Carbonell¹, Ruslan Salakhutdinov¹, Quoc V. Le²; ¹Carnegie Mellon University, ²Google AI Brain Team; {zhiliny,dzihang,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com |
| Pseudocode | No | The paper describes the method using equations and diagrams (Figure 1), but does not provide any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | 1Pretrained models and code are available at https://github.com/zihangdai/xlnet |
| Open Datasets | Yes | we use the BooksCorpus [40] and English Wikipedia as part of our pretraining data... In addition, we include Giga5 (16GB text) [26], ClueWeb 2012-B (extended from [5]), and Common Crawl [6] for pretraining. |
| Dataset Splits | Yes | We use the provided training/validation/test splits. |
| Hardware Specification | Yes | Specifically, we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days. |
| Software Dependencies | No | The paper mentions 'SentencePiece [17]' for tokenization, but does not provide specific version numbers for any software dependencies or libraries (e.g., Python, TensorFlow/PyTorch versions). |
| Experiment Setup | Yes | During pretraining, we always use a full sequence length of 512. ... we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192 ... we set the partial prediction constant K as 6 (see Section 2.3). Our finetuning procedure follows BERT [10] except otherwise specified. We employ an idea of span-based prediction, where we first sample a length L ∈ [1, ..., 5], and then randomly select a consecutive span of L tokens as prediction targets within a context of (K·L) tokens. Hyperparameters for pretraining and finetuning are in Appendix A.4. |
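
The span-based partial prediction quoted in the Experiment Setup row is the most algorithm-like detail reported. The sketch below illustrates that sampling step under the stated settings only (K = 6, span length L drawn uniformly from 1 to 5, the L target tokens placed inside a context window of K·L tokens, so roughly 1/K of the context serves as prediction targets). It is a minimal illustration, not code from the released XLNet repository; the function and variable names are ours.

```python
import random

def sample_prediction_span(seq_len=512, K=6, max_span=5):
    """Illustrative sketch of XLNet-style span-based partial prediction.

    Samples a span length L in [1, max_span], then places a consecutive
    span of L prediction targets inside a context window of K * L tokens
    taken from a pretraining sequence of length seq_len.
    Names and structure are assumptions, not the authors' implementation.
    """
    L = random.randint(1, max_span)            # span length L in [1, ..., 5]
    ctx_len = K * L                            # context of (K * L) tokens
    # choose where the context window starts within the full sequence
    ctx_start = random.randint(0, seq_len - ctx_len)
    # choose where the L-token target span starts within that context
    span_start = ctx_start + random.randint(0, ctx_len - L)
    target_positions = list(range(span_start, span_start + L))
    return ctx_start, ctx_start + ctx_len, target_positions

# Example draw for a 512-token pretraining sequence with K = 6
ctx_lo, ctx_hi, targets = sample_prediction_span()
print(f"context tokens [{ctx_lo}, {ctx_hi}), predicted positions {targets}")
```

With K = 6 the target span never exceeds one sixth of its context window, which matches the paper's description of predicting only a subset of tokens per sequence; the remaining hyperparameters are listed in Appendix A.4 of the paper.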