Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
XLNet: Generalized Autoregressive Pretraining for Language Understanding
Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, Quoc V. Le
NeurIPS 2019 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, under comparable experiment setting, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking. ... 3 Experiments |
| Researcher Affiliation | Collaboration | Zhilin Yang¹, Zihang Dai¹,², Yiming Yang¹, Jaime Carbonell¹, Ruslan Salakhutdinov¹, Quoc V. Le² ¹Carnegie Mellon University, ²Google AI Brain Team EMAIL, EMAIL |
| Pseudocode | No | The paper describes the method using equations and diagrams (Figure 1), but does not provide any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Pretrained models and code are available at https://github.com/zihangdai/xlnet |
| Open Datasets | Yes | we use the BooksCorpus [40] and English Wikipedia as part of our pretraining data... In addition, we include Giga5 (16GB text) [26], ClueWeb 2012-B (extended from [5]), and Common Crawl [6] for pretraining. |
| Dataset Splits | Yes | We use the provided training/validation/test splits. |
| Hardware Specification | Yes | Specifically, we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days. |
| Software Dependencies | No | The paper mentions 'SentencePiece [17]' for tokenization, but does not provide specific version numbers for any software dependencies or libraries (e.g., Python, TensorFlow/PyTorch versions). |
| Experiment Setup | Yes | During pretraining, we always use a full sequence length of 512. ... we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192... we set the partial prediction constant K as 6 (see Section 2.3). Our finetuning procedure follows BERT [10] except otherwise specified. We employ an idea of span-based prediction, where we first sample a length $L \in [1, \dots, 5]$, and then randomly select a consecutive span of $L$ tokens as prediction targets within a context of $(KL)$ tokens. Hyperparameters for pretraining and finetuning are in Appendix A.4. |
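
The span-based prediction quoted in the Experiment Setup row can be illustrated with a minimal Python sketch. It is not taken from the XLNet codebase: the function name `sample_span_targets`, the uniform sampling of the span length and positions, and the clipping to the sequence length are all assumptions made for illustration. Only the constants (K = 6, span length between 1 and 5) come from the quoted text.

```python
import random


def sample_span_targets(seq_len, k=6, max_span=5, rng=None):
    """Hypothetical helper: pick one target span inside a K*L-token context.

    Returns half-open index ranges:
    (context_start, context_end, target_start, target_end).
    """
    rng = rng or random.Random()
    # Sample the span length L from {1, ..., max_span} (uniform is an assumption).
    span_len = rng.randint(1, max_span)
    # The context holds K * L tokens, clipped to the sequence length.
    context_len = min(k * span_len, seq_len)
    span_len = min(span_len, context_len)
    # Place the context window anywhere in the sequence (assumption).
    context_start = rng.randint(0, seq_len - context_len)
    # Place the L consecutive target tokens anywhere inside the context.
    target_start = context_start + rng.randint(0, context_len - span_len)
    return (
        context_start,
        context_start + context_len,
        target_start,
        target_start + span_len,
    )


if __name__ == "__main__":
    print(sample_span_targets(seq_len=512, rng=random.Random(0)))
```

Under these assumptions, roughly 1/K of the tokens in each sampled context are prediction targets, which matches the role of the partial prediction constant K described in the quoted passage.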