XLNet: Generalized Autoregressive Pretraining for Language Understanding

Authors: Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Russ R. Salakhutdinov, Quoc V. Le

NeurIPS 2019

| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | "Empirically, under comparable experiment setting, XLNet outperforms BERT on 20 tasks, often by a large margin, including question answering, natural language inference, sentiment analysis, and document ranking." (Section 3: Experiments) |
| Researcher Affiliation | Collaboration | Zhilin Yang (1), Zihang Dai (1,2), Yiming Yang (1), Jaime Carbonell (1), Ruslan Salakhutdinov (1), Quoc V. Le (2); (1) Carnegie Mellon University, (2) Google AI Brain Team; {zhiliny,dzihang,yiming,jgc,rsalakhu}@cs.cmu.edu, qvl@google.com |
| Pseudocode | No | The paper describes the method using equations and diagrams (Figure 1), but does not provide any explicitly labeled "Pseudocode" or "Algorithm" blocks. |
| Open Source Code | Yes | "Pretrained models and code are available at https://github.com/zihangdai/xlnet" |
| Open Datasets | Yes | "we use the BooksCorpus [40] and English Wikipedia as part of our pretraining data ... In addition, we include Giga5 (16GB text) [26], ClueWeb 2012-B (extended from [5]), and Common Crawl [6] for pretraining." |
| Dataset Splits | Yes | "We use the provided training/validation/test splits." |
| Hardware Specification | Yes | "Specifically, we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192, which takes about 5.5 days." (A sketch of such a linear-decay schedule appears below the table.) |
| Software Dependencies | No | The paper mentions SentencePiece [17] for tokenization, but does not provide specific version numbers for any software dependencies or libraries (e.g., Python, TensorFlow/PyTorch versions). |
| Experiment Setup | Yes | "During pretraining, we always use a full sequence length of 512. ... we train on 512 TPU v3 chips for 500K steps with an Adam weight decay optimizer, linear learning rate decay, and a batch size of 8192 ... we set the partial prediction constant K as 6 (see Section 2.3). Our finetuning procedure follows BERT [10] except otherwise specified. We employ an idea of span-based prediction, where we first sample a length L ∈ [1, ..., 5], and then randomly select a consecutive span of L tokens as prediction targets within a context of (K·L) tokens. Hyperparameters for pretraining and finetuning are in Appendix A.4." (A minimal span-sampling sketch follows the table.) |
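To make the Experiment Setup row concrete, here is a minimal sketch of the span-based prediction sampling it quotes: sample a length L in [1, 5], then pick a consecutive span of L tokens as prediction targets within a context of K·L tokens (K = 6). The function name, the random placement of the context window inside the 512-token sequence, and the use of Python's `random` module are illustrative assumptions, not details taken from the released XLNet code.

```python
import random

def sample_prediction_span(seq_len=512, K=6, max_span_len=5, rng=random):
    """Sketch of span-based prediction sampling as described in the paper.

    Sample a span length L in [1, max_span_len], then choose a consecutive
    span of L tokens as prediction targets within a context of K * L tokens.
    """
    L = rng.randint(1, max_span_len)               # span length L in [1, ..., 5]
    ctx_len = K * L                                # context of (K * L) tokens
    ctx_start = rng.randint(0, seq_len - ctx_len)  # place the context window (assumption)
    span_start = ctx_start + rng.randint(0, ctx_len - L)
    target_positions = list(range(span_start, span_start + L))
    return ctx_start, ctx_len, target_positions

# Example: target positions to predict inside one 512-token pretraining sequence.
print(sample_prediction_span())
```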
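The Hardware Specification row mentions an Adam weight decay optimizer with linear learning rate decay over 500K steps. Below is a hedged PyTorch sketch of such a schedule, assuming AdamW with a linear decay to zero; the learning rate and weight decay values are placeholders rather than the paper's hyperparameters (those are listed in Appendix A.4), and the stand-in `Linear` module is not the actual XLNet model.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(1024, 1024)  # stand-in module, not the XLNet transformer
total_steps = 500_000                # 500K pretraining steps, as reported

# AdamW = Adam with decoupled weight decay; lr and weight_decay are placeholders,
# not the paper's values (see Appendix A.4).
optimizer = AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)

# Linear learning-rate decay from the initial lr down to 0 over total_steps.
scheduler = LambdaLR(optimizer,
                     lr_lambda=lambda step: max(0.0, 1.0 - step / total_steps))

# Skeleton of the update loop; in practice it runs for all total_steps.
for step in range(3):
    # forward / backward pass on a batch of 8192 sequences would go here
    optimizer.step()       # apply the Adam-with-weight-decay update
    scheduler.step()       # linearly decay the learning rate
    optimizer.zero_grad()
```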