ALBERT: A Lite BERT for Self-supervised Learning of Language Representations

Authors: Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Comprehensive empirical evidence shows that our proposed methods lead to models that scale much better compared to the original BERT.
Researcher Affiliation | Collaboration | Google Research; Toyota Technological Institute at Chicago
Pseudocode | No | The paper describes the model architecture and techniques in text and tables but does not include any pseudocode or algorithm blocks.
Open Source Code | Yes | The code and the pretrained models are available at https://github.com/google-research/ALBERT.
Open Datasets | Yes | To keep the comparison as meaningful as possible, we follow the BERT (Devlin et al., 2019) setup in using the BOOKCORPUS (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2019) for pretraining baseline models.
Dataset Splits | Yes | To monitor the training progress, we create a development set based on the development sets from SQuAD and RACE using the same procedure as in Sec. 4.1. We report accuracies for both MLM and sentence classification tasks.
Hardware Specification | Yes | Training was done on Cloud TPU V3. The number of TPUs used for training ranged from 64 to 512, depending on model size.
Software Dependencies | No | The paper mentions tools such as SentencePiece and the LAMB optimizer, but does not provide specific version numbers for any software dependencies required to replicate the experiments.
Experiment Setup | Yes | All the model updates use a batch size of 4096 and a LAMB optimizer with learning rate 0.00176 (You et al., 2019). We train all models for 125,000 steps unless otherwise specified. (Section 4.1) "Hyperparameters for downstream tasks are shown in Table 14." (Appendix A.4) See the configuration sketch after this table.
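The Experiment Setup row quotes the concrete pretraining hyperparameters from Section 4.1 of the paper. The sketch below simply collects those quoted values into a plain Python dictionary for quick reference; the key names are illustrative assumptions and are not the configuration schema used by the ALBERT repository.

```python
# Minimal sketch: ALBERT pretraining hyperparameters quoted above (Section 4.1),
# gathered into a plain dictionary. Key names are illustrative, not the actual
# configuration format of https://github.com/google-research/ALBERT.
albert_pretraining_setup = {
    "train_batch_size": 4096,      # "All the model updates use a batch size of 4096"
    "optimizer": "LAMB",           # LAMB optimizer (You et al., 2019)
    "learning_rate": 0.00176,      # learning rate reported in Section 4.1
    "num_train_steps": 125_000,    # "We train all models for 125,000 steps unless otherwise specified"
    "hardware": "Cloud TPU V3",    # 64 to 512 TPUs, depending on model size
}

if __name__ == "__main__":
    for key, value in albert_pretraining_setup.items():
        print(f"{key}: {value}")
```

Downstream-task hyperparameters (Table 14, Appendix A.4 of the paper) are not reproduced in this sketch.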