Segatron: Segment-Aware Transformer for Language Modeling and Understanding
Authors: He Bai, Peng Shi, Jimmy Lin, Yuqing Xie, Luchen Tan, Kun Xiong, Wen Gao, Ming Li
AAAI 2021, pp. 12526-12534 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset. We further investigate the pre-training masked language modeling task with Segatron. Experimental results show that BERT pre-trained with Segatron (SegaBERT) can outperform BERT with vanilla Transformer on various NLP tasks, and outperforms RoBERTa on zero-shot sentence representation learning. |
| Researcher Affiliation | Collaboration | He Bai (1), Peng Shi (1), Jimmy Lin (1, 2), Yuqing Xie (1), Luchen Tan (2), Kun Xiong (2), Wen Gao (3), Ming Li (1, 2); (1) David R. Cheriton School of Computer Science, University of Waterloo; (2) RSVP.ai; (3) School of Electronics Engineering and Computer Science, Peking University |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available on GitHub: https://github.com/rsvp-ai/segatron |
| Open Datasets | Yes | Dataset WikiText-103 is a large word-level dataset... (Merity et al. 2017). For the pre-training corpus we use English Wikipedia and BookCorpus (Zhu et al. 2015). |
| Dataset Splits | Yes | Dataset WikiText-103 is a large word-level dataset... (Merity et al. 2017). There are 103M tokens, 28K articles for training. The General Language Understanding Evaluation (GLUE) benchmark (Wang et al. 2019) is a collection of resources for evaluating natural language understanding systems... We conduct grid search with the GLUE dev set for small data tasks. |
| Hardware Specification | No | The paper mentions 'computing resources' but does not provide specific hardware details such as GPU or CPU models used for the experiments. |
| Software Dependencies | No | The paper mentions software like NLTK and the Hugging Face Transformers library, but does not provide specific version numbers for these or other software dependencies. |
| Experiment Setup | Yes | The base model is 12 layers, 768 hidden size, and 12 self-attention heads. The large model is 24 layers, 1024 hidden size, and 24 self-attention heads. For optimization, we use Adam with learning rate 1e-4, β1=0.9, β2=0.999, with learning rate warm-up over the first 1% of the total steps and with linear decay of the learning rate. ... The base model is a 16 layer Transformer with a hidden size of 410 and 10 self-attention heads. This model is trained for 200K steps with a batch size of 64. The large model is an 18 layer Transformer with a hidden size of 1024 and 16 attention heads. This model is trained for 350K steps with a batch size of 128. The sequence length and memory length during training and testing all equal 150 for the base model and 384 for the large model. ... For QQP, MNLI, and QNLI, we use the default hyperparameters: 3e-5 learning rate, 256 batch size, and 3 epochs. ... We conduct grid search with the GLUE dev set for small data tasks: CoLA, MRPC, RTE, SST-2, and STS-B. Our grid search space is as follows: Batch size: 16, 24, 32; Learning rate: 2e-5, 3e-5, 5e-5; Number of epochs: 3-10. ... We fine-tune our SegaBERT model with SQuAD v1.1 (Rajpurkar et al. 2016) for 4 epochs with 128 batch size and 3e-5 learning rate. |
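
The optimization settings quoted in the Experiment Setup row (Adam with learning rate 1e-4, β1=0.9, β2=0.999, warm-up over the first 1% of total steps, then linear decay) can be expressed as a short scheduler. The snippet below is a minimal PyTorch sketch, not code from the paper's repository; the `torch.nn.Linear` stand-in model and the `total_steps` value are illustrative placeholders, and the actual step counts differ per model in the quote above.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(768, 768)       # placeholder for the actual Transformer
total_steps = 200_000                   # illustrative total training steps
warmup_steps = int(0.01 * total_steps)  # warm-up over the first 1% of total steps

optimizer = Adam(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

def lr_lambda(step: int) -> float:
    # Linear warm-up from 0 to the base learning rate, then linear decay to 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = LambdaLR(optimizer, lr_lambda)
# In the training loop, call scheduler.step() after each optimizer.step().
```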
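
Similarly, the grid search described for the small GLUE tasks (batch size in {16, 24, 32}, learning rate in {2e-5, 3e-5, 5e-5}, 3-10 epochs, selected on the dev set) spans 72 candidate configurations per task. The sketch below only enumerates that space; the fine-tuning and dev-set scoring step is left as a comment rather than implemented.

```python
from itertools import product

# Grid-search space quoted in the Experiment Setup row for the small GLUE
# tasks (CoLA, MRPC, RTE, SST-2, STS-B); selection is on the GLUE dev set.
batch_sizes = [16, 24, 32]
learning_rates = [2e-5, 3e-5, 5e-5]
epoch_counts = list(range(3, 11))  # 3-10 epochs, inclusive

configs = list(product(batch_sizes, learning_rates, epoch_counts))
print(len(configs))  # 3 * 3 * 8 = 72 candidate configurations per task

# For each configuration, one would fine-tune the pre-trained model and keep
# the setting with the best GLUE dev-set metric (fine-tuning omitted here).
```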