ConvBERT: Improving BERT with Span-based Dynamic Convolution

Authors: Zi-Hang Jiang, Weihao Yu, Daquan Zhou, Yunpeng Chen, Jiashi Feng, Shuicheng Yan

NeurIPS 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments have shown that ConvBERT significantly outperforms BERT and its variants in various downstream tasks, with lower training cost and fewer model parameters. Remarkably, the ConvBERT_BASE model achieves an 86.4 GLUE score, 0.7 higher than ELECTRA_BASE, while using less than 1/4 of the training cost. Code and pre-trained models will be released. (A hedged sketch of the span-based dynamic convolution operator behind these results is given after the table.)
Researcher Affiliation | Collaboration | Zihang Jiang¹, Weihao Yu¹, Daquan Zhou¹, Yunpeng Chen², Jiashi Feng¹, Shuicheng Yan²; ¹National University of Singapore, ²Yitu Technology
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and pre-trained models will be released. https://github.com/yitu-opensource/ConvBert
Open Datasets | Yes | In this paper, unless otherwise stated, we train the models on an open-sourced dataset, OpenWebText [14, 41] (32G), to ease reproduction, which is of similar size to the combination of English Wikipedia and BooksCorpus that is used for BERT training. We also show the results of our model trained on the same data as BERT (i.e. WikiBooks) in the Appendix. (A minimal dataset-loading sketch follows the table.)
Dataset Splits | Yes | We evaluate our model on the General Language Understanding Evaluation (GLUE) benchmark [53] as well as the question answering task SQuAD [43]. The GLUE benchmark includes various tasks which are formatted as single-sentence or sentence-pair classification. See the Appendix for more details of all tasks. SQuAD is a question answering dataset in which each example consists of a context, a question and an answer from the context. The target is to locate the answer given the context and question. In SQuAD v1.1 the answers are always contained in the context, whereas in v2.0 some answers are not included in the context. We measure accuracy for MNLI, QNLI, QQP, RTE and SST, Spearman correlation for STS, and Matthews correlation for CoLA. The GLUE score is the average over all 8 tasks. Since there is nearly no single-model submission on the SQuAD leaderboard, we only compare ours with other models on the development set. We report the Exact Match and F1 scores on the development sets of both v1.1 and v2.0. (A metric-computation sketch follows the table.)
Hardware Specification | No | The paper mentions 'computation resource limitations' and acknowledges support for 'computational resources', but does not specify any particular CPU, GPU, or other hardware model used for the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., 'PyTorch 1.9', 'TensorFlow 2.0').
Experiment Setup | Yes | During pre-training, the batch size is set to 128 and 256 for the small-sized and base-sized models, respectively. An input sequence of length 128 is used to update the model. We show the results of these models after pre-training for 1M updates as well as pre-training longer for 4M updates. More detailed hyper-parameters for pre-training and fine-tuning are listed in the Appendix. It can be observed that a larger kernel gives better results as long as the receptive field has not covered the whole input sentence. However, when the kernel size is large enough that the receptive field covers all the input tokens, the benefit of using a large kernel size diminishes. In later experiments, if not otherwise stated, we set the convolution kernel size to 9 for all dynamic convolutions since it gives the best result. (The stated hyper-parameters are collected in a config sketch after the table.)
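
The span-based dynamic convolution named in the title is the paper's core operator: instead of attending to all positions, each token's output is produced by a depthwise convolution whose kernel is generated from a local span around that token. Below is a minimal PyTorch sketch of that idea, not the authors' released implementation; the module name, projection layout and tensor shapes are illustrative assumptions, and the kernel size of 9 mirrors the value quoted in the Experiment Setup row.

```python
# Minimal sketch (assumption: PyTorch, single head) of span-based dynamic
# convolution: a span-aware "key" is built with a depthwise-separable conv,
# per-position kernels are generated from query * key, and the kernels are
# applied to sliding windows of the value projection.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpanDynamicConv(nn.Module):
    def __init__(self, hidden_size: int, kernel_size: int = 9):
        super().__init__()
        self.kernel_size = kernel_size
        self.query = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)
        # Depthwise-separable conv summarizes a local span into the key.
        self.span_key = nn.Sequential(
            nn.Conv1d(hidden_size, hidden_size, kernel_size,
                      padding=kernel_size // 2, groups=hidden_size),
            nn.Conv1d(hidden_size, hidden_size, 1),
        )
        # Projects (query * span_key) to one kernel weight per tap.
        self.kernel_proj = nn.Linear(hidden_size, kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden)
        b, t, h = x.shape
        q = self.query(x)                                      # (b, t, h)
        k = self.span_key(x.transpose(1, 2)).transpose(1, 2)   # (b, t, h)
        v = self.value(x)                                      # (b, t, h)
        # Position-specific kernels, normalized over the kernel dimension.
        kernels = F.softmax(self.kernel_proj(q * k), dim=-1)   # (b, t, K)
        # Sliding windows of length kernel_size over the value sequence.
        v = v.transpose(1, 2)                                  # (b, h, t)
        windows = F.unfold(v.unsqueeze(3),
                           kernel_size=(self.kernel_size, 1),
                           padding=(self.kernel_size // 2, 0)) # (b, h*K, t)
        windows = windows.view(b, h, self.kernel_size, t)
        # Weighted sum of each window with its position-specific kernel.
        return torch.einsum("bhkt,btk->bth", windows, kernels)


if __name__ == "__main__":
    layer = SpanDynamicConv(hidden_size=256, kernel_size=9)
    tokens = torch.randn(2, 128, 256)
    print(layer(tokens).shape)  # torch.Size([2, 128, 256])
```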
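For the OpenWebText corpus quoted in the Open Datasets row, a minimal loading sketch is shown below, assuming the Hugging Face `datasets` library; the paper does not specify any particular loading tool, and the hub identifier may differ between library versions.

```python
# Hypothetical loader for the OpenWebText corpus (assumption: the Hugging Face
# `datasets` hub entry named "openwebtext"); the paper only names the corpus.
from datasets import load_dataset

openwebtext = load_dataset("openwebtext", split="train")
print(len(openwebtext))              # number of documents
print(openwebtext[0]["text"][:200])  # first 200 characters of one document
```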
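The Dataset Splits row lists the per-task metrics and defines the GLUE score as a plain average over the 8 tasks. A small sketch of that aggregation is given below; the task names, dictionary layout and 0-100 scaling are assumptions for illustration, not the official GLUE evaluation script.

```python
# Sketch of the per-task metrics quoted above and the GLUE-score average.
# Assumptions: predictions/labels are plain Python sequences, and per-task
# scores in [0, 1] are averaged and scaled to 0-100 (cf. the reported 86.4).
from scipy.stats import spearmanr
from sklearn.metrics import accuracy_score, matthews_corrcoef


def task_metric(task, preds, labels):
    if task == "STS":                        # regression: Spearman correlation
        return spearmanr(preds, labels).correlation
    if task == "CoLA":                       # Matthews correlation
        return matthews_corrcoef(labels, preds)
    return accuracy_score(labels, preds)     # MNLI, QNLI, QQP, RTE, SST


def glue_score(per_task_scores):
    # per_task_scores: dict mapping each of the 8 tasks to its metric value.
    return 100.0 * sum(per_task_scores.values()) / len(per_task_scores)
```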
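Finally, the pre-training settings stated in the Experiment Setup row can be summarized as a plain configuration dictionary. Only values quoted from the paper are filled in; the field names are illustrative, and unspecified settings (optimizer, learning rate, and so on) are deliberately left out rather than guessed.

```python
# Hedged summary of the quoted pre-training setup; field names are
# illustrative assumptions, values come from the paper excerpt above.
PRETRAIN_CONFIG = {
    "small": {"batch_size": 128, "max_seq_length": 128},
    "base":  {"batch_size": 256, "max_seq_length": 128},
    "train_steps": (1_000_000, 4_000_000),  # 1M updates, plus a longer 4M run
    "dynamic_conv_kernel_size": 9,          # best-performing kernel size
}
```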