Towards Semantics-Enhanced Pre-Training: Can Lexicon Definitions Help Learning Sentence Meanings?

Authors: Xuancheng Ren, Xu Sun, Houfeng Wang, Qun Liu (pp. 13736-13744)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To verify whether the proposed method can enhance the semantic understanding of sentences, we conduct both intrinsic evaluation that inspects knowledge learned by the pre-trained models themselves and extrinsic evaluation on semantics-oriented downstream tasks with fine-tuning.
Researcher Affiliation | Collaboration | Xuancheng Ren (1), Xu Sun (1, 2), Houfeng Wang (1), Qun Liu (3); (1) MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University; (2) Center for Data Science, Peking University; (3) Huawei Noah's Ark Lab
Pseudocode | No | The paper describes the methods in text and mathematical formulas but does not include pseudocode or algorithm blocks.
Open Source Code | Yes | The code and the appendix are available at https://github.com/lancopku/sempre
Open Datasets | Yes | For general-purpose pre-training, we adopt the pre-trained RoBERTa-base and RoBERTa-large models (Liu et al. 2019)... They are trained on a combined corpus including fictions, encyclopedia, and news, totaling over 160GB text... For semantics-focused pre-training, the models are trained on word-definition pairs... we extract 0.2M word-definitions and 1.4M word-definition pairs in 23 relations from WordNet (Miller 1995). (A sketch of extracting such pairs from WordNet is given after the table.)
Dataset Splits | Yes | We adopt early stopping based on validation accuracy and report the results of the best-scoring configuration on the validation set. For the testing protocol, we follow Zhou et al. (2020).
Hardware Specification | No | The paper mentions using "computation resources" but does not specify any particular hardware, such as CPU or GPU models or memory capacity, used for the experiments.
Software Dependencies | No | The paper states "Our implementation is based on the fairseq (Ott et al. 2019) package" but does not give a specific version for fairseq or for any other software dependency.
Experiment Setup | Yes | We use a batch size of 2048 sequences, a peak learning rate of 2 × 10^-5 with linear warm-up and decay peaked at the 295th update, scheduled for at most 6910 updates, and keep at most 128 tokens of a sequence. For fine-tuning, the batch size is 32. Each configuration is run multiple times with different random starts. We adopt early stopping based on validation accuracy and report the results of the best-scoring configuration on the validation set. For downstream fine-tuning, following Liu et al. (2019) and Bisk et al. (2020), we conduct a grid search with respect to certain hyper-parameters, i.e., the learning rates [1 × 10^-5, 2 × 10^-5, 3 × 10^-5] and the maximum epochs [10, 50]. (Both the warm-up/decay schedule and the grid-search protocol are sketched after the table.)
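
The WordNet figures quoted in the Open Datasets row (0.2M word-definitions, 1.4M pairs over 23 relations) come from the paper. Below is a minimal sketch of how word-definition pairs can be enumerated from WordNet with NLTK; it is an illustration only, not the authors' released extraction pipeline (that code lives at https://github.com/lancopku/sempre), and it does not cover the 23 relation types.

```python
# Minimal sketch: enumerating word-definition pairs from WordNet via NLTK.
# Illustrative only; the authors' actual extraction is in their released code.
import nltk

nltk.download("wordnet", quiet=True)  # fetch the WordNet data if missing
from nltk.corpus import wordnet as wn

pairs = []
for synset in wn.all_synsets():
    definition = synset.definition()      # gloss of the synset
    for lemma in synset.lemma_names():    # surface forms belonging to the synset
        pairs.append((lemma.replace("_", " "), definition))

print(f"{len(pairs)} word-definition pairs")
print(pairs[:3])
```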
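The pre-training numbers in the Experiment Setup row (peak learning rate 2e-5, warm-up peaking at update 295, at most 6910 updates) describe a linear warm-up followed by a decay. The sketch below shows one such schedule; the exact decay shape (linear decay to zero, as in a polynomial-decay scheduler with power 1) is an assumption, since the quoted text does not specify it.

```python
def learning_rate(step: int,
                  peak_lr: float = 2e-5,
                  warmup_updates: int = 295,
                  total_updates: int = 6910) -> float:
    """Linear warm-up to peak_lr, then linear decay.

    The constants mirror the values quoted in the Experiment Setup row;
    the decay-to-zero endpoint is an assumption about the scheduler.
    """
    if step < warmup_updates:
        return peak_lr * step / warmup_updates
    remaining = max(total_updates - step, 0)
    return peak_lr * remaining / (total_updates - warmup_updates)

# e.g. learning_rate(0) == 0.0, learning_rate(295) == 2e-5,
# and the rate reaches 0.0 again by update 6910.
```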
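The fine-tuning protocol (batch size 32, grid search over learning rates and maximum epochs, several random starts, selection by validation accuracy) can be written as a small search loop. The `fine_tune` function below is a hypothetical stub standing in for an actual fairseq fine-tuning run, and the number of random seeds is assumed; only the search and selection logic reflects the quoted setup.

```python
import itertools
import random

def fine_tune(lr: float, max_epochs: int, seed: int, batch_size: int = 32):
    """Hypothetical stand-in for a fairseq fine-tuning run.

    A real run would train the model and return per-epoch validation
    accuracies; here we return reproducible random numbers so the sketch runs.
    """
    rng = random.Random(hash((lr, max_epochs, seed)))
    return [rng.uniform(0.6, 0.9) for _ in range(max_epochs)]

learning_rates = [1e-5, 2e-5, 3e-5]   # grid from the Experiment Setup row
max_epochs_grid = [10, 50]
seeds = [1, 2, 3]                     # "multiple random starts"; count assumed

best_acc, best_config = float("-inf"), None
for lr, max_epochs in itertools.product(learning_rates, max_epochs_grid):
    for seed in seeds:
        val_accs = fine_tune(lr=lr, max_epochs=max_epochs, seed=seed)
        val_acc = max(val_accs)       # early stopping: keep the best validation epoch
        if val_acc > best_acc:
            best_acc = val_acc
            best_config = {"lr": lr, "max_epochs": max_epochs, "seed": seed}

print("best validation accuracy:", round(best_acc, 4), "with", best_config)
```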