Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
Towards Semantics-Enhanced Pre-Training: Can Lexicon Definitions Help Learning Sentence Meanings?
Authors: Xuancheng Ren, Xu Sun, Houfeng Wang, Qun Liu (pp. 13736-13744)
AAAI 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To verify whether the proposed method can enhance the semantic understanding of sentences, we conduct both intrinsic evaluation that inspects knowledge learned by the pre-trained models themselves and extrinsic evaluation on semantics-oriented downstream tasks with fine-tuning. |
| Researcher Affiliation | Collaboration | Xuancheng Ren,1 Xu Sun,1,2 Houfeng Wang,1 Qun Liu3 1 MOE Key Laboratory of Computational Linguistics, School of EECS, Peking University 2 Center for Data Science, Peking University 3 Huawei Noah's Ark Lab |
| Pseudocode | No | The paper describes the methods in text and mathematical formulas but does not include pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code and the appendix are available at https://github.com/lancopku/sempre |
| Open Datasets | Yes | For general-purpose pre-training, we adopt the pre-trained RoBERTa-base and RoBERTa-large models (Liu et al. 2019)... They are trained on a combined corpus including fictions, encyclopedia, and news, totaling over 160GB text... For semantics-focused pre-training, the models are trained on word-definition pairs... we extract 0.2M word-definitions and 1.4M word-definition pairs in 23 relations from WordNet (Miller 1995). |
| Dataset Splits | Yes | We adopt early stopping based on validation accuracy and report the results of the best-scoring configuration on the validation set. For the testing protocol, we follow Zhou et al. (2020). |
| Hardware Specification | No | The paper mentions using "computation resources" but does not specify any particular hardware components such as CPU or GPU models, or memory details used for the experiments. |
| Software Dependencies | No | The paper mentions "Our implementation is based on the fairseq (Ott et al. 2019) package" but does not provide a specific version number for fairseq or any other software dependencies. |
| Experiment Setup | Yes | We use a batch size of 2048 sequences, a peak learning rate of 2 × 10⁻⁵ with linear warm-up and decay peaked at the 295th update scheduled for at most 6910 updates and keep at most 128 tokens of a sequence. The batch size is 32. Each configuration is run multiple times with different random start. We adopt early stopping based on validation accuracy and report the results of the best-scoring configuration on the validation set. For downstream fine-tuning, following Liu et al. (2019); Bisk et al. (2020), we conduct a grid search with respect to certain hyper-parameters, i.e., the learning rates [1 × 10⁻⁵, 2 × 10⁻⁵, 3 × 10⁻⁵] and the maximum epochs [10, 50]. |
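
The learning-rate schedule and fine-tuning grid quoted in the Experiment Setup row can be made concrete with a short sketch. It is illustrative only and is not taken from the authors' code or from fairseq. The function name `linear_warmup_decay` and the constant names are hypothetical, the schedule is assumed to rise linearly from zero to the peak and fall linearly back to zero, and `[10, 50]` is assumed to mean two candidate values rather than a range.

```python
from itertools import product

# Values quoted in the Experiment Setup row.
PEAK_LR = 2e-5         # peak learning rate
WARMUP_UPDATES = 295   # update at which the rate peaks
TOTAL_UPDATES = 6910   # maximum number of updates


def linear_warmup_decay(step, peak_lr=PEAK_LR,
                        warmup=WARMUP_UPDATES, total=TOTAL_UPDATES):
    """Learning rate at 1-indexed update `step`.

    Assumption: the rate starts at zero, peaks at `warmup`,
    and reaches zero again at `total`.
    """
    if step <= 0 or step >= total:
        return 0.0
    if step <= warmup:
        # Linear warm-up towards the peak.
        return peak_lr * step / warmup
    # Linear decay from the peak down to zero.
    return peak_lr * (total - step) / (total - warmup)


# Fine-tuning grid: three learning rates crossed with two
# maximum-epoch settings (assumed to be discrete candidates).
GRID = list(product([1e-5, 2e-5, 3e-5], [10, 50]))

if __name__ == "__main__":
    for step in (1, 295, 3000, 6910):
        print(step, linear_warmup_decay(step))
    print(len(GRID), "fine-tuning configurations")
```

Under these assumptions the grid contains six configurations, each of which the quoted text says is run multiple times with different random starts.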