DictBERT: Dictionary Description Knowledge Enhanced Language Model Pre-training via Contrastive Learning
Authors: Qianglong Chen, Feng-Lin Li, Guohai Xu, Ming Yan, Ji Zhang, Yin Zhang
IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our approach on a variety of knowledge-driven and language understanding tasks, including NER, relation extraction, CommonsenseQA, OpenBookQA and GLUE. Experimental results demonstrate that our model can significantly improve typical PLMs: it gains a substantial improvement of 0.5%, 2.9%, 9.0%, 7.1% and 3.3% on BERT-large respectively, and is also effective on RoBERTa-large. |
| Researcher Affiliation | Collaboration | Qianglong Chen¹,², Feng-Lin Li², Guohai Xu², Ming Yan², Ji Zhang², Yin Zhang¹. ¹College of Computer Science and Technology, Zhejiang University, China; ²Alibaba Group, China |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement or link indicating that the source code for their method is publicly available. |
| Open Datasets | Yes | To pre-train DictBERT, we use the Cambridge Dictionary (https://dictionary.cambridge.org), which includes 315K entry words, as our pre-training corpus. We use CommonsenseQA [Talmor et al., 2019] and OpenBookQA [Mihaylov et al., 2018] to evaluate the ability of DictBERT acting as KBs and providing implicit knowledge to downstream tasks. We follow existing knowledge enhanced PLMs such as KEPLER and KnowBERT to use GLUE [Wang et al., 2018] to evaluate the general natural language understanding capability of our approach. |
| Dataset Splits | Yes | Table 5: Experimental results on the GLUE development set. The parameter of DictBERT is based on BERT-large. For pre-training, we use the BERT-large-uncased and RoBERTa-large model as backbone and set the learning rate to 1e-5, dropout rate to 0.1, max-length of tokens to 128, batch size to 32, and number of epochs to 10. For fine-tuning, we adopt cross-entropy loss as the loss function, set batch size to 32 and number of epochs to 30. We run 5 times for each task and report their average. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU models, CPU types, memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions using "BERT-large-uncased and RoBERTa-large model as backbone" and "AdamW as the optimizer", but does not provide specific version numbers for these software components or any other libraries/frameworks used. |
| Experiment Setup | Yes | For pre-training, we use the BERT-large-uncased and RoBERTa-large model as backbone and set the learning rate to 1e-5, dropout rate to 0.1, max-length of tokens to 128, batch size to 32, and number of epochs to 10. We use AdamW as the optimizer. For fine-tuning, we adopt cross-entropy loss as the loss function, set batch size to 32 and number of epochs to 30. We run 5 times for each task and report their average. |
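
For quick reference, the hyperparameters quoted in the Experiment Setup row can be written out as a configuration sketch. This is a minimal sketch assuming the Hugging Face Transformers API; the paper releases no code, so the library choice, argument names, and output paths below are assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the reported DictBERT training setup, assuming the
# Hugging Face Transformers library; framework and paths are assumptions.
from transformers import AutoModelForMaskedLM, AutoTokenizer, TrainingArguments

# Backbones named in the paper: bert-large-uncased or roberta-large.
tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-large-uncased")

# Pre-training settings quoted above: lr 1e-5, dropout 0.1 (the BERT default),
# max token length 128 (applied at tokenization time), batch size 32,
# 10 epochs, AdamW optimizer.
pretrain_args = TrainingArguments(
    output_dir="dictbert-pretrain",        # hypothetical path
    learning_rate=1e-5,
    per_device_train_batch_size=32,
    num_train_epochs=10,
    optim="adamw_torch",
)

# Fine-tuning settings quoted above: cross-entropy loss (the default for
# classification heads), batch size 32, 30 epochs, results averaged over 5 runs.
finetune_args = TrainingArguments(
    output_dir="dictbert-finetune",        # hypothetical path
    per_device_train_batch_size=32,
    num_train_epochs=30,
)
```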