Improving Biomedical Entity Linking with Cross-Entity Interaction
Authors: Zhenran Xu, Yulin Chen, Baotian Hu
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We perform experiments on 3 biomedical EL datasets: NCBI disease (Doğan, Leaman, and Lu 2014), BC5CDR (Li et al. 2016) and COMETA (Basaldella et al. 2020). We find that, without pretraining, our model achieves the best results on BC5CDR and COMETA compared with all previous results with finetuning only, showing the effectiveness of cross-entity interaction. With pretraining, our model sets new state-of-the-art results on all the datasets above. |
| Researcher Affiliation | Academia | Zhenran Xu, Yulin Chen, Baotian Hu* Harbin Institute of Technology (Shenzhen), Shenzhen, China xuzhenran@stu.hit.edu.cn, 200110528@stu.hit.edu.cn, hubaotian@hit.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks clearly labeled as such. |
| Open Source Code | Yes | Code is available at https://github.com/HITsz-TMG/Prompt-BioEL. |
| Open Datasets | Yes | We experiment across three datasets: NCBI disease (Doğan, Leaman, and Lu 2014), BC5CDR (Li et al. 2016) and COMETA (Basaldella et al. 2020). |
| Dataset Splits | Yes | Table 1 shows the basic statistics of datasets and the number of entities and synonyms in their corresponding KBs. NCBI disease: Train samples 5,784, Dev samples 787, Test samples 960. BC5CDR: Train samples 9,285, Dev samples 9,515, Test samples 9,654. COMETA: Train samples 13,489, Dev samples 2,176, Test samples 4,350. |
| Hardware Specification | Yes | The models are trained on a single NVIDIA A100 GPU. |
| Software Dependencies | Yes | Our retriever and re-ranker are implemented with PyTorch 1.10.0 (Paszke et al. 2019). |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma and Ba 2015) with weight decay set to 0.01 for all experiments. For retriever finetuning, we set the learning rate to 2e-6 and batch size to 2 for all datasets. The number of negatives in N(e) in Equation 4 is set to 15. For the mention encoder, the maximum length of the input is 256 tokens. For the entity encoder, the maximum length of the input is 128 tokens for BC5CDR, and 64 for NCBI disease and COMETA. For all datasets, we finetune for a total of 30 epochs and choose the best checkpoint based on the development set. For re-ranker pretraining... We set learning rate to 5e-6, batch size to 64, maximum input length to 256 tokens, and the number of pretraining epochs to 15 for MeSH and 10 for SNOMED-CT. For re-ranker finetuning, we set batch size to 16, maximum mention context length to 256 tokens, maximum candidate entity length to 32 tokens. We search learning rate among [5e-6, 1e-5, 5e-5] based on the development set. The best-performing learning rate is 5e-5 for all datasets. We finetune for a total of 40 epochs, and choose the best checkpoint based on the development set. |
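
To make the quoted hyperparameters easier to cross-check against the released code, a minimal configuration sketch is given below, assuming the PyTorch setup reported in the paper. The function and dictionary names are illustrative, not from the authors' repository; only the numeric values (optimizer settings, learning rates, batch sizes, input lengths, epoch counts) are taken from the quoted experiment setup.

```python
# Minimal sketch of the reported training configuration (not the authors' code).
# Only the hyperparameter values come from the paper; all names are illustrative.
import torch


def build_retriever_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Paper: Adam (Kingma and Ba 2015) with weight decay 0.01 and lr 2e-6 for
    # retriever finetuning. Whether decoupled decay (AdamW) was used is not
    # stated, so plain Adam with an L2 penalty is assumed here.
    return torch.optim.Adam(model.parameters(), lr=2e-6, weight_decay=0.01)


RETRIEVER_FINETUNE = {
    "batch_size": 2,              # all datasets
    "num_negatives": 15,          # |N(e)| in Equation 4
    "max_mention_tokens": 256,    # mention encoder input
    "max_entity_tokens": 128,     # 128 for BC5CDR; 64 for NCBI disease and COMETA
    "epochs": 30,                 # best checkpoint chosen on the dev set
}

RERANKER_PRETRAIN = {
    "learning_rate": 5e-6,
    "batch_size": 64,
    "max_input_tokens": 256,
    "epochs": {"MeSH": 15, "SNOMED-CT": 10},
}

RERANKER_FINETUNE = {
    "learning_rate": 5e-5,        # searched over [5e-6, 1e-5, 5e-5] on the dev set
    "batch_size": 16,
    "max_context_tokens": 256,    # mention context
    "max_candidate_tokens": 32,   # candidate entity
    "epochs": 40,
}
```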