Optimizing Bi-Encoder for Named Entity Recognition via Contrastive Learning
Authors: Sheng Zhang, Hao Cheng, Jianfeng Gao, Hoifung Poon
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our method performs well in both supervised and distantly supervised settings, for nested and flat NER alike, establishing new state of the art across standard datasets in the general domain (e.g., ACE2004, ACE2005, CoNLL2003) and high-value verticals such as biomedicine (e.g., GENIA, NCBI, BC5CDR, JNLPBA). |
| Researcher Affiliation | Industry | Sheng Zhang, Hao Cheng, Jianfeng Gao, and Hoifung Poon (Microsoft Research) |
| Pseudocode | Yes | Algorithm 1: Inference for BINDER. (A hedged inference sketch follows this table.) |
| Open Source Code | Yes | We release the code at github.com/microsoft/binder. |
| Open Datasets | Yes | For nested NER, we consider ACE2004, ACE2005, and GENIA (Kim et al., 2003). For flat NER, we consider CoNLL2003 (Tjong Kim Sang & De Meulder, 2003) as well as five biomedical NER datasets from the BLURB benchmark (Gu et al., 2021): BC5-chem/disease (Li et al., 2016), NCBI (Doğan et al., 2014), BC2GM (Smith et al., 2008), and JNLPBA (Collier & Kim, 2004). In the distantly supervised setting, we consider BC5CDR (Li et al., 2016). |
| Dataset Splits | Yes | We follow Luan et al. (2018) to split ACE2004 into 5 folds, and ACE2005 into train, development and test sets. GENIA... follow Finkel & Manning (2009) and Lu & Roth (2015) to split it into 80%/10%/10% train/dev/test splits. We use the standard train, development, and test splits. |
| Hardware Specification | No | The paper mentions running experiments "on GPU" in Appendix A.1, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications. |
| Software Dependencies | No | The paper states: "We implement our models based on the Hugging Face Transformers library (Wolf et al., 2020)" and mentions specific models such as BERT and BioBERT, but it does not provide version numbers for the Transformers library or the other software components it depends on. (A minimal loading sketch follows this table.) |
| Experiment Setup | Yes | The linear layer output size is 128; the width embedding size is 128; the initial temperatures are 0.07. We train our models with the AdamW optimizer (Loshchilov & Hutter, 2017), a linear scheduler, and a dropout of 0.1. The entity start/end/span contrastive loss weights are set to α = 0.2, γ = 0.2, λ = 0.6, and the same loss weights are chosen for thresholding contrastive learning. For base encoders, we train our models for 20 epochs with a learning rate of 3e-5 and a batch size of 8 sequences with a maximum token length of N = 128. For large encoders, we train our models for 40 epochs with a learning rate of 3e-5 and a batch size of 16 sequences with a maximum token length of N = 256. The maximum token length for entity spans is set to 30. We use early stopping with a patience of 10 in the distantly supervised setting. Validation is done every 50 steps of training, and we adopt the models that have the best performance on the development set. (A hedged configuration sketch follows this table.) |
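The Pseudocode row above refers to Algorithm 1 (inference for BINDER). The paper's algorithm is not reproduced here; the block below is only a minimal sketch of the general bi-encoder inference idea, assuming cosine-similarity scoring between candidate-span embeddings and entity-type embeddings against a per-type score threshold. All names (`predict_entities`, `span_embs`, `type_embs`, `thresholds`) and the scoring details are illustrative assumptions, not the released BINDER API.

```python
# Minimal, hypothetical sketch of bi-encoder span scoring for NER inference.
# Assumes span and type embeddings have already been produced by two encoders;
# the scoring and threshold handling here are illustrative, not the paper's Algorithm 1.
import torch
import torch.nn.functional as F


def predict_entities(span_embs: torch.Tensor,   # (num_spans, d) candidate-span representations
                     type_embs: torch.Tensor,   # (num_types, d) entity-type representations
                     thresholds: torch.Tensor,  # (num_types,) per-type score thresholds
                     spans: list,               # candidate (start, end) offsets, len == num_spans
                     temperature: float = 0.07):
    """Return (start, end, type_id, score) tuples whose similarity clears the type threshold."""
    span_embs = F.normalize(span_embs, dim=-1)
    type_embs = F.normalize(type_embs, dim=-1)
    scores = span_embs @ type_embs.T / temperature  # (num_spans, num_types) scaled cosine similarities
    keep = scores > thresholds.unsqueeze(0)         # broadcast the per-type threshold over spans
    predictions = []
    for span_idx, type_idx in keep.nonzero(as_tuple=False).tolist():
        start, end = spans[span_idx]
        predictions.append((start, end, type_idx, scores[span_idx, type_idx].item()))
    return predictions
```

For flat NER, one would additionally resolve overlapping predictions, for example by keeping the highest-scoring span among conflicting candidates; nested NER can keep overlaps as-is.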
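The Software Dependencies row notes that the models are built on the Hugging Face Transformers library without pinned versions. A minimal loading sketch under that assumption is shown below; the checkpoint identifier is a placeholder for illustration, not necessarily the exact encoder used in the paper.

```python
# Hypothetical loading sketch using Hugging Face Transformers (version unpinned in the paper).
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-uncased"  # placeholder; a biomedical encoder would be swapped in for BLURB tasks
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)  # one of the two encoders in the bi-encoder setup
```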
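The Experiment Setup row amounts to a training configuration. The sketch below collects the reported hyperparameters in one place, assuming a PyTorch training loop with an AdamW optimizer and a linear learning-rate schedule; the zero warmup steps, the `size` switch, and the model placeholder are assumptions not stated in the paper.

```python
# Hedged sketch of the reported training configuration; values come from the row above,
# while warmup and other unstated details are assumptions.
import torch
from transformers import get_linear_schedule_with_warmup

CONFIG = {
    "linear_layer_output_size": 128,
    "width_embedding_size": 128,
    "init_temperature": 0.07,
    "dropout": 0.1,
    "loss_weights": {"start": 0.2, "end": 0.2, "span": 0.6},  # alpha, gamma, lambda
    "max_span_token_length": 30,
    "eval_every_steps": 50,
    "early_stop_patience": 10,                                # distantly supervised setting only
    "base":  {"epochs": 20, "lr": 3e-5, "batch_size": 8,  "max_seq_length": 128},
    "large": {"epochs": 40, "lr": 3e-5, "batch_size": 16, "max_seq_length": 256},
}


def build_optimizer_and_scheduler(model: torch.nn.Module, num_training_steps: int, size: str = "base"):
    """AdamW plus a linear decay schedule, mirroring the reported setup (warmup steps assumed to be 0)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=CONFIG[size]["lr"])
    scheduler = get_linear_schedule_with_warmup(
        optimizer, num_warmup_steps=0, num_training_steps=num_training_steps
    )
    return optimizer, scheduler
```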