Optimizing Bi-Encoder for Named Entity Recognition via Contrastive Learning

Authors: Sheng Zhang, Hao Cheng, Jianfeng Gao, Hoifung Poon

ICLR 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experiments show that our method performs well in both supervised and distantly supervised settings, for nested and flat NER alike, establishing new state of the art across standard datasets in the general domain (e.g., ACE2004, ACE2005, CoNLL2003) and high-value verticals such as biomedicine (e.g., GENIA, NCBI, BC5CDR, JNLPBA).
Researcher Affiliation | Industry | Sheng Zhang, Hao Cheng, Jianfeng Gao, and Hoifung Poon, Microsoft Research
Pseudocode | Yes | Algorithm 1: Inference for BINDER. (An illustrative bi-encoder scoring sketch follows after the table.)
Open Source Code | Yes | We release the code at github.com/microsoft/binder.
Open Datasets | Yes | For nested NER, we consider ACE2004, ACE2005, and GENIA (Kim et al., 2003). For flat NER, we consider CoNLL2003 (Tjong Kim Sang & De Meulder, 2003) as well as five biomedical NER datasets from the BLURB benchmark (Gu et al., 2021): BC5-chem/disease (Li et al., 2016), NCBI (Doğan et al., 2014), BC2GM (Smith et al., 2008), and JNLPBA (Collier & Kim, 2004). In the distantly supervised setting, we consider BC5CDR (Li et al., 2016).
Dataset Splits | Yes | We follow Luan et al. (2018) to split ACE2004 into 5 folds, and ACE2005 into train, development and test sets. GENIA... follow Finkel & Manning (2009) and Lu & Roth (2015) to split it into 80%/10%/10% train/dev/test splits. We use the standard train, development, and test splits. (A minimal 80/10/10 split sketch follows after the table.)
Hardware Specification | No | The paper mentions running experiments "on GPU" in Appendix A.1, but does not provide specific hardware details such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper states: "We implement our models based on the Hugging Face Transformers library (Wolf et al., 2020)" and mentions specific models like BERT and BioBERT, but it does not provide version numbers for the Transformers library or other software components/dependencies used. (A hedged encoder-loading sketch follows after the table.)
Experiment Setup | Yes | The linear layer output size is 128; the width embedding size is 128; the initial temperatures are 0.07. We train our models with the AdamW optimizer (Loshchilov & Hutter, 2017), a linear scheduler, and dropout of 0.1. The entity start/end/span contrastive loss weights are set to α = 0.2, γ = 0.2, λ = 0.6, and the same loss weights are chosen for thresholding contrastive learning. For base encoders, we train our models for 20 epochs with a learning rate of 3e-5 and a batch size of 8 sequences with a maximum token length of N = 128. For large encoders, we train our models for 40 epochs with a learning rate of 3e-5 and a batch size of 16 sequences with a maximum token length of N = 256. The maximum token length for entity spans is set to 30. We use early stopping with a patience of 10 in the distantly supervised setting. Validation is done every 50 training steps, and we adopt the models that perform best on the development set. (A configuration sketch follows after the table.)
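
As a rough companion to the Pseudocode row (Algorithm 1: Inference for BINDER), the sketch below shows generic bi-encoder span scoring against entity-type description embeddings. The helper name `score_spans`, the fixed `sim_threshold`, and the input shapes are assumptions for illustration; the paper's inference relies on its own learned thresholding, so treat this as a sketch of the general bi-encoder technique, not the released algorithm.

```python
import torch
import torch.nn.functional as F

def score_spans(type_embs, span_embs, spans, sim_threshold=0.5):
    """Generic bi-encoder span classification sketch (not the paper's exact Algorithm 1).

    type_embs: (T, d) embeddings of entity-type descriptions
    span_embs: (S, d) embeddings of candidate text spans
    spans:     list of (start, end) token offsets aligned with span_embs
    """
    # Cosine similarity between every candidate span and every entity type.
    sims = F.normalize(span_embs, dim=-1) @ F.normalize(type_embs, dim=-1).T  # (S, T)
    best_sims, best_types = sims.max(dim=-1)  # most similar type per span
    predictions = []
    for (start, end), sim, t in zip(spans, best_sims.tolist(), best_types.tolist()):
        # A fixed threshold stands in here for the paper's thresholding contrastive learning.
        if sim >= sim_threshold:
            predictions.append({"start": start, "end": end, "type": t, "score": sim})
    return predictions
```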
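
For the Dataset Splits row, the GENIA partition is a conventional 80%/10%/10% split. A minimal sketch, assuming a generic document list and a fixed seed (both hypothetical; the paper reuses the splits of Finkel & Manning (2009) and Lu & Roth (2015) rather than re-splitting at random):

```python
import random

def split_80_10_10(documents, seed=42):
    """Split a document list into 80%/10%/10% train/dev/test portions (illustrative only)."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)
    n_train = int(0.8 * len(docs))
    n_dev = int(0.1 * len(docs))
    train = docs[:n_train]
    dev = docs[n_train:n_train + n_dev]
    test = docs[n_train + n_dev:]
    return train, dev, test
```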
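
For the Software Dependencies row, the paper builds on the Hugging Face Transformers library with BERT/BioBERT encoders but pins no versions. A minimal loading sketch is shown below; the checkpoint names are common public ones chosen for illustration and are not confirmed to be the paper's exact choices.

```python
from transformers import AutoTokenizer, AutoModel

# Checkpoint names are assumptions for illustration; the paper does not pin versions.
GENERAL_ENCODER = "bert-base-uncased"
BIOMED_ENCODER = "dmis-lab/biobert-base-cased-v1.1"

tokenizer = AutoTokenizer.from_pretrained(BIOMED_ENCODER)
encoder = AutoModel.from_pretrained(BIOMED_ENCODER)

inputs = tokenizer("Mutations in BRCA1 are linked to breast cancer.",
                   return_tensors="pt", truncation=True, max_length=128)
outputs = encoder(**inputs)
token_embeddings = outputs.last_hidden_state  # (1, seq_len, hidden_size)
```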
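
For the Experiment Setup row, the reported hyperparameters can be grouped into a single configuration. The sketch below wires AdamW to a linear schedule and combines the start/end/span contrastive losses with the stated weights; the helper names (`total_loss`, `build_optimizer`), the warmup fraction, and the loss inputs are assumptions, so this is a sketch under those assumptions rather than the released implementation.

```python
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

# Hyperparameters reported for base encoders (large encoders: 40 epochs,
# batch size 16, max token length N = 256).
config = {
    "learning_rate": 3e-5,
    "epochs": 20,
    "batch_size": 8,
    "max_seq_length": 128,
    "max_span_length": 30,
    "dropout": 0.1,
    "temperature_init": 0.07,
    "proj_size": 128,       # linear layer output size
    "width_emb_size": 128,  # span-width embedding size
    "alpha": 0.2,           # start contrastive loss weight
    "gamma": 0.2,           # end contrastive loss weight
    "lambda": 0.6,          # span contrastive loss weight
}

def total_loss(start_loss, end_loss, span_loss, cfg=config):
    # Weighted combination of the start/end/span contrastive losses.
    return cfg["alpha"] * start_loss + cfg["gamma"] * end_loss + cfg["lambda"] * span_loss

def build_optimizer(model, num_training_steps, cfg=config):
    optimizer = AdamW(model.parameters(), lr=cfg["learning_rate"])
    # Warmup fraction is an assumption; the paper only states a linear scheduler.
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler
```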