EASAL: Entity-Aware Subsequence-Based Active Learning for Named Entity Recognition
Authors: Yang Liu, Jinpeng Hu, Zhihong Chen, Xiang Wan, Tsung-Hui Chang
AAAI 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on both news and biomedical datasets demonstrate the effectiveness of our proposed method. The code is released at https://github.com/lylylylylyly/EASAL. ... We experiment on three biomedical NER datasets and one news NER dataset that are widely used in previous studies, including NCBI-disease (Doğan, Leaman, and Lu 2014), BC5CDR-disease (Li et al. 2016), BC5CDR-chemical (Li et al. 2016), and CoNLL2003 (Sang and De Meulder 2003). |
| Researcher Affiliation | Collaboration | 1) Shenzhen Research Institute of Big Data; 2) Chinese University of Hong Kong, Shenzhen, China; 3) Pazhou Lab, Guangzhou, 510330, China. {yangliu5, jinpenghu, zhihongchen}@link.cuhk.edu.cn, wanxiang@sribd.cn, changtsunghui@cuhk.edu.cn |
| Pseudocode | Yes | Algorithm 1: Entity-Aware Subsequence-Based AL |
| Open Source Code | Yes | The code is released at https://github.com/lylylylylyly/EASAL. |
| Open Datasets | Yes | We experiment on three biomedical NER datasets and one news NER dataset that are widely used in previous studies, including NCBI-disease (Doğan, Leaman, and Lu 2014), BC5CDR-disease (Li et al. 2016), BC5CDR-chemical (Li et al. 2016), and CoNLL2003 (Sang and De Meulder 2003). |
| Dataset Splits | Yes | The initial data split used for training the model M is set to 1% of randomly sampled data, following the splitting techniques used in the existing literature on AL (Shen et al. 2017; Hazra et al. 2021). At each query round, the model selects 1500 tokens, 2250 tokens, and 2000 tokens respectively for NCBI-disease, BC5CDR (disease and chemical), and CoNLL2003, which is about 1% of the total number of tokens in each dataset. ... Note that we merge the training set and validation set together for data pool generation and training, following previous studies (Lee et al. 2020). We do not set aside a separate validation set; we use the training loss at each query round to select the best model for testing. |
| Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models or CPU types) used to run the experiments; it only describes training configurations, such as using the whole training set for training or evaluating on CoNLL2003, without naming any computing hardware. |
| Software Dependencies | No | The paper mentions using BERT and BioBERT but does not specify their versions or any other software dependencies with version numbers. |
| Experiment Setup | Yes | The initial data split used for training the model M is set to 1% of randomly sampled data...At each query round, the model selects 1500 tokens, 2250 tokens, and 2000 tokens respectively for NCBI-disease, BC5CDR (disease and chemical), and CoNLL2003...We set the initial training epoch count to 10, using a randomly sampled 1% of the data, and every subsequent round trains for 3 epochs. The number of query rounds before training stops is uniformly set to 30...The whole dataset is used for training with 30 epochs and a learning rate of 5e-5. |
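
To make the quoted schedule concrete, below is a minimal sketch of the annotation budget and training schedule described in the Dataset Splits and Experiment Setup rows. The class and field names are illustrative assumptions and do not come from the released EASAL code; the numbers are the ones quoted above. Since each round queries roughly 1% of a dataset's tokens, the 30 rounds together consume roughly 30% of each pool.

```python
"""Sketch of the active-learning annotation/training schedule quoted above.
All identifiers are illustrative assumptions, not the authors' implementation."""
from dataclasses import dataclass


@dataclass
class ALSchedule:
    dataset: str
    tokens_per_round: int      # ~1% of the token pool, queried each round
    initial_epochs: int = 10   # epochs on the random 1% seed set
    round_epochs: int = 3      # epochs after every subsequent query round
    query_rounds: int = 30     # querying stops after 30 rounds
    learning_rate: float = 5e-5


SCHEDULES = [
    ALSchedule("NCBI-disease", tokens_per_round=1500),
    ALSchedule("BC5CDR (disease/chemical)", tokens_per_round=2250),
    ALSchedule("CoNLL2003", tokens_per_round=2000),
]

for s in SCHEDULES:
    # Total annotation budget after all query rounds are finished.
    total_queried = s.query_rounds * s.tokens_per_round
    # Total fine-tuning epochs across the seed stage and all rounds.
    total_epochs = s.initial_epochs + s.query_rounds * s.round_epochs
    print(f"{s.dataset}: {s.tokens_per_round} tokens/round x {s.query_rounds} rounds "
          f"= {total_queried} queried tokens; {total_epochs} total epochs; "
          f"lr={s.learning_rate}")
```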