Continual Learning for Named Entity Recognition

Authors: Natawut Monaikul, Giuseppe Castellucci, Simone Filice, Oleg Rokhlenko (pp. 13570-13577)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that this approach allows the student model to progressively learn to identify new entity types without forgetting the previously learned ones. We also present a comparison with multiple strong baselines to demonstrate that our approach is superior for continually updating an NER model.
Researcher Affiliation | Collaboration | Natawut Monaikul,1 Giuseppe Castellucci,2 Simone Filice,2 Oleg Rokhlenko2 1University of Illinois at Chicago, Chicago, IL, USA 2Amazon, Seattle, WA, USA
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper does not include an explicit statement about the release of its own source code or a link to a code repository for the described methodology.
Open Datasets | Yes | To evaluate our approach, we used two well-known NER datasets: CoNLL-03 English NER (Tjong Kim Sang and De Meulder 2003) and OntoNotes (Hovy et al. 2006).
Dataset Splits | Yes | We divided the official training and validation sets of CoNLL-03 and OntoNotes into four and six disjoint subsets, D1, D2, . . ., respectively: each Di is annotated only for the entity type ei. We first train an initial model M1 on D1 for e1. This model becomes the teacher for e1 with which we train a student model M2 on the second slice D2, which is labeled for e2 only: M2 thus learns to tag both e1 and e2. We repeat this process for each slice Di, i.e., training a new student on a new slice using the previously trained model as the teacher for the previously learned labels. At each step i, we use the i-th slice of the validation set for early stopping and evaluate the resulting model Mi on the official test set annotated for the entity types {e1, ..., ei}.
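The slicing-and-distillation procedure described above can be sketched as follows. This is a pure-Python skeleton, not the authors' code (which is not released): `keep_only`, `train_student`, and `continual_ner` are hypothetical names, and the actual distillation step is replaced by a stand-in that only tracks the growing label set.

```python
# Sketch of the continual learning loop: disjoint slices D1, D2, ...,
# each annotated for a single entity type, and a teacher -> student chain.

def keep_only(sentence_tags, entity_type):
    """Build slice Di: keep annotations for entity_type, map the rest to 'O'."""
    return [t if t.endswith(entity_type) else "O" for t in sentence_tags]

def train_student(teacher_types, new_type):
    """Stand-in for one distillation step: the student inherits the
    teacher's label set (via the teacher's soft targets) and learns the
    new type from the gold labels of the current slice."""
    return teacher_types + [new_type]

def continual_ner(entity_types):
    """Train M1 on D1, then distil each Mi into M(i+1) on the next slice."""
    model_types = [entity_types[0]]           # M1 learns e1 from D1
    history = [list(model_types)]
    for e_i in entity_types[1:]:              # slice Di is labeled for ei only
        model_types = train_student(model_types, e_i)
        history.append(list(model_types))
    return history

# CoNLL-03 has four entity types, so four slices D1..D4:
print(keep_only(["B-PER", "O", "B-ORG"], "PER"))   # -> ['B-PER', 'O', 'O']
print(continual_ner(["PER", "ORG", "LOC", "MISC"])[-1])
```

After the final step, the model tags all four CoNLL-03 types even though each training slice exposed gold labels for only one of them.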
Hardware Specification | Yes | Training was performed on a single Nvidia V100 GPU.
Software Dependencies | No | The paper mentions 'PyTorch (Paszke et al. 2017)' and the 'BERT Huggingface implementation (Wolf et al. 2019)' but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | After initial experimentation with different hyperparameters, we chose to train the models with a batch size of 32, a max sentence length of 50 tokens, and a learning rate of 5e-5 for 20 epochs with early stopping (patience = 3). For all student models, a temperature Tm = 2 was used, and α = β = 1 for the weighted sum of the losses.
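The quoted loss configuration (temperature Tm = 2, α = β = 1) describes a weighted sum of a hard-label term and a temperature-softened distillation term. Below is a minimal pure-Python sketch assuming the standard knowledge-distillation combination α·CE + β·KL per token; the function names are illustrative and the exact per-term formulation is an assumption, not the paper's code.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one token's label logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                      # subtract max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, gold_index,
                      temperature=2.0, alpha=1.0, beta=1.0):
    """alpha * CE(student, gold) + beta * KL(teacher_T || student_T).

    CE uses the gold label of the new entity type; KL compares the
    teacher's and student's temperature-softened distributions for the
    previously learned labels.
    """
    ce = -math.log(softmax(student_logits)[gold_index])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student))
    return alpha * ce + beta * kl

# One token, three labels; with the paper's settings Tm=2, alpha=beta=1:
loss = distillation_loss([2.0, 0.5, -1.0], [1.8, 0.7, -0.9], gold_index=0)
```

When the student's logits match the teacher's, the KL term vanishes and only the cross-entropy on the new type's gold labels remains, which is why the teacher's soft targets preserve the old labels without penalizing the student for learning the new one.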