Searching for Optimal Subword Tokenization in Cross-domain NER

Authors: Ruotian Ma, Yiding Tan, Xin Zhou, Xuanting Chen, Di Liang, Sirui Wang, Wei Wu, Tao Gui

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show the effectiveness of the proposed method, built on a BERT-tagger, on four benchmark NER datasets.
Researcher Affiliation | Collaboration | (1) School of Computer Science, Fudan University, Shanghai, China; (2) Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China; (3) Meituan Inc., Beijing, China
Pseudocode | No | The paper describes the approach with mathematical formulations and descriptive text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/rtmaww/X-Piece.
Open Datasets | Yes | Experiments are conducted on four commonly used NER datasets from different domains: CoNLL 03 (newswire), Twitter (social media), Webpages (web), and OntoNotes 5.0 (see the loading sketch after this table).
Dataset Splits | No | Table 1 lists 'Train' and 'Test' dataset sizes (e.g., 'CoNLL 03 News 3 14.0k 3.5k'), but there is no explicit mention of validation-set sizes or of how the splits were obtained and used, which hinders reproduction.
Hardware Specification | Yes | Specifically, we test the computation time against different corpus sizes on an Intel Xeon Platinum 8260, 2.40 GHz.
Software Dependencies | No | The paper names Huggingface and MindSpore as implementation frameworks, but does not provide specific version numbers for any software dependencies (see the version-check sketch after this table).
Experiment Setup | No | The paper describes the models and datasets used, but the main text gives no specific experimental-setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or optimizer settings.
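As a quick reproduction aid for the Open Datasets row, the sketch below shows one way to load the CoNLL 2003 benchmark with the Hugging Face `datasets` library, which the paper names as an implementation framework. The dataset identifier "conll2003" and the field names follow the Hugging Face Hub's conventions and are assumptions on our part, not details taken from the paper; the other three corpora (Twitter, Webpages, OntoNotes 5.0) are distributed separately and are not covered here.

```python
# Minimal sketch (not from the paper): load the CoNLL 2003 NER benchmark via the
# Hugging Face `datasets` library. The id "conll2003" and the field names are the
# Hub's conventions; the paper does not specify a loading procedure.
from datasets import load_dataset

conll = load_dataset("conll2003")

# The Hub copy ships train/validation/test splits, which is relevant to the
# "Dataset Splits" row above: the paper reports only train and test sizes.
print({split: len(conll[split]) for split in conll})

example = conll["train"][0]
print(example["tokens"])    # whitespace-tokenized sentence
print(example["ner_tags"])  # integer-encoded NER labels
```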
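Since the Software Dependencies row notes that no framework versions are given, the short sketch below simply records whichever of the relevant frameworks are installed in a reproduction environment. The package list (`transformers`, `datasets`, `torch`, `mindspore`) is an assumption based on the standard distribution names, not something the paper specifies.

```python
# Minimal sketch (assumed standard PyPI package names): record the versions of
# the frameworks the paper mentions but does not pin.
import importlib

for pkg in ("transformers", "datasets", "torch", "mindspore"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}=={mod.__version__}")
    except ImportError:
        print(f"{pkg}: not installed")
```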