Searching for Optimal Subword Tokenization in Cross-domain NER

Authors: Ruotian Ma, Yiding Tan, Xin Zhou, Xuanting Chen, Di Liang, Sirui Wang, Wei Wu, Tao Gui

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show the effectiveness of the proposed method, built on a BERT-tagger, on four benchmark NER datasets.
Researcher Affiliation | Collaboration | (1) School of Computer Science, Fudan University, Shanghai, China; (2) Institute of Modern Languages and Linguistics, Fudan University, Shanghai, China; (3) Meituan Inc., Beijing, China
Pseudocode | No | The paper describes the approach with mathematical formulations and descriptive text, but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/rtmaww/X-Piece.
Open Datasets | Yes | Experiments are conducted on four commonly used NER datasets from different domains: CoNLL 03 (newswire), Twitter (social media), Webpages (web), and OntoNotes 5.0 (see the loading sketch after this table).
Dataset Splits | No | Table 1 lists 'Train' and 'Test' dataset sizes (e.g., 'CoNLL 03 News 3 14.0k 3.5k'), but there is no explicit mention of validation-set sizes or of how the splits were obtained and used, which hinders reproduction.
Hardware Specification | Yes | Specifically, we test the computation time against different corpus sizes on an Intel Xeon Platinum 8260, 2.40 GHz.
Software Dependencies | No | The paper names Huggingface and MindSpore as implementation frameworks, but does not provide specific version numbers for any software dependencies (see the version-check sketch after this table).
Experiment Setup | No | The paper describes the models and datasets used, but the main text gives no specific experimental-setup details such as hyperparameters (e.g., learning rate, batch size, number of epochs) or optimizer settings.
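As a quick reproduction aid for the Open Datasets row, the sketch below shows one way to load the CoNLL 2003 benchmark with the Hugging Face `datasets` library, which the paper names as an implementation framework. The dataset identifier "conll2003" and the field names follow the Hugging Face Hub's conventions and are assumptions on our part, not details taken from the paper; the other three corpora (Twitter, Webpages, OntoNotes 5.0) are distributed separately and are not covered here.

```python
# Minimal sketch (not from the paper): load the CoNLL 2003 NER benchmark via the
# Hugging Face `datasets` library. The id "conll2003" and the field names are the
# Hub's conventions; the paper does not specify a loading procedure.
from datasets import load_dataset

conll = load_dataset("conll2003")

# The Hub copy ships train/validation/test splits, which is relevant to the
# "Dataset Splits" row above: the paper reports only train and test sizes.
print({split: len(conll[split]) for split in conll})

example = conll["train"][0]
print(example["tokens"])    # whitespace-tokenized sentence
print(example["ner_tags"])  # integer-encoded NER labels
```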
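Since the Software Dependencies row notes that no framework versions are given, the short sketch below simply records whichever of the relevant frameworks are installed in a reproduction environment. The package list (`transformers`, `datasets`, `torch`, `mindspore`) is an assumption based on the standard distribution names, not something the paper specifies.

```python
# Minimal sketch (assumed standard PyPI package names): record the versions of
# the frameworks the paper mentions but does not pin.
import importlib

for pkg in ("transformers", "datasets", "torch", "mindspore"):
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}=={mod.__version__}")
    except ImportError:
        print(f"{pkg}: not installed")
```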