Chinese Spelling Correction as Rephrasing Language Model

Authors: Linfeng Liu, Hongqiu Wu, Hai Zhao

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our following experiments show that sequence tagging does not make good use of the benefits from pre-training. The process of rephrasing can be modeled on an auto-regressive architecture with a decoder that generates the output characters one by one, e.g. GPT (Brown et al. 2020). Specifically, we concatenate the source characters X and the target characters Y into one input sentence, i.e. {x1, x2, ..., xn, ⟨s⟩, y1, y2, ..., yn, ⟨eos⟩}, where ⟨s⟩ and ⟨eos⟩ refer to the separator token and the wrap token, and train the model to predict all the target characters yi autoregressively. Hence, rephrasing-based spelling correction seeks to solve the following probability for yi, i ≥ 1: P(yi | X) = P(yi | X, y1, y2, ..., yi−1) (Eq. 2). Rephrasing Language Model: based on the BERT architecture, we propose the Rephrasing Language Model (ReLM), a non-auto-regressive rephrasing model. Experiment: in this section, we compare ReLM with a line of tagging-based methods on existing benchmarks. We also evaluate the CSC performance in multi-task learning, where all the models are jointly trained on three different tasks: CSC, semantic similarity, and news classification. Dataset: ECSpell (Lv et al. 2022) is a CSC benchmark with three domains, LAW (1,960 training and 500 test samples), MED (3,000 training and 500 test samples), and ODW (1,728 training and 500 test samples). LEMON, a large-scale multi-domain dataset with natural spelling errors (Wu et al. 2023c), is a new CSC benchmark with diverse real-life spelling errors; it spans 7 domains with 22,252 test samples in total and typically measures the open-domain generalizability of a CSC model in a zero-shot setting. (A minimal sketch of this rephrasing formulation is given after the table.)
Researcher Affiliation | Academia | Linfeng Liu*, Hongqiu Wu*, Hai Zhao; Department of Computer Science and Engineering, Shanghai Jiao Tong University; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China; {linfengliu, wuhongqiu}@sjtu.edu.cn, zhaohai@cs.sjtu.edu.cn
Pseudocode | No | The paper describes the ReLM model and its training process textually and with a diagram (Figure 2), but it does not include formal pseudocode or an algorithm block.
Open Source Code | Yes | https://github.com/Claude-Liu/ReLM and https://github.com/gingasan/lemon
Open Datasets | Yes | ECSpell (Lv et al. 2022) is a CSC benchmark with three domains: LAW (1,960 training and 500 test samples), MED (3,000 training and 500 test samples), and ODW (1,728 training and 500 test samples). LEMON, a large-scale multi-domain dataset with natural spelling errors (Wu et al. 2023c), is a new CSC benchmark with diverse real-life spelling errors; it spans 7 domains with 22,252 test samples in total and typically measures the open-domain generalizability of a CSC model in a zero-shot setting. SIGHAN (Tseng et al. 2015) is a CSC benchmark collected from Chinese essays written by foreign speakers. AFQMC, Ant Financial Question Matching (Xu et al. 2020), is a Chinese semantic-similarity dataset that requires the model to predict whether two given questions are semantically similar; it contains 34,334 training samples and 3,861 test samples. TNEWS, TouTiao Text Classification for News Titles (Xu et al. 2020), is a text-classification dataset that maps each given news title to one of 15 categories; it contains 53,360 training samples and 10,000 test samples. (These counts are collected into a summary mapping after the table.)
Dataset Splits | No | The paper provides training and test sample counts for datasets such as ECSpell (e.g., "LAW (1,960 training and 500 test samples)") and AFQMC ("34,334 training samples and 3,861 test samples"), and it describes the fine-tuning procedure, but it does not explicitly report a separate validation split (with sizes) or the methodology used to construct one.
Hardware Specification | Yes | We train the model with batch size 4096 and learning rate 5e-5 on 8 A800 GPUs for 60,000 steps.
Software Dependencies | No | The paper mentions software such as BERT-based models, GPT, Baichuan2, and ChatGPT, and notes the use of LoRA (Hu et al. 2022). However, it does not provide version numbers for any of these software components (e.g., the PyTorch version or the specific BERT library version).
Experiment Setup | Yes | We fine-tune each model separately on the three domains for 5,000 steps, with the batch size selected from {32, 128} and the learning rate from {2e-5, 5e-5}. We train the model with batch size 4096 and learning rate 5e-5 on 8 A800 GPUs for 60,000 steps. (These hyperparameters are restated as a configuration sketch after the table.)
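
For the "Research Type" row: a minimal Python sketch of the rephrasing formulation quoted there. This is not the authors' code; the token strings, function names, and the dummy scorer are assumptions, and only the input layout {x1, ..., xn, ⟨s⟩, y1, ..., yn, ⟨eos⟩} and the factorization in Eq. 2 are taken from the quoted text.

```python
# Illustrative sketch of the rephrasing formulation (not the authors' code).
# Source X and target Y are concatenated into one sequence with a separator
# and a wrap token, and the target characters are scored autoregressively:
# P(y_i | X) = P(y_i | X, y_1, ..., y_{i-1})  (Eq. 2).
from typing import Callable, List

SEP, EOS = "<s>", "<eos>"  # assumed surface forms of the separator / wrap token


def build_rephrasing_input(source: List[str], target: List[str]) -> List[str]:
    """{x_1, ..., x_n, <s>, y_1, ..., y_n, <eos>} as described in the quote."""
    return source + [SEP] + target + [EOS]


def autoregressive_log_prob(
    source: List[str],
    target: List[str],
    token_log_prob: Callable[[List[str], str], float],
) -> float:
    """Sum of log P(y_i | X, y_1, ..., y_{i-1}).

    `token_log_prob` stands in for any decoder (e.g. a GPT-style model)
    that scores the next character given the current prefix.
    """
    prefix = source + [SEP]
    total = 0.0
    for y in target:
        total += token_log_prob(prefix, y)
        prefix.append(y)
    return total


if __name__ == "__main__":
    import math

    # Dummy uniform scorer over a pretend 5,000-character vocabulary.
    uniform = lambda prefix, tok: math.log(1.0 / 5000)
    x = list("我喜欢吃平果")  # source sentence containing a spelling error
    y = list("我喜欢吃苹果")  # corrected target sentence
    print(build_rephrasing_input(x, y))
    print(autoregressive_log_prob(x, y, uniform))
```

ReLM itself is non-auto-regressive: with the same input layout, the BERT-style encoder would predict the target positions in parallel rather than one by one (our reading of the quoted description; the masking details are not stated in this row).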
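
The sample counts quoted in the "Open Datasets" row, gathered into one summary mapping. The dict layout is ours; the numbers come directly from the quoted text, and entries not quantified there are left as None.

```python
# Sample counts as quoted in the "Open Datasets" row; the structure is ours.
DATASETS = {
    "ECSpell/LAW": {"train": 1_960, "test": 500},
    "ECSpell/MED": {"train": 3_000, "test": 500},
    "ECSpell/ODW": {"train": 1_728, "test": 500},
    "LEMON":       {"train": None,  "test": 22_252},  # 7 domains, evaluated zero-shot
    "SIGHAN":      {"train": None,  "test": None},    # counts not quoted in this row
    "AFQMC":       {"train": 34_334, "test": 3_861},
    "TNEWS":       {"train": 53_360, "test": 10_000},
}
```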
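
The hyperparameters quoted in the "Hardware Specification" and "Experiment Setup" rows, restated as a configuration sketch. The key names and the grid helper are assumptions; only the numeric values come from the quoted text.

```python
# Hyperparameters as quoted from the paper; the layout and names are ours.
from itertools import product

PRETRAINING = {        # rephrasing training run reported in the paper
    "batch_size": 4096,
    "learning_rate": 5e-5,
    "steps": 60_000,
    "hardware": "8x NVIDIA A800",
}

FINETUNE_GRID = {      # per-domain ECSpell fine-tuning, 5,000 steps each
    "batch_size": [32, 128],
    "learning_rate": [2e-5, 5e-5],
    "steps": [5_000],
}


def grid(space):
    """Yield every hyperparameter combination in the search grid."""
    keys = list(space)
    for values in product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    for cfg in grid(FINETUNE_GRID):
        print(cfg)  # four combinations: {32, 128} x {2e-5, 5e-5}
```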