Multi-Task Self-Supervised Learning for Disfluency Detection

Authors: Shaolei Wang, Wanxiang Che, Qi Liu, Pengda Qin, Ting Liu, William Yang Wang (pp. 9193-9200)

AAAI 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on the commonly used English Switchboard test set show that our approach can achieve competitive performance compared to the previous systems (trained using the full dataset) by using less than 1% (1000 sentences) of the training data. Our method trained on the full dataset significantly outperforms previous methods, reducing the error by 21% on English Switchboard.
Researcher Affiliation | Academia | Shaolei Wang (1), Wanxiang Che (1), Qi Liu (2), Pengda Qin (3), Ting Liu (1), William Yang Wang (4); (1) Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China; (2) University of Oxford; (3) Beijing University of Posts and Telecommunications, China; (4) University of California, Santa Barbara, CA, USA
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'We directly use the code released by Wang et al. (2017).' and provides a link: 'https://github.com/hitwsl/transition_disfluency'. This refers to the code for a baseline system, not the open-source code for the methodology proposed in this paper.
Open Datasets | Yes | English Switchboard (SWBD) (Godfrey, Holliman, and McDaniel 1992) is the standard and largest (1.73 × 10^5 sentences for training) corpus used for disfluency detection. We use English Switchboard as the main data.
Dataset Splits | Yes | Following the experiment settings in Charniak and Johnson (2001), we split the Switchboard corpus into train, dev, and test sets as follows: the train data consists of all sw[23]*.dff files, the dev data consists of all sw4[5-9]*.dff files, and the test data consists of all sw4[0-1]*.dff files. A minimal sketch of this split appears after the table.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., CPU/GPU models, memory, or cloud instance types) used for running the experiments. It only notes in passing that the number of model parameters was 'limited by devices' in a comparison with BERT.
Software Dependencies | No | The paper mentions using a 'transformer architecture', 'GELU activations (Hendrycks and Gimpel 2016)', and the 'Adam optimizer'. However, it does not specify version numbers for any software libraries, frameworks (e.g., PyTorch, TensorFlow), or programming languages used.
Experiment Setup | Yes | In all experiments, including the transformer-based baseline and our self-supervised method, we use a transformer architecture with 512 hidden units, 8 heads, 6 hidden layers, GELU activations (Hendrycks and Gimpel 2016), and a dropout of 0.1. We train our models with the Adam optimizer. For the joint tagging and sentence classification objectives, we use streams of 128 tokens and mini-batches of size 256, with a learning rate of 1e-4 and 30 training epochs. When fine-tuning on gold disfluency detection data, most model hyperparameters are the same as in pre-training, with the exception of the batch size, learning rate, and number of training epochs: we use a batch size of 32, a learning rate of 1e-5, and 20 epochs. A minimal configuration sketch of these settings appears after the table.
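
The Switchboard split described in the Dataset Splits row maps directly onto file-name patterns. Below is a minimal Python sketch, assuming the .dff transcripts sit in a single directory; the SWBD_DIR path is hypothetical, and the authors' actual preprocessing pipeline is not described in the paper.

```python
# Hypothetical sketch of the Charniak & Johnson (2001) Switchboard split.
# SWBD_DIR is a placeholder; point it at wherever the .dff transcripts live.
import glob
import os

SWBD_DIR = "swbd/dff"

train_files = sorted(glob.glob(os.path.join(SWBD_DIR, "sw[23]*.dff")))    # sw2*/sw3* -> train
dev_files   = sorted(glob.glob(os.path.join(SWBD_DIR, "sw4[5-9]*.dff")))  # sw45-sw49 -> dev
test_files  = sorted(glob.glob(os.path.join(SWBD_DIR, "sw4[0-1]*.dff")))  # sw40-sw41 -> test

print(len(train_files), len(dev_files), len(test_files))
```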
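
The hyperparameters quoted in the Experiment Setup row correspond to a standard Transformer encoder. The PyTorch sketch below mirrors the reported settings (512 hidden units, 8 heads, 6 layers, GELU, dropout 0.1, Adam with lr 1e-4 for pre-training and 1e-5 for fine-tuning); the vocabulary size, feed-forward width, output head sizes, and positional encoding are not reported in the paper and are assumptions here, so this is illustrative rather than a reconstruction of the authors' implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE = 30000   # assumption: vocabulary size is not reported
D_MODEL = 512        # "512 hidden units"
N_HEADS = 8          # "8 heads"
N_LAYERS = 6         # "6 hidden layers"
FFN_DIM = 2048       # assumption: feed-forward width is not reported
MAX_LEN = 128        # "streams of 128 tokens"


class DisfluencyEncoder(nn.Module):
    """Transformer encoder with a token-tagging head and a sentence-level head."""

    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=N_HEADS, dim_feedforward=FFN_DIM,
            dropout=0.1, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=N_LAYERS)
        self.tagger = nn.Linear(D_MODEL, 2)      # per-token disfluency tag
        self.classifier = nn.Linear(D_MODEL, 2)  # sentence-level label

    def forward(self, token_ids):
        hidden = self.encoder(self.embed(token_ids))
        # Sentence label read off the first position (assumption: a [CLS]-style token).
        return self.tagger(hidden), self.classifier(hidden[:, 0])


model = DisfluencyEncoder()

# Pre-training on the joint tagging + sentence-classification objectives:
pretrain_opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # mini-batches of 256, 30 epochs
# Fine-tuning on gold disfluency detection data:
finetune_opt = torch.optim.Adam(model.parameters(), lr=1e-5)  # batch size 32, 20 epochs

# Example forward pass over a batch of two 128-token streams.
dummy = torch.randint(0, VOCAB_SIZE, (2, MAX_LEN))
tag_logits, sent_logits = model(dummy)   # shapes: (2, 128, 2) and (2, 2)
```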