MASKER: Masked Keyword Regularization for Reliable Text Classification

Authors: Seung Jun Moon, Sangwoo Mo, Kimin Lee, Jaeho Lee, Jinwoo Shin (pp. 13578-13586)

AAAI 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | When applied to various pre-trained language models (e.g., BERT, RoBERTa, and ALBERT), we demonstrate that MASKER improves OOD detection and cross-domain generalization without degrading classification accuracy. Code is available at https://github.com/alinlab/MASKER.
Researcher Affiliation | Academia | 1Korea Advanced Institute of Science and Technology, South Korea; 2University of California, Berkeley, USA
Pseudocode | No | The paper refers to "the overall procedure is in Appendix B," but the provided text does not include Appendix B, and no pseudocode or algorithm blocks are present in the main body.
Open Source Code | Yes | Code is available at https://github.com/alinlab/MASKER.
Open Datasets | Yes | We conduct OOD detection experiments on 20 Newsgroups (Lang 1995), Amazon 50-class reviews (Chen and Liu 2014), Reuters (Lewis et al. 2004), IMDB (Maas et al. 2011), SST-2 (Socher et al. 2013), and Fine Food (McAuley and Leskovec 2013) datasets, and cross-domain generalization experiments on sentiment analysis (Maas et al. 2011; Socher et al. 2013; McAuley and Leskovec 2013), natural language inference (Williams, Nangia, and Bowman 2017), and semantic textual similarity (Wang et al. 2019) tasks.
Dataset Splits | Yes | We use 25% of classes as in-distribution, and the rest as OOD. ... Figure 5: t-SNE plots on the document embeddings of BERT and MASKER, on (a,b) OOD detection (Amazon 50-class reviews with split ratio 25%), and (c,d) cross-domain generalization (Fine Food to SST-2).
Hardware Specification | No | The paper does not specify the hardware used for experiments, such as CPU or GPU models, or cloud computing instances.
Software Dependencies | No | The paper mentions using pre-trained models like BERT, RoBERTa, and ALBERT, but it does not specify software versions (e.g., PyTorch 1.x, TensorFlow 2.x, CUDA 11.x).
Experiment Setup | Yes | We choose 10·C keywords in a class-agnostic way, where C is the number of classes. We drop the keywords and contexts with probability p = 0.5 and q = 0.9 for all our experiments. We use λMKR = 0.001 and λMER = 0.001 for OOD detection, and the same λMKR = 0.001 but λMER = 0.0001 for cross-domain generalization, as the entropy regularization gives more gain for reliability than accuracy (Pereyra et al. 2017). We modify the hyperparameter settings of the pre-trained models (Devlin et al. 2019; Liu et al. 2019), specified in Appendix A.
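The Dataset Splits row describes treating 25% of classes as in-distribution and the remainder as OOD (e.g., the Amazon 50-class reviews setup). A minimal sketch of such a class-level split is below; the function name and the use of a seeded shuffle are our assumptions for illustration, not the authors' code.

```python
import random

def split_classes(num_classes: int, in_dist_ratio: float = 0.25, seed: int = 0):
    """Partition class labels: `in_dist_ratio` of them become in-distribution,
    the rest are held out as OOD (hypothetical sketch of the described split)."""
    rng = random.Random(seed)
    labels = list(range(num_classes))
    rng.shuffle(labels)                         # randomize which classes are ID
    cut = max(1, int(num_classes * in_dist_ratio))
    return labels[:cut], labels[cut:]           # (in-distribution, OOD)

# Amazon 50-class reviews with split ratio 25%: 12 ID classes, 38 OOD classes
in_dist, ood = split_classes(50)
```

With a fixed seed the split is reproducible, which matters when comparing OOD detection scores across models.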
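The Experiment Setup row quotes the masking probabilities p = 0.5 (keywords) and q = 0.9 (contexts). A rough sketch of token-level dropping with these probabilities is shown below; the function, the `[MASK]` placeholder, and the exact masking scheme are our assumptions, since the paper's actual procedure is in its Appendix B.

```python
import random

def mask_tokens(tokens, keywords, p=0.5, q=0.9, mask_token="[MASK]", seed=0):
    """Hypothetical sketch: build a keyword-masked view (keywords dropped with
    probability p) and a context-masked view (non-keywords dropped with
    probability q), as the quoted setup describes."""
    rng = random.Random(seed)
    masked_kw = [mask_token if t in keywords and rng.random() < p else t
                 for t in tokens]
    masked_ctx = [t if t in keywords or rng.random() >= q else mask_token
                  for t in tokens]
    return masked_kw, masked_ctx

mk, mc = mask_tokens(["a", "great", "film"], {"great"})
```

The two masked views correspond to the two regularizers weighted by λMKR and λMER in the quoted setup.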