Merging Statistical Feature via Adaptive Gate for Improved Text Classification

Authors: Xianming Li, Zongxi Li, Haoran Xie, Qing Li (pp. 13288-13296)

AAAI 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments on datasets of various scales show that, by incorporating statistical information, AGN can improve the classification performance of CNN, RNN, Transformer, and Bert based models effectively.
Researcher Affiliation | Collaboration | Xianming Li, 1 Ant Group, Shanghai, China; 2 Department of Computer Science, City University of Hong Kong, Hong Kong SAR; 3 Department of Computing and Decision Sciences, Lingnan University, Hong Kong SAR; 4 Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR
Pseudocode | No | No pseudocode or clearly labeled algorithm block was found in the paper.
Open Source Code | Yes | Code available at https://github.com/4AI/AGN
Open Datasets | Yes | We test the proposed model on the following datasets (with summary statistics in Table 2). Subj (Pang and Lee 2004) is a dataset of subjectivity. SST-1 (Socher et al. 2013) is the Stanford Sentiment Treebank dataset... TREC (Li and Roth 2002)... AG's News (Zhang, Zhao, and LeCun 2015)... Yelp Review Full (Yelp F.)
Dataset Splits | Yes | Subj (Pang and Lee 2004)... We deploy 10-fold cross-validation on the dataset without standard train/test split (i.e., Subj). For datasets with standard split, we run ten trials and report the average results. A sketch of this evaluation protocol appears after the table.
Hardware Specification | Yes | a CNN+AGN only requires 3,250 additional parameters and 0.13 seconds more per epoch on training time, compared with a standard Text CNN (on SST-2 with an RTX 2080 Ti GPU).
Software Dependencies | No | No specific software dependencies with version numbers (e.g., Python 3.8, PyTorch 1.9) were explicitly stated.
Experiment Setup | Yes | The CNN-based models have a filter size of [3, 4, 5] with 100 filters each, and the RNN-based models have a hidden dimension of 128. For the Transformer, we use an encoder with 8 heads and 3 blocks. The employed Bert model is the Bert-base Uncased, including 12 layers, 768 hidden units, and 110M parameters. We adopt Adam optimizer with a batch size of 64 for non-Bert models and 16 for Bert models. The dropout rate is set to 0.5. An illustrative configuration sketch follows below the table.
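The split protocol quoted in the Dataset Splits row can be sketched roughly as follows. This is an illustrative reconstruction only: `build_and_train` and `evaluate` are hypothetical placeholders rather than functions from the AGN repository, and the data are assumed to be NumPy arrays.

```python
# Illustrative sketch of the reported evaluation protocol:
# 10-fold cross-validation for Subj (no standard split) and ten averaged
# trials for datasets that ship with a train/test split.
import numpy as np
from sklearn.model_selection import KFold


def cross_validate(texts, labels, build_and_train, evaluate, n_splits=10, seed=42):
    """Average score over a 10-fold split (used for Subj)."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(texts):
        model = build_and_train(texts[train_idx], labels[train_idx])
        scores.append(evaluate(model, texts[test_idx], labels[test_idx]))
    return float(np.mean(scores))


def repeated_trials(train_set, test_set, build_and_train, evaluate, n_trials=10):
    """Average score over ten independent runs (used for standard splits)."""
    scores = [evaluate(build_and_train(*train_set), *test_set) for _ in range(n_trials)]
    return float(np.mean(scores))
```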
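The non-BERT hyperparameters in the Experiment Setup row (filter sizes [3, 4, 5] with 100 filters each, dropout 0.5, Adam, batch size 64) correspond to a standard TextCNN backbone. The sketch below is a minimal Keras reconstruction of such a baseline, not the authors' AGN implementation; vocabulary size, embedding dimension, sequence length, and class count are assumed placeholder values.

```python
# Minimal TextCNN sketch using the hyperparameters reported above.
# VOCAB_SIZE, EMBED_DIM, MAX_LEN, and NUM_CLASSES are assumptions, not
# values taken from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE, EMBED_DIM, MAX_LEN, NUM_CLASSES = 20000, 300, 100, 2  # assumed

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# One convolution branch per filter size [3, 4, 5], 100 filters each,
# global-max-pooled and concatenated.
branches = []
for k in (3, 4, 5):
    conv = layers.Conv1D(filters=100, kernel_size=k, activation="relu")(x)
    branches.append(layers.GlobalMaxPooling1D()(conv))
x = layers.Concatenate()(branches)

x = layers.Dropout(0.5)(x)  # dropout rate reported in the paper
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",  # Adam optimizer as reported
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Training would use the reported batch size for non-BERT models, e.g.:
# model.fit(x_train, y_train, batch_size=64, epochs=..., validation_data=...)
```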